UNIX join command bug

Guillaume Smits Thu, 21 Aug 2008 12:07:28 -0700

Dear GNU,


I have two files with exactly the same layout, each composed of:

6 fields, tab-separated, with a \n at the end of each line, sorted
numerically on the key identifier (field #2).


Here is the head of the files:


File1

CHR     SNP     A1      A2      MAF     NCHROBS
13      rs4     G       A       0.0648148       216
7       rs8     T       C       0.166667        216
7       rs16    T       C       0.475962        208
...


File2

CHR     SNP     A1      A2      MAF     NCHROBS
7       rs8     A       G       0.215674        9876
7       rs16    G       A       0.477102        9870
7       rs19    G       A       0.385628        9880
...



The first file is ~ 1,400,000 lines long

The second file is ~ 330,000 lines long


There should be ~322,000 lines in common (i.e., with the same SNP
identifier - field #2).


When I perform a very simple join command as follows:

join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt


I obtain a joined file of ~213,000 lines instead of the expected
~322,000 lines (about 65% of the expected lines).

The missing lines are scattered throughout the original files (at the
beginning, middle and end). I also cannot find any pattern in the SNP
identifiers of the missing lines.



For example a line which is missing is the following one:

File 1

11      rs1535  G       A       0.348624        218


File 2

11      rs1535  G       A       0.440218        9886


As one can see, the key field identifier is identical (rs1535) hence
this line should be printed in the output.



I can't find any difference between the files (e.g., no hidden
characters) or the key identifiers. The files are sorted in the same
way, tabulated in the same way,...
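
A minimal sketch, in Python, of how the sort-order claim could be
double-checked. It assumes that join compares keys as plain strings
(character by character, as in the C locale) rather than as numbers; the
function name and the sample list are illustrative only, with the keys
taken from the head of File1.

```python
def first_out_of_order(keys):
    """Return the index of the first key that sorts before its
    predecessor under plain string comparison, or None if the keys
    are already in string order."""
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            return i
    return None


# Field #2 values from the head of File1, in file order (header skipped).
sample_keys = ["rs4", "rs8", "rs16"]
position = first_out_of_order(sample_keys)
if position is None:
    print("keys are in string order")
else:
    print("first out-of-order key:", sample_keys[position])
```

If such a check reports an out-of-order key, the files are sorted
numerically but not in the string order assumed above.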


The only difference is the number of lines (about 1.4 million in file 1;
about 330 thousand in file 2). While large, these line counts should not
be a limiting factor for the join command... (and why would the missing
lines be scattered all along the files?)


Using a Perl script to print the lines having the same field 2
identifier, I obtain the expected ~322,000 lines, which makes me nearly
certain that this is a join command bug.
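
The Perl script itself is not shown here. As a rough stand-in, a minimal
Python sketch of a lookup-table comparison on field 2 could look like
this. The function name and the sample rows are illustrative (rows taken
from the file heads above), and it assumes tab-separated lines with the
key in the second field. Unlike join, it does not depend on the order of
the lines.

```python
def match_on_key(lines1, lines2, key_index=1, sep="\t"):
    """Pair lines from two inputs that share the same key field,
    regardless of the order in which the lines appear."""
    # Index the second input by its key field.
    by_key = {}
    for line in lines2:
        fields = line.rstrip("\n").split(sep)
        by_key.setdefault(fields[key_index], []).append(fields)

    # Look up each line of the first input in that index.
    matched = []
    for line in lines1:
        fields = line.rstrip("\n").split(sep)
        for other in by_key.get(fields[key_index], []):
            matched.append((fields, other))
    return matched


file1_rows = [
    "13\trs4\tG\tA\t0.0648148\t216\n",
    "7\trs8\tT\tC\t0.166667\t216\n",
    "7\trs16\tT\tC\t0.475962\t208\n",
]
file2_rows = [
    "7\trs8\tA\tG\t0.215674\t9876\n",
    "7\trs16\tG\tA\t0.477102\t9870\n",
    "7\trs19\tG\tA\t0.385628\t9880\n",
]
pairs = match_on_key(file1_rows, file2_rows)
print(len(pairs), "lines in common")
```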



Question: Is there any trivial (or less trivial) explanation for this
join command bug?


Thanks for your help,

G



Guillaume Smits
Team 108 - Human Genetics Program
The Sanger Institute

Office: N3-33, Morgan Building
Tel: + 44 (0)1223 834244 (ext 8643)
Email: [EMAIL PROTECTED]



--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

