Dear GNU,
I have two files in exactly the same format: 6 fields, tab-separated, with a \n at the end of each line, sorted numerically on the key identifier (field #2). Here is the head of each file:

File1

    CHR  SNP   A1  A2  MAF        NCHROBS
    13   rs4   G   A   0.0648148  216
    7    rs8   T   C   0.166667   216
    7    rs16  T   C   0.475962   208
    ...

File2

    CHR  SNP   A1  A2  MAF       NCHROBS
    7    rs8   A   G   0.215674  9876
    7    rs16  G   A   0.477102  9870
    7    rs19  G   A   0.385628  9880
    ...

The first file is ~1,400,000 lines long and the second is ~330,000 lines long. There should be ~322,000 lines in common (i.e., with the same SNP identifier, field #2).

When I run a very simple join command:

    join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt

I obtain a joined file of ~213,000 lines instead of the expected ~322,000 (about 65% of them). The missing lines are scattered throughout the original files (beginning, middle and end), and I can see no pattern in the SNP identifiers of the missing lines. For example, the following line is missing:

File 1

    11  rs1535  G  A  0.348624  218

File 2

    11  rs1535  G  A  0.440218  9886

As one can see, the key field is identical (rs1535), so this line should be printed in the output.

I can't find any difference between the files (e.g., no hidden characters) or between the key identifiers. The files are sorted in the same way and tab-delimited in the same way. The only difference is the number of lines (1.4 million in file 1; about 300 thousand in file 2). Although large, these line counts should not be a limiting factor for the join command (and if they were, why would the missing lines be scattered throughout the files?).

Using a Perl script to print the lines that share the same field 2 identifier, I obtain the expected ~322,000 lines, which suggests that this is almost certainly a bug in the join command.

Question: is there any trivial (or less trivial) explanation for this join behaviour?
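
The cross-check mentioned above can be sketched roughly as follows. This is an illustrative awk equivalent, not the original Perl script; the function name common_snps is hypothetical, and it assumes tab-separated input with a single header line per file, as in the samples shown.

```shell
# Hypothetical helper: print the lines of the second file whose SNP
# identifier (field 2) also occurs in field 2 of the first file.
# Assumes tab-separated input with exactly one header line per file.
common_snps() {
    awk -F'\t' '
        FNR == 1  { next }                 # skip the header line of each file
        NR == FNR { seen[$2] = 1; next }   # first file: remember every SNP id
        $2 in seen                         # second file: print lines with a known id
    ' "$1" "$2"
}
```

Running `common_snps file1.txt file2.txt | wc -l` would then give the number of shared identifiers. The lookup goes through an in-memory table, so it does not depend on how either file is sorted.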

Thanks for your help,
G

Guillaume Smits
Team 108 - Human Genetics Program
The Sanger Institute
Office: N3-33, Morgan Building
Tel: + 44 (0)1223 834244 (ext 8643)

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils