On Thu, Aug 21, 2008 at 4:45 PM, Guillaume Smits <[EMAIL PROTECTED]> wrote:
> Dear GNU,
>
> I have two files exactly identical composed of:
>
> 6 Fields, tab separated, with a /n

That would be \n - I assume you mean ASCII LF.

> at the end of the line, sorted
> numerically on the key identifier (field #2).
>
> Here is the head of the files:
>
> File1
>
> CHR SNP A1 A2 MAF NCHROBS
> 13 rs4 G A 0.0648148 216
> 7 rs8 T C 0.166667 216
> 7 rs16 T C 0.475962 208
> ...
>
> File2
>
> CHR SNP A1 A2 MAF NCHROBS
> 7 rs8 A G 0.215674 9876
> 7 rs16 G A 0.477102 9870
> 7 rs19 G A 0.385628 9880
> ...
>
> The first file is ~ 1,400,000 lines long
>
> The second file is ~ 330,000 lines long

You're not making it easy for people to help you. You don't indicate
what version of coreutils you are using. You don't provide a minimal
example. You just tell us you have two vast inputs you won't show us
that don't join in the way you expect.

> When I perform a very simple join command as follows:
>
> Join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt
>
> I obtain a joinedfile of ~213.000 lines in place of the expected
> ~322.000 lines (65% of the lines).
>
> The lines missing are scattered everywhere in the original files (at the
> beginning, middle or end). There is also no logic to find while
> considering the SNP identifier of the missing lines.
>
> For example a line which is missing is the following one:

This is not a helpful example; 99% of join problems are caused by
out-of-order input, and you haven't provided a complete example that
demonstrates the problem so that we can eliminate that possibility.

> I can't find any difference between the files (e.g., no hidden
> characters) or the key identifiers. The files are sorted in the same
> way, tabulated in the same way,...

My guess is that this is not actually the case.

> The only difference is the number of lines (1.4 million in file 1; 300
> thousands in file 2). While big, these line numbers should not be a
> limiting factor to the join command... (and why would be the missing
> line scattered all along the files?)
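
A minimal sketch of how the out-of-order possibility could be checked. join expects both inputs sorted on the join field in the collating order that plain sort produces, not in numeric order; under that ordering rs16 comes before rs4 and rs8, so files sorted numerically on the identifier are out of order as far as join is concerned. The two small files below are made up from the rows quoted above (header lines omitted), and LC_ALL=C and the temporary directory are assumptions for the sake of a reproducible example, not something taken from the original report.

```shell
# Sketch only: sample rows copied from the quoted message, headers omitted.
export LC_ALL=C                      # same byte-wise collation for sort and join
tab=$(printf '\t')
work=$(mktemp -d)

printf '13\trs4\tG\tA\t0.0648148\t216\n7\trs8\tT\tC\t0.166667\t216\n7\trs16\tT\tC\t0.475962\t208\n' > "$work/file1.txt"
printf '7\trs8\tA\tG\t0.215674\t9876\n7\trs16\tG\tA\t0.477102\t9870\n7\trs19\tG\tA\t0.385628\t9880\n' > "$work/file2.txt"

# sort -c only checks the order; it exits non-zero at the first disorder.
if sort -c -t "$tab" -k2,2 "$work/file1.txt" 2>/dev/null; then
    file1_status=sorted
else
    file1_status=unsorted
fi

# Re-sort both inputs on field 2 in the order join compares keys, then join.
sort -t "$tab" -k2,2 "$work/file1.txt" > "$work/file1.sorted"
sort -t "$tab" -k2,2 "$work/file2.txt" > "$work/file2.sorted"
joined=$(join -t "$tab" -1 2 -2 2 "$work/file1.sorted" "$work/file2.sorted")

echo "file1 is $file1_status"
echo "$joined"
rm -rf "$work"
```

With the real files, the header line would need to be set aside before sorting, and the same locale has to be in effect for both the sort and the join.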
>
> Using a Perl script to print lines having the same field 2 identifier, I
> obtain the ~322,000 lines expected proving that it is nearly surely a
> join command bug.
>
> Question: Is there any trivial (or less trivial) explanation to this
> join command bug?

Operator error? Try coreutils 6.11, which should notify you if the
input is out of order - see the Info documentation for details.

James.

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils