Dear James,

There was no need to use strong language like that in the answer I received below. Some of us are occasional users (thanks, Bob, for the apologising email).
Furthermore, you had all the information needed to solve the issue; see below.

Clue: I found a mail on the GNU mailing list from another user, Kevin, who encountered exactly the same problem as me (mail from Apr 2008) and received the following answer from Bob:

kevin wrote:
> I want to use join command with this 2 files :
> test1:
> 1 a
> 2 a
> 3 a
> 45 a
> 78 a
> 152 a
> 1896 a

The input files to join must be sorted. The above is not. Please see
this reference for more information.

http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#join-requires-sorted-input-files

Bob

Hence, why are those data not sorted? (The link given is totally useless for understanding why the data are not sorted.)

Answer: Like mine, these data were NUMERICALLY sorted (sort -n). But, as I found experimentally while trying to solve this issue, the join command needs the files to be alphanumerically sorted (= the default sort), and absolutely not numerically sorted.

Hence:

1. Because the sort command is a very versatile one, GNU could be more precise in their answers.
2. Suggestion: add the line 'Default (alpha-numerical) sort required (avoid sort -n)' to the join command manual and --help output to help future users.

Sincerely yours,
Guillaume

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of James Youngman
Sent: 21 August 2008 21:57
To: Guillaume
Cc: bug-coreutils@gnu.org
Subject: Re: UNIX join command bug

> Dear GNU,
>
> I have two files exactly identical composed of:
>
> 6 Fields, tab separated, with a /n

That would be \n - I assume you mean ASCII LF.

> at the end of the line, sorted
> numerically on the key identifier (field #2).
>
> Here is the head of the files:
>
> File1
>
> CHR SNP A1 A2 MAF NCHROBS
> 13 rs4 G A 0.0648148 216
> 7 rs8 T C 0.166667 216
> 7 rs16 T C 0.475962 208
> ...
>
> File2
>
> CHR SNP A1 A2 MAF NCHROBS
> 7 rs8 A G 0.215674 9876
> 7 rs16 G A 0.477102 9870
> 7 rs19 G A 0.385628 9880
> ...
>
> The first file is ~ 1,400,000 lines long
>
> The second file is ~ 330,000 lines long

You're not making it easy for people to help you. You don't indicate
what version of coreutils you are using. You don't provide a minimal
example. You just tell us you have two vast inputs you won't show us
that don't join in the way you expect.

> When I perform a very simple join command as follows:
>
> Join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt
>
> I obtain a joinedfile of ~213.000 lines in place of the expected
> ~322.000 lines (65% of the lines).
>
> The lines missing are scattered everywhere in the original files (at the
> beginning, middle or end). There is also no logic to find while
> considering the SNP identifier of the missing lines.
>
> For example a line which is missing is the following one:

This is not a helpful example; 99% of join problems are caused by
out-of-order input, and you haven't provided a complete example that
demonstrates the problem so that we can eliminate that possibility.

> I can't find any difference between the files (e.g., no hidden
> characters) or the key identifiers. The files are sorted in the same
> way, tabulated in the same way,...

My guess is that this is not actually the case.

> The only difference is the number of lines (1.4 million in file 1; 300
> thousands in file 2). While big, these line numbers should not be a
> limiting factor to the join command... (and why would be the missing
> line scattered all along the files?)
>
> Using a Perl script to print lines having the same field 2 identifier, I
> obtain the ~322,000 lines expected proving that it is nearly surely a
> join command bug.
>
> Question: Is there any trivial (or less trivial) explanation to this
> join command bug?

Operator error? Try coreutils 6.11, which should notify you if the
input is out of order - see the Info documentation for details.

James.
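For readers who want to reproduce the sort-order mismatch discussed in this thread, here is a minimal sketch. It is not taken from either poster: the file names, the file contents and the choice of the C locale are assumptions made for illustration. It joins two small files whose keys are in numeric order, then joins them again after re-sorting both on the join field with the default (lexical) comparison.

```shell
# Minimal sketch; file names and contents are made up for illustration.
export LC_ALL=C                      # assumption: one fixed locale for both sort and join
workdir=$(mktemp -d)

# Keys in numeric order (the order "sort -n" would produce).
printf '%s\n' '1 a' '2 a' '3 a' '10 a' '20 a' > "$workdir/f1.txt"
printf '%s\n' '2 b' '10 b' '20 b'             > "$workdir/f2.txt"

# Joining the numerically ordered files drops matches; newer coreutils
# may also print an ordering diagnostic on stderr, which is discarded here.
unsorted_count=$(join "$workdir/f1.txt" "$workdir/f2.txt" 2>/dev/null | wc -l | tr -d ' ')

# Re-sort both files on the join field with the default (lexical) comparison.
sort -k1,1 "$workdir/f1.txt" > "$workdir/f1.sorted"
sort -k1,1 "$workdir/f2.txt" > "$workdir/f2.sorted"
sorted_count=$(join "$workdir/f1.sorted" "$workdir/f2.sorted" | wc -l | tr -d ' ')

echo "matches without re-sorting: $unsorted_count"
echo "matches after re-sorting:   $sorted_count"
```

For files keyed on field 2, as in the original report, the same idea would presumably become `sort -k2,2` on each file followed by `join -1 2 -2 2`, with both commands run under the same locale.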