Dear James,

There was no need for strong language like that in the answer I received
(quoted below). Some of us are only occasional users (thanks to Bob for
the apologetic email).

Furthermore, you had all the information needed to solve the issue; see
below.


I found a message on the GNU mailing list from another user, Kevin, who
encountered exactly the same problem as I did (in April 2008) and
received the following answer from Bob:

kevin wrote:
> I want to use join command with this 2 files :

> test1:
> 1 a
> 2 a
> 3 a
> 45 a
> 78 a
> 152 a
> 1896 a

The input files to join must be sorted.  The above is not.
Please see this reference for more information.


So why are those data not sorted? (The link given is of no help in
understanding why they are not.)


Like mine, these data were NUMERICALLY sorted (sort -n).

But, as I found by experiment while trying to solve this issue, the join
command needs the files to be sorted alphanumerically (the default sort
order, which compares keys character by character, so that 152 comes
before 2), and not numerically.
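
A minimal sketch of what I mean, using the keys from Kevin's file. The
second file (test2), the scratch directory and the LC_ALL=C setting are
my own assumptions for illustration; pinning the locale simply makes
sure that sort and join use the same collating order.

```shell
#!/bin/sh
# Sketch only: test2 and the scratch directory are hypothetical.
export LC_ALL=C                     # same collation for sort and join
tmp=$(mktemp -d)

# Both files start out in numeric order, as "sort -n" would leave them.
printf '%s\n' '1 a' '2 a' '3 a' '45 a' '78 a' '152 a' '1896 a' > "$tmp/test1"
printf '%s\n' '1 b' '2 b' '3 b' '45 b' '78 b' '152 b' '1896 b' > "$tmp/test2"

# Re-sort on the join field with the default (character by character) order.
sort -k 1,1 "$tmp/test1" > "$tmp/test1.sorted"
sort -k 1,1 "$tmp/test2" > "$tmp/test2.sorted"

# join now pairs all seven keys; the output follows the lexical order
# 1, 152, 1896, 2, 3, 45, 78.
joined=$(join "$tmp/test1.sorted" "$tmp/test2.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

If the numeric order is needed afterwards, the joined output can be
passed through sort -n as a last step.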


1. Because the sort command is so versatile, GNU answers need to be
more precise about which sort order is required.

2. Suggestion: add the line 'Default (alphanumerical) sort required
(avoid sort -n)' to the join command manual and --help output to help
future users.

Sincerely yours,


> Dear GNU,
> I have two files exactly identical composed of:
> 6 Fields, tab separated, with a /n

That would be \n - I assume you mean ASCII LF.

> at the end of the line, sorted
> numerically on the key identifier (field #2).
> Here is the head of the files:
> File1
> CHR     SNP     A1      A2      MAF     NCHROBS
> 13      rs4     G       A       0.0648148       216
> 7       rs8     T       C       0.166667        216
> 7       rs16    T       C       0.475962        208
> ...
> File2
> CHR     SNP     A1      A2      MAF     NCHROBS
> 7       rs8     A       G       0.215674        9876
> 7       rs16    G       A       0.477102        9870
> 7       rs19    G       A       0.385628        9880
> ...
> The first file is ~ 1,400,000 lines long
> The second file is ~ 330,000 lines long

You're not making it easy for people to help you.    You don't
indicate what version of coreutils you are using.    You don't provide
a minimal example.   You just tell us you have two vast inputs you
won't show us that don't join in the way you expect.

> When I perform a very simple join command as follows:
> join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt
> I obtain a joinedfile of ~213,000 lines in place of the expected
> ~322,000 lines (65% of the lines).
> The missing lines are scattered throughout the original files (at the
> beginning, middle and end). I also cannot find any pattern in the SNP
> identifiers of the missing lines.
> For example a line which is missing is the following one:

This is not a helpful example; 99% of join problems are caused by
out-of-order input and you haven't provided a complete example that
demonstrates the problem so that we can eliminate that possibility.

> I can't find any difference between the files (e.g., no hidden
> characters) or the key identifiers. The files are sorted in the same
> way, tabulated in the same way,...

My guess is that this is not actually the case.

> The only difference is the number of lines (1.4 million in file 1; 300
> thousand in file 2). While big, these line counts should not be a
> limiting factor for the join command... (and why would the missing
> lines be scattered all along the files?)
> Using a Perl script to print lines having the same field 2 identifier,
> I obtain the ~322,000 lines expected, proving that it is almost surely
> a join command bug.
> Question: Is there any trivial (or less trivial) explanation to this
> join command bug?

Operator error?      Try coreutils 6.11, which should notify you if
the input is out of order - see the Info documentation for details.
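
Independently of the coreutils version, the order of an input can be
checked with sort -c before joining. The following is only a sketch
under stated assumptions: the rows are abridged from the sample data
quoted above with the header line left out, the scratch file names are
hypothetical, and LC_ALL=C is assumed so that sort and join agree on
the collating order.

```shell
#!/bin/sh
# Sketch only: abridged sample rows, hypothetical scratch files.
export LC_ALL=C                     # same collation for sort and join
tab=$(printf '\t')
tmp=$(mktemp -d)

printf '13\trs4\tG\tA\t0.0648148\t216\n7\trs8\tT\tC\t0.166667\t216\n7\trs16\tT\tC\t0.475962\t208\n' > "$tmp/file1.txt"
printf '7\trs8\tA\tG\t0.215674\t9876\n7\trs16\tG\tA\t0.477102\t9870\n7\trs19\tG\tA\t0.385628\t9880\n' > "$tmp/file2.txt"

# sort -c only checks the order: it exits non-zero if field 2 is not in
# the order join expects (rs8 followed by rs16 is out of order here).
if sort -c -t "$tab" -k 2,2 "$tmp/file1.txt" 2>/dev/null; then
    order1=sorted
else
    order1=unsorted
fi
echo "file1.txt is $order1 on field 2"

# Re-sort both inputs on field 2, then join on that field.
sort -t "$tab" -k 2,2 "$tmp/file1.txt" > "$tmp/file1.sorted"
sort -t "$tab" -k 2,2 "$tmp/file2.txt" > "$tmp/file2.sorted"
joined=$(join -t "$tab" -1 2 -2 2 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

With these abridged rows the two identifiers common to both files,
rs16 and rs8, are joined once the inputs have been re-sorted.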


