Thanks!  That is very helpful in that it confirms my sanity.  I didn't know 
about the locale stuff.

I have more info, as yesterday, I'd worked on getting around this issue (but 
hadn't found a complete solution yet).  I have it all working consistently now. 
 There are some interesting things I noted that may help explain things, since 
it *may*(?) not be entirely attributable to the sort strategies...

- It seemed as if the sort error was fatal on linux, as it never output 
anything, though when I added `set euxo pipefail`, join on both systems output 
nothing - and join on both systems exited non-zero (according to the command 
echo output).
- Adding `--nocheck-order` then allowed both OS's join to generate output and 
exit successfully, but the output on linux unexpectedly included the comment at 
the top of the file (which wasn't included in the macOS output, and caused 
problems downstream of the join command).

Then once I added a `grep -v "#"` to the input files that join consumed, join 
on both systems started generating the same output.

The upstream command (featureCounts) was reliably generating the same order 
output with the same number of lines, so we didn't really need the order check, 
otherwise, I would have added the locale setting you suggested - but it's 
definitely good to know.

As a sample, here is the top of one of the input files that was going into the 
join command, and all input files only differed by column 7:

```
# Program:featureCounts v2.0.6; Command:"featureCounts" "-p" "-T" "4" "-g" 
"peak" "-t" "peak" "-F" "GTF" "--fracOverlap" "0.3" "-a" 
"results/all_atac_peaks.gtf" "-o" 
"results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv" 
"results/sorted_atac_alignments/SRR17656980_19_60m_end.bam" 
Geneid  Chr     Start   End     Strand  Length  
results/sorted_atac_alignments/SRR17656980_19_60m_end.bam
peak1   19      9991    10388   .       398     22
peak2   19      13446   13578   .       133     1
peak3   19      18765   19285   .       521     1072
peak4   19      19695   19840   .       146     457
peak5   19      24499   24928   .       430     1086
peak6   19      33839   34320   .       482     169
peak7   19      34412   34567   .       156     36
peak8   19      34653   35197   .       545     212
```

Rob

Robert William Leach
133 Carl C. Icahn Lab
Lewis-Sigler Institute for Integrative Genomics
Princeton University
Princeton, NJ 08544

> On Jul 11, 2023, at 5:48 AM, Pádraig Brady <p...@draigbrady.com> wrote:
> 
> On 10/07/2023 22:26, Robert Leach via GNU coreutils General Discussion wrote:
>> Hi,
>> I wanted to ask about the `join` utility in `coreutils` 9.3.  I'm building a 
>> snakemake workflow and am debugginbg an error that only occurs when the 
>> workflow is run on a linux system.  I have narrowed the difference down to 
>> the `join` utility provided by the `coreutils` conda package.  An error is 
>> produced on both systems, but since my script had not set `set -euxo 
>> pipefail`, the error was silent.  On linux, this produced an error in the 
>> workflow rule that executes after the one that uses the join utility, 
>> because the input file was empty.
>> So I manually ran the join command and noticed the difference in behavior on:
>> macOS:
>> ```
>> (coreutils) gen-rl-imac[2023-07-10 
>> 17:01:59]:...CT-LOCAL/YURI/ATACC/REPOS/ATACCompendium$ join -1 1 -2 1 -o 
>> 1.1,1.7,2.7 -t ' ' 
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
>>  
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
>> Geneid       results/sorted_atac_alignments/SRR17656980_19_60m_end.bam       
>> results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
>> peak1        22      28
>> peak2        1       12
>> peak3        1072    1637
>> peak4        457     942
>> peak5        1086    1507
>> peak6        169     67
>> peak7        36      85
>> peak8        212     198
>> join: 
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
>>  is not sorted: peak10  19      39038   39248   .       211     194
>> join: 
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:12:
>>  is not sorted: peak10  19      39038   39248   .       211     228
>> peak9        39      34
>> peak10       194     228
>> peak11       2178    2778
>> ...
>> join: input is not in sorted order
>> ```
>> and linux:
>> ```
>> (coreutils) [rleach@argo-comp2 ATACCompendium]$ join -1 1 -2 1 -o 
>> 1.1,1.7,2.7 -t '   ' 
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv
>>  
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv
>> join: 
>> .tests/test_1/results/counts/peaks/raw/individual/SRR17656980_19_60m_end_counts.tsv:12:
>>  is not sorted: peak10  19      39038   39248   .       211     194
>> join: 
>> .tests/test_1/results/counts/peaks/raw/individual/SRR13509617_19_60m_end_counts.tsv:2:
>>  is not sorted: Geneid   Chr     Start   End     Strand  Length  
>> results/sorted_atac_alignments/SRR13509617_19_60m_end.bam
>> join: input is not in sorted order
>> ```
>> Is this a bug in either the macOS or linux versions of the coreutils join 
>> utility, a known issue, or what?
> 
> Well the output from join(1) is giving ample clues
> that the input files aren't sorted appropriately.
> 
> Details:
> 
> The above should be warnings and not impact the exit status of the join 
> process.
> 
> The difference in output from Linux and MacOS is probably due to locale 
> settings.
> Note how "Geneid" is the first disorder on your Linux system, which suggests
> MacOS is using the C locale, while your Linux system is using en_US or 
> equivalent.
> So you may get better consistency with the join --header option,
> and that may be enough to address all your issues.
> 
> If --header doesn't suffice, you may need to `LC_ALL=C sort -k1.5n` your 
> input files
> before passing to join.
> 
> If that doesn't suffice, you may get desired operation with the 
> --nocheck-order option.
> 
> cheers,
> Pádraig

Reply via email to