On Sat, Feb 18, 2017 at 5:39 AM, Cook, Malcolm <[email protected]> wrote:
> I don't think my needs were clear.

Your needs were clear and I am really surprised that you did not
understand the solution I proposed.

> I know you are bioinformatics savvy and are familiar with bedtools, so let me
> cast my example in terms of bedtools.
>
> I have a huge sorted bedfile, my.bed, that I want to pipe into bedtools merge
> (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html)
>
> As required, it is sorted already.
>
> I could
>
> cat my.bed | parallel -j10 --pipe --block 50M bedtools merge
>
> but the blocks that my.bed get broken by parallel into might not keep
> together the chromosomes, but this is required for the merge to be correct.
>
> So I am looking for a means to instruct parallel that some ranges of records
> must stay together within a block.

Yup. You want each chromosome to be treated as a record.

So what you do is to insert a record separator before each chromosome
and tell GNU Parallel to use that as record separator.

Column 0 is the chromosome, so when that changes we insert '\0' which
will never be in a normal bedfile. Then we ask GNU Parallel to split
records on \0 and remove the \0 before passing it to bedtools.

  cat my.bed |
    perl -ape '$F[0] ne $old and print "\0"; $old = $F[0]' |
    parallel --recend '\0' --rrs --pipe --block 50M -j10 bedtools merge

The only thing I have changed from my previous email is:

  example -> my.bed
  $F[1] -> $F[0]
  --block 200 -> --block 50M
  wc -> bedtools merge

and added -j10.

I have the feeling you are now saying *DOH*.

/Ole
