On Wed, Aug 17, 2011 at 1:13 PM, Nathan Watson-Haigh <[email protected]> wrote:
>
> I just found this tool today and I'm very impressed with it!
Good to see a fellow bioinformatician finds it useful.

If you like GNU Parallel:
* Post the intro videos on forums/blogs/Google+/Twitter/Facebook/Linkedin
* Request or write a review for your favourite blog or magazine
* Invite me as a speaker for your next conference

If GNU Parallel saves you money:
* (Have your company) donate to FSF https://my.fsf.org/donate/

> I've seen the example for parallelising sort....is there a way to spread the
> merge part among more cores? I'm trying to do a sort over 200 million lines -
> the merge part takes several minutes to complete.
>
> How do I write a script for this so I can use it as a drop-in replacement for
> a sort command used in the pipeline of another software tool?

That is an interesting idea.

If your 200M lines are in a single file, you can use 'parallel --pipe'. This will distribute the merge to all cores; only the final merge will be done by one core. If bigfile is 1G, you will in total read 3G and write 3G (which is clearly not optimal):

  cat bigfiles* | parallel --pipe --files sort |
    parallel --files -X sort -m {} ';' rm {} |
    parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort

However, --pipe is somewhat slow (less than 100 MB/s) and may thus be a bottleneck. So for an optimal solution we probably need to split the problem into several situations, depending on:

  Number of files: nfiles (1 if reading from stdin (standard input))
  Total size of the files: sfiles
  Memory available: RAM
  Number of processes that UNIX allows to be run in parallel: npar
  Number of processors: nproc
  Number of filenames that fit on the command line: ncmd

If sfiles < RAM, we should read everything into RAM and process it there without temporary files:

  * If nfiles > nproc and nfiles < npar, we can run one process per file and do the merge on the fly.
  * If nfiles > nproc and nfiles > npar, we group the nfiles into npar groups. Each group is sorted sequentially, but the groups are run in parallel. The output is merged on the fly.
  * If nfiles < nproc, the content of the files needs to be split into nproc (or more) chunks.

If sfiles > RAM, we need to use temporary files. It should be possible to write each temporary file only once (so in total, to sort 1G you need to read 2G and write 2G):

  * If nfiles > nproc and nfiles < npar, we can run one process per file, saving each result to a temporary file, and finally merge all the files.
  * If nfiles > nproc and nfiles > npar, we group the nfiles into npar groups. Each group is sorted sequentially, but the groups are run in parallel with the output of each group saved to a temporary file. After the sort the tempfiles are merged.
  * If nfiles < nproc, the content of the files needs to be split into nproc (or more) chunks. Each chunk is then sorted, saved to a tempfile, and then merged.
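Whether you are in the sfiles < RAM or the sfiles > RAM situation of course depends on your system. As a minimal sketch (assuming GNU/Linux with GNU coreutils; the exact commands are only rough approximations and not something GNU Parallel provides), the numbers above could be estimated like this, with the files to sort given as arguments:

#!/bin/bash
# Rough estimates of the quantities discussed above (GNU/Linux assumed).
nfiles=$#                                                # number of input files
sfiles=$(du -cb -- "$@" | tail -n 1 | awk '{print $1}')  # total size in bytes
ram=$(awk '/MemTotal/ {print $2*1024}' /proc/meminfo)    # total RAM in bytes (approximation of memory available)
npar=$(ulimit -u)                                        # max processes the user may run in parallel
ncpu=$(parallel --number-of-cores)                       # number of CPU cores
argmax=$(getconf ARG_MAX)                                # max bytes on a command line; ncmd is roughly this divided by the average filename length
echo "nfiles=$nfiles sfiles=$sfiles RAM=$ram npar=$npar nproc=$ncpu ARG_MAX=$argmax"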
The script below tries to let 'sort' deal with the problem of temporary files. It should scale up to npar*ncmd files. It reads a list of files to sort on stdin (standard input), so it does not work for data on a pipe. I am sure it also has other bugs, but it might be OK for your purpose.

#!/bin/bash
DIR=$(mktemp -d /tmp/parsort.XXXXX)
cat >$DIR/filenames
# Create a bunch of fifos for sorting into
cat $DIR/filenames | parallel -Xj0 mkfifo $DIR/sort-{#}
# Start the merge sort into the fifos
cat $DIR/filenames | parallel -Xj0 sort {} \>$DIR/sort-{#} &
# Create a bunch of fifos for merging into
parallel -X mkfifo $DIR/merge-{#} ::: $DIR/sort-*
# Multilevel merge
parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} ::: $DIR/sort-* &
# Read and merge from the fifos
sort --batch-size=1000 -m $DIR/merge-*
# Remove the fifos
rm -rf $DIR

For sorting the output from a pipe we need to split the pipe with --pipe. Unfortunately we do not know how big the input is, so we do not know how many chunks we will get. It would be handy if GNU Parallel could distribute the input among N processes, so that when one process finishes reading, that same process gets another chunk. Unfortunately that is not how GNU Parallel works today.

#!/bin/bash
DIR=$(mktemp -d /tmp/parsort.XXXXX)
parallel --block 100M --pipe --files sort -S110000000 > $DIR/sort-files
# Create a bunch of fifos for merging into
parallel -X mkfifo $DIR/merge-{#} :::: $DIR/sort-files
# Multilevel merge
parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} :::: $DIR/sort-files &
# Read and merge from the fifos
sort --batch-size=1000 -m $DIR/merge-*
# Remove the fifos and the tempfiles
rm -rf $(cat $DIR/sort-files) $DIR

None of the solutions above is perfect for all situations, but for some situations they should do a reasonable job. In my tests they are both faster than GNU sort.
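For the drop-in replacement part of your question, a minimal sketch: assuming the second script is saved as parsort-pipe.sh (the script and tool names here are just placeholders) and made executable, it reads the data to sort on stdin and writes the sorted result to stdout, so it can be called wherever the other tool's pipeline would otherwise call sort:

  chmod +x parsort-pipe.sh
  # instead of: some_tool | sort > output.txt
  some_tool | ./parsort-pipe.sh > output.txt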
/Ole