Yes, but the latest GNU coreutils provides a parallelized version of sort.

Reading http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html:

  '--parallel=n'
      Set the number of sorts run in parallel to n. By default, n is set
      to the number of available processors, but limited to 8, as there
      are diminishing performance gains after that. Note also that using
      n threads increases the memory usage by a factor of log n. Also
      see nproc invocation.
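For example, a minimal invocation (the file name and the 2G buffer size are
only illustrative; --parallel and -S are the GNU sort options described
above):

  # up to 8 parallel sort threads with a 2G in-memory buffer
  sort --parallel=8 -S 2G bigfile > bigfile.sorted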
~Malcolm

> -----Original Message-----
> From: [email protected] [mailto:parallel-[email protected]]
> On Behalf Of Ole Tange
> Sent: Friday, August 19, 2011 8:25 AM
> To: Nathan Watson-Haigh
> Cc: [email protected]
> Subject: Re: Parallel Merge
>
> On Wed, Aug 17, 2011 at 1:13 PM, Nathan Watson-Haigh
> <[email protected]> wrote:
> >
> > I just found this tool today and I'm very impressed with it!
>
> Good to see a fellow bioinformatician finds it useful. If you like GNU
> Parallel:
>
> * Post the intro videos on forums/blogs/Google+/Twitter/Facebook/Linkedin
> * Request or write a review for your favourite blog or magazine
> * Invite me as a speaker for your next conference
>
> If GNU Parallel saves you money:
>
> * (Have your company) donate to FSF https://my.fsf.org/donate/
>
> > I've seen the example for parallelising sort... is there a way to
> > spread the merge part among more cores? I'm trying to do a sort over
> > 200 million lines - the merge part takes several minutes to complete.
> >
> > How do I write a script for this so I can use it as a drop-in
> > replacement for a sort command used in the pipeline of another
> > software tool?
>
> That is an interesting idea. If your 200M lines are in a single file,
> you can use 'parallel --pipe'. This will distribute the merge to all
> cores. Only the final merge will be done by one core. If bigfile is 1G
> you will in total read 3G and write 3G (which is clearly not optimal):
>
>   cat bigfiles* | parallel --pipe --files sort |
>     parallel --files -X sort -m {} ';' rm {} |
>     parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
>
> However, --pipe is somewhat slow (less than 100 MB/s), and thus may be
> a bottleneck. So for an optimal solution we probably need to split up
> into several situations:
>
> Number of files: nfiles (1 if reading from stdin (standard input))
> Total size of files: sfiles
> Memory available: RAM
> Number of processes that UNIX allows to be run in parallel: npar
> Number of processors: nproc
> Number of filenames that fit on the command line: ncmd
>
> If sfiles < RAM then we should read everything into RAM and process it
> there without temporary files:
> * If nfiles > nproc and nfiles < npar, we can run one process per file
>   and do the merge on the fly.
> * If nfiles > nproc and nfiles > npar, we group the nfiles into npar
>   groups. Each group is sorted sequentially, but the groups are run in
>   parallel. The output is merged on the fly.
> * If nfiles < nproc, the content of the files needs to be split into
>   nproc (or more) chunks.
>
> If sfiles > RAM then we need to use temporary files. It should be
> possible to write a temporary file only once (so in total, to sort 1G
> you need to read 2G and write 2G):
> * If nfiles > nproc and nfiles < npar, we can run one process per file,
>   saving each result to a temporary file, and finally do the merge of
>   all the files.
> * If nfiles > nproc and nfiles > npar, we group the nfiles into npar
>   groups. Each group is sorted sequentially, but the groups are run in
>   parallel, with the output saved to a temporary file for each group.
>   After the sort, the tempfiles are merged.
> * If nfiles < nproc, the content of the files needs to be split into
>   nproc (or more) chunks. Each chunk is then sorted, saved to a
>   tempfile and then merged.
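A minimal sketch of the "one sort process per file, merge on the fly" case,
assuming three already-split input files (the names chunk1..chunk3 are
hypothetical); bash process substitution lets sort -m consume the sorted
streams directly:

  # one sort per input file; sort -m merges the already-sorted streams on the fly
  sort -m <(sort chunk1) <(sort chunk2) <(sort chunk3) > merged.txt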
> The below tries to let 'sort' deal with the problem of temporary
> files. It should scale up to npar*ncmd files. It reads a list of
> files to sort on stdin (standard input). Thus it does not work for
> data on a pipe. I am sure it also has other bugs, but it might be OK
> for your purpose.
>
> #!/bin/bash
>
> DIR=$(mktemp -d /tmp/parsort.XXXXX)
> cat >$DIR/filenames
> # Create a bunch of fifos for sorting into
> cat $DIR/filenames | parallel -Xj0 mkfifo $DIR/sort-{#}
> # Start the sorts into the fifos
> cat $DIR/filenames | parallel -Xj0 sort {} \>$DIR/sort-{#} &
>
> # Create a bunch of fifos for merging into
> parallel -X mkfifo $DIR/merge-{#} ::: $DIR/sort-*
> # Multilevel merge
> parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} ::: $DIR/sort-* &
> # Read and merge from the fifos
> sort --batch-size=1000 -m $DIR/merge-*
> # Remove the fifos
> rm -rf $DIR
>
> For sorting the output from a pipe we need to split the pipe with
> --pipe. Unfortunately we do not know how big the input is, so we do
> not know how many chunks we have. It would be handy if GNU Parallel
> could distribute the input among N processes so that when one process
> has finished reading, the same process would get another chunk.
> Unfortunately that is not how GNU Parallel works today.
>
> #!/bin/bash
>
> DIR=$(mktemp -d /tmp/parsort.XXXXX)
> # Sort 100M blocks from stdin; --files writes the tempfile names to stdout
> parallel --block 100M --pipe --files sort -S110000000 >$DIR/sort-files
>
> # Create a bunch of fifos for merging into
> parallel -X mkfifo $DIR/merge-{#} :::: $DIR/sort-files
> # Multilevel merge
> parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} :::: $DIR/sort-files &
> # Read and merge from the fifos
> sort --batch-size=1000 -m $DIR/merge-*
> # Remove the fifos and the tempfiles
> rm -rf $(cat $DIR/sort-files) $DIR
>
> None of the solutions above is perfect for all situations, but for
> some situations they should do a reasonable job. In my tests they are
> both faster than GNU sort.
>
>
> /Ole
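As a usage sketch for the drop-in-replacement question, assuming the second
script above is saved as parsort-pipe.sh and made executable (the
surrounding tool names are hypothetical):

  chmod +x parsort-pipe.sh
  # stands in for a plain 'sort' between two stages of an existing pipeline
  upstream_tool input.txt | ./parsort-pipe.sh > sorted-output.txt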
