Yes, but the latest GNU coreutils provides a parallelized version of sort.

Reading http://www.gnu.org/s/coreutils/manual/html_node/sort-invocation.html:

  '--parallel=n'
      Set the number of sorts run in parallel to n. By default, n is set
      to the number of available processors, but limited to 8, as there
      are diminishing performance gains after that. Note also that using
      n threads increases the memory usage by a factor of log n. Also
      see nproc invocation.
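For example, a minimal invocation (the file name and the 2G buffer size are
only illustrative; --parallel and -S are the GNU sort options described
above):

  # up to 8 parallel sort threads with a 2G in-memory buffer
  sort --parallel=8 -S 2G bigfile > bigfile.sorted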
~Malcolm

> -----Original Message-----
> From: [email protected] [mailto:parallel-[email protected]]
> On Behalf Of Ole Tange
> Sent: Friday, August 19, 2011 8:25 AM
> To: Nathan Watson-Haigh
> Cc: [email protected]
> Subject: Re: Parallel Merge
>
> On Wed, Aug 17, 2011 at 1:13 PM, Nathan Watson-Haigh
> <[email protected]> wrote:
> >
> > I just found this tool today and I'm very impressed with it!
>
> Good to see a fellow bioinformatician finds it useful. If you like GNU
> Parallel:
>
> * Post the intro videos on forums/blogs/Google+/Twitter/Facebook/Linkedin
> * Request or write a review for your favourite blog or magazine
> * Invite me as a speaker for your next conference
>
> If GNU Parallel saves you money:
>
> * (Have your company) donate to FSF https://my.fsf.org/donate/
>
> > I've seen the example for parallelising sort... is there a way to
> > spread the merge part among more cores? I'm trying to do a sort over
> > 200 million lines - the merge part takes several minutes to complete.
> >
> > How do I write a script for this so I can use it as a drop-in
> > replacement for a sort command used in the pipeline of another
> > software tool?
>
> That is an interesting idea. If your 200M lines are in a single file,
> you can use 'parallel --pipe'. This will distribute the merge to all
> cores. Only the final merge will be done by one core. If bigfile is 1G
> you will in total read 3G and write 3G (which is clearly not optimal):
>
>   cat bigfiles* | parallel --pipe --files sort |
>     parallel --files -X sort -m {} ';' rm {} |
>     parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
>
> However, --pipe is somewhat slow (less than 100 MB/s), and thus may be
> a bottleneck. So for an optimal solution we probably need to split up
> into several situations:
>
> Number of files: nfiles (1 if reading from stdin (standard input))
> Total size of files: sfiles
> Memory available: RAM
> Number of processes that UNIX allows to be run in parallel: npar
> Number of processors: nproc
> Number of filenames that fit on the command line: ncmd
>
> If sfiles < RAM then we should read everything into RAM and process it
> there without temporary files:
> * If nfiles > nproc and nfiles < npar, we can run one process per file
>   and do the merge on the fly.
> * If nfiles > nproc and nfiles > npar, we group the nfiles into npar
>   groups. Each group is sorted sequentially, but the groups are run in
>   parallel. The output is merged on the fly.
> * If nfiles < nproc, the content of the files needs to be split into
>   nproc (or more) chunks.
>
> If sfiles > RAM then we need to use temporary files. It should be
> possible to write a temporary file only once (so in total, to sort 1G
> you need to read 2G and write 2G):
> * If nfiles > nproc and nfiles < npar, we can run one process per file,
>   saving each result to a temporary file, and finally do the merge of
>   all the files.
> * If nfiles > nproc and nfiles > npar, we group the nfiles into npar
>   groups. Each group is sorted sequentially, but the groups are run in
>   parallel, with the output saved to a temporary file for each group.
>   After the sort, the tempfiles are merged.
> * If nfiles < nproc, the content of the files needs to be split into
>   nproc (or more) chunks. Each chunk is then sorted, saved to a
>   tempfile and then merged.
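A minimal sketch of the "one sort process per file, merge on the fly" case,
assuming three already-split input files (the names chunk1..chunk3 are
hypothetical); bash process substitution lets sort -m consume the sorted
streams directly:

  # one sort per input file; sort -m merges the already-sorted streams on the fly
  sort -m <(sort chunk1) <(sort chunk2) <(sort chunk3) > merged.txt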
> The below tries to let 'sort' deal with the problem of temporary
> files. It should scale up to npar*ncmd files. It reads a list of
> files to sort on stdin (standard input). Thus it does not work for
> data on a pipe. I am sure it also has other bugs, but it might be OK
> for your purpose.
>
> #!/bin/bash
>
> DIR=$(mktemp -d /tmp/parsort.XXXXX)
> cat >$DIR/filenames
> # Create a bunch of fifos for sorting into
> cat $DIR/filenames | parallel -Xj0 mkfifo $DIR/sort-{#}
> # Start the sorts into the fifos
> cat $DIR/filenames | parallel -Xj0 sort {} \>$DIR/sort-{#} &
>
> # Create a bunch of fifos for merging into
> parallel -X mkfifo $DIR/merge-{#} ::: $DIR/sort-*
> # Multilevel merge
> parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} ::: $DIR/sort-* &
> # Read and merge from the fifos
> sort --batch-size=1000 -m $DIR/merge-*
> # Remove the fifos
> rm -rf $DIR
>
> For sorting the output from a pipe we need to split the pipe with
> --pipe. Unfortunately we do not know how big the input is, so we do
> not know how many chunks we have. It would be handy if GNU Parallel
> could distribute the input among N processes so that when one process
> has finished reading, the same process would get another chunk.
> Unfortunately that is not how GNU Parallel works today.
>
> #!/bin/bash
>
> DIR=$(mktemp -d /tmp/parsort.XXXXX)
> # Sort 100M blocks from stdin; --files writes the tempfile names to stdout
> parallel --block 100M --pipe --files sort -S110000000 >$DIR/sort-files
>
> # Create a bunch of fifos for merging into
> parallel -X mkfifo $DIR/merge-{#} :::: $DIR/sort-files
> # Multilevel merge
> parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} :::: $DIR/sort-files &
> # Read and merge from the fifos
> sort --batch-size=1000 -m $DIR/merge-*
> # Remove the fifos and the tempfiles
> rm -rf $(cat $DIR/sort-files) $DIR
>
> None of the solutions above is perfect for all situations, but for
> some situations they should do a reasonable job. In my tests they are
> both faster than GNU sort.
>
>
> /Ole
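As a usage sketch for the drop-in-replacement question, assuming the second
script above is saved as parsort-pipe.sh and made executable (the
surrounding tool names are hypothetical):

  chmod +x parsort-pipe.sh
  # stands in for a plain 'sort' between two stages of an existing pipeline
  upstream_tool input.txt | ./parsort-pipe.sh > sorted-output.txt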
