On Wed, Aug 17, 2011 at 1:13 PM, Nathan Watson-Haigh <[email protected]> wrote:
>
> I just found this tool today and I'm very impressed with it!
Good to see a fellow bioinformatician finds it useful.

If you like GNU Parallel:
* Post the intro videos on forums/blogs/Google+/Twitter/Facebook/Linkedin
* Request or write a review for your favourite blog or magazine
* Invite me as a speaker for your next conference

If GNU Parallel saves you money:
* (Have your company) donate to FSF https://my.fsf.org/donate/

> I've seen the example for parallelising sort....is there a way to spread the
> merge part among more cores? I'm trying to do a sort over 200 million lines -
> the merge part takes several minutes to complete.
>
> How do I write a script for this so I can use it as a drop-in replacement for
> a sort command used in the pipeline of another software tool?

That is an interesting idea.

If your 200M lines are in a single file, you can use 'parallel --pipe'. This will distribute the merge to all cores; only the final merge will be done by one core. If bigfile is 1G, you will in total read 3G and write 3G (which is clearly not optimal):

  cat bigfiles* | parallel --pipe --files sort |
    parallel --files -X sort -m {} ';' rm {} |
    parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort

However, --pipe is somewhat slow (less than 100 MB/s) and may thus be a bottleneck. So for an optimal solution we probably need to split the problem into several situations, depending on:

  Number of files: nfiles (1 if reading from stdin (standard input))
  Total size of the files: sfiles
  Memory available: RAM
  Number of processes that UNIX allows to be run in parallel: npar
  Number of processors: nproc
  Number of filenames that fit on the command line: ncmd

If sfiles < RAM, we should read everything into RAM and process it there without temporary files:

  * If nfiles > nproc and nfiles < npar, we can run one process per file and do the merge on the fly.
  * If nfiles > nproc and nfiles > npar, we group the nfiles into npar groups. Each group is sorted sequentially, but the groups are run in parallel. The output is merged on the fly.
  * If nfiles < nproc, the content of the files needs to be split into nproc (or more) chunks.

If sfiles > RAM, we need to use temporary files. It should be possible to write each temporary file only once (so in total, to sort 1G you need to read 2G and write 2G):

  * If nfiles > nproc and nfiles < npar, we can run one process per file, saving each result to a temporary file, and finally merge all the files.
  * If nfiles > nproc and nfiles > npar, we group the nfiles into npar groups. Each group is sorted sequentially, but the groups are run in parallel with the output of each group saved to a temporary file. After the sort the tempfiles are merged.
  * If nfiles < nproc, the content of the files needs to be split into nproc (or more) chunks. Each chunk is then sorted, saved to a tempfile, and then merged.
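Whether you are in the sfiles < RAM or the sfiles > RAM situation of course depends on your system. As a minimal sketch (assuming GNU/Linux with GNU coreutils; the exact commands are only rough approximations and not something GNU Parallel provides), the numbers above could be estimated like this, with the files to sort given as arguments:

#!/bin/bash
# Rough estimates of the quantities discussed above (GNU/Linux assumed).
nfiles=$#                                                # number of input files
sfiles=$(du -cb -- "$@" | tail -n 1 | awk '{print $1}')  # total size in bytes
ram=$(awk '/MemTotal/ {print $2*1024}' /proc/meminfo)    # total RAM in bytes (approximation of memory available)
npar=$(ulimit -u)                                        # max processes the user may run in parallel
ncpu=$(parallel --number-of-cores)                       # number of CPU cores
argmax=$(getconf ARG_MAX)                                # max bytes on a command line; ncmd is roughly this divided by the average filename length
echo "nfiles=$nfiles sfiles=$sfiles RAM=$ram npar=$npar nproc=$ncpu ARG_MAX=$argmax"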
The script below tries to let 'sort' deal with the problem of temporary files. It should scale up to npar*ncmd files. It reads a list of files to sort on stdin (standard input), so it does not work for data on a pipe. I am sure it also has other bugs, but it might be OK for your purpose.

#!/bin/bash
DIR=$(mktemp -d /tmp/parsort.XXXXX)
cat >$DIR/filenames
# Create a bunch of fifos for sorting into
cat $DIR/filenames | parallel -Xj0 mkfifo $DIR/sort-{#}
# Start the merge sort into the fifos
cat $DIR/filenames | parallel -Xj0 sort {} \>$DIR/sort-{#} &
# Create a bunch of fifos for merging into
parallel -X mkfifo $DIR/merge-{#} ::: $DIR/sort-*
# Multilevel merge
parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} ::: $DIR/sort-* &
# Read and merge from the fifos
sort --batch-size=1000 -m $DIR/merge-*
# Remove the fifos
rm -rf $DIR

For sorting the output from a pipe we need to split the pipe with --pipe. Unfortunately we do not know how big the input is, so we do not know how many chunks we will get. It would be handy if GNU Parallel could distribute the input among N processes, so that when one process finishes reading, that same process gets another chunk. Unfortunately that is not how GNU Parallel works today.

#!/bin/bash
DIR=$(mktemp -d /tmp/parsort.XXXXX)
parallel --block 100M --pipe --files sort -S110000000 > $DIR/sort-files
# Create a bunch of fifos for merging into
parallel -X mkfifo $DIR/merge-{#} :::: $DIR/sort-files
# Multilevel merge
parallel -X sort --batch-size=1000 -m {} \>$DIR/merge-{#} :::: $DIR/sort-files &
# Read and merge from the fifos
sort --batch-size=1000 -m $DIR/merge-*
# Remove the fifos and the tempfiles
rm -rf $(cat $DIR/sort-files) $DIR

None of the solutions above is perfect for all situations, but for some situations they should do a reasonable job. In my tests they are both faster than GNU sort.
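For the drop-in replacement part of your question, a minimal sketch: assuming the second script is saved as parsort-pipe.sh (the script and tool names here are just placeholders) and made executable, it reads the data to sort on stdin and writes the sorted result to stdout, so it can be called wherever the other tool's pipeline would otherwise call sort:

  chmod +x parsort-pipe.sh
  # instead of: some_tool | sort > output.txt
  some_tool | ./parsort-pipe.sh > output.txt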
/Ole