I decided to time this several ways just for kicks. I did this on a 16-processor Xeon X7350, using 96 very compressible text files of about 95 MB compressed each (total 9 GB compressed, 236 GB uncompressed), with the input in cache, reading and writing on different 300+ MB/sec RAID arrays. YMMV.
1. If you just need to treat your files as one big file for streaming input to some other program, you can use process substitution: other-prog <(zcat *.fastq.gz). This is about the fastest and most space-efficient you can hope for, but it may not work in your situation.

   0 sec (1271 sec to run cat <(zcat *.fastq.gz) > /dev/null)

2. Note that the concatenation of multiple gzip files is a valid gzip file, so you may not need to unzip them. Beware that there are some programs that don't correctly unzip such files (I'm looking at you, Java).

   $ cat *.fastq.gz > output.fastq.gz
   44 sec

3. If you do end up needing to recompress them, you could look into pigz, the "parallel implementation of gzip." Note that it puts out the same kind of concatenated gzip files that some systems don't read correctly.

   $ zcat *.fastq.gz | pigz > output.fastq.gz
   1320 sec

3.5. gzip files can't really be decompressed in parallel, but unpigz tries its best:

   $ unpigz -c *.fastq.gz | pigz > output.fastq.gz
   1099 sec

4. If you're really stuck on parallel ;), you have lots of free memory (size of largest uncompressed input * number of processes), and don't care in what order the files are merged, something like this might give you a small improvement:

   $ TMPDIR=/dev/shm parallel zcat ::: *.fastq.gz | pigz > output.fastq.gz
   961 sec

And finally, the simplest and most compatible but slowest method:

   $ zcat *.fastq.gz | gzip > output.fastq.gz
   3259 sec

Good hunting.

On Tue, Nov 8, 2011 at 12:07 AM, vijai2007 <[email protected]> wrote:
> Hello,
> I have about 99 files in the file name format:
> SRRA_ATCACG_L008_R1_001.fastq.gz to SRRA_ATCACG_L008_R1_099.fastq.gz
> I want to unzip and merge them to a single fastq file.
> These are from the Illumina CASAVA 1.8.1
> How do I do this in GNU parallel?
> Thanks
> vijai
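
P.S. If you want to convince yourself that option 2 (plain cat of the .gz files) is safe with your tools before running it on the real data, a small check along these lines might help. This is only a sketch: the two tiny FASTQ files and their names (a.fastq, b.fastq, merged.fastq.gz) are made up for illustration, and it only shows that gzip itself reads the concatenation correctly, not that your downstream program does.

```shell
#!/bin/sh
# Hypothetical demo: file names and read contents are invented for illustration.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two tiny FASTQ-like inputs, compressed separately.
printf '@read1\nACGT\n+\nIIII\n' > "$tmp/a.fastq"
printf '@read2\nTTGA\n+\nIIII\n' > "$tmp/b.fastq"
gzip -c "$tmp/a.fastq" > "$tmp/a.fastq.gz"
gzip -c "$tmp/b.fastq" > "$tmp/b.fastq.gz"

# Merge the compressed files byte-for-byte, with no recompression.
cat "$tmp/a.fastq.gz" "$tmp/b.fastq.gz" > "$tmp/merged.fastq.gz"

# Decompress the merged file and compare with the concatenated plain text.
expected=$(cat "$tmp/a.fastq" "$tmp/b.fastq")
merged=$(gzip -dc "$tmp/merged.fastq.gz")

if [ "$merged" = "$expected" ]; then
    echo "match"
else
    echo "MISMATCH"
fi
```

To test a specific downstream program, replace the gzip -dc step with that program reading merged.fastq.gz and check that it sees both reads rather than stopping after the first.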
