You're welcome. @jyapayne - Well, there is this:

```nim
import strutils, posix

proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'): continue
    var i = 0
    for col in line.split('\t'):
      if i >= 9:
        for fmt in col.split(':'):
          # do something with $fmt
          break
      i.inc

main()
```

which takes about 255 seconds (4m 15s) on @jyapayne's sample file on my machine. Then, if you are willing to have a cligen dependency, you can get it about 4x faster (67 seconds) with this:

```nim
import strutils, posix, cligen/mslice

proc main() =
  for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
    if line.startsWith('#'): continue
    var i = 0
    let msLine = toMSlice(line)
    for col in msLine.mSlices('\t'):
      if i >= 9:
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc

main()
```

The latter needs a `popen` analogue on Windows (I bet there's one in the stdlib, but I always just use `popen`), and he'll have to `nimble install cligen`, but he may want to anyway.
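Since the stdlib came up: a minimal, untested sketch of a portable route would use `osproc` instead of POSIX `popen`. It assumes a `gzip` executable is on `PATH`; the parsing would stay the same as above.

```nim
import strutils, osproc, streams

# Minimal sketch (untested): osproc spawns the decompressor as a child
# process, much like popen does; assumes a `gzip` binary is on PATH.
proc main() =
  let p = startProcess("gzip", args = ["-dc", "big.vcf.gz"],
                       options = {poUsePath})
  for line in p.outputStream.lines:
    if line.startsWith('#'): continue
    discard line  # same tab/colon splitting as in the versions above
  p.close

main()
```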
Another perhaps non-obvious note about the `popen` approach is that the decompressor runs as a separate process, and so in parallel if you have at least two idle cores. Hence, run time tends to be max(decompress time, parse time) rather than the sum of the two. (Bandwidth across a Unix pipe is usually >15 GB/s anyway.)

Anyway, Mark could probably run that 2nd version on his file. It might be a little faster or slower, but it is probably within 2x, which is still 5x faster than Groovy. It might be more than 10x faster.

To answer some of @markebbert's compression questions: it's a lot to cover, but I'll try to be brief. The `.zst`, `.gz`, `.bz2`, and `.xz` formats are all different, as are the related compression algorithms, but some tools support old formats for compatibility. You won't necessarily get a performance boost in any dimension (time/space/etc.), though. To get a good parallel speed-up on decompression, the compressor must prepare the input specially (as N independent file regions, so that N threads can decompress their streams independently). `pixz` can create such independent-region `.xz` files, decompressible with parallel speed-up by `pixz` itself, or without the speed-up by regular `xz -d`. There is a similar tool for `gzip` called `pigz` that can do the same for `.gz` files. I don't know whether `pzstd` can decompress such independent-region `.gz` files in parallel; `pigz` itself always used to focus only on compression (the slow but do-it-only-once part) instead of decompression (the do-it-many-times part, where it would be great to be fast in parallel). Even if `pzstd`, unlike `pigz`, can get parallel un-gzip going, you won't get the giant 163:1 compression ratios; in fact they will be somewhat worse than regular `gzip -9` ratios. Moving to Zstd-native compression algorithms & formats is probably the best advice, but also the most work in terms of education, persuasion, etc., as I've mentioned before. (A sketch of how the pipeline command would change is at the end of this post.)

Of course, even my `memchr`-based `mSlices` only runs about 2x faster than `gzip` decompresses. So max(4 seconds, 45 seconds) will be ~45 s, which isn't that much better than ~67 s. So "costs in context" and all that may apply. (I think 67 s is better than my earlier 78 s due to cache effects: decompressing to the pipe and throwing it all away while still in L2, vs. decompressing to a RAM filesystem; if anyone is keeping tabs on my numbers that closely, which admittedly seems unlikely.)
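And here is the sketch promised above: if the file were re-compressed into one of those parallel-friendly formats, only the command string handed to `popen` changes. The file names and tool flags here are from memory, so treat them as assumptions to check against your man pages.

```nim
import posix

# Hypothetical sketch: only the decompressor command changes; a parallel
# decode also requires the file to have been written in independent blocks
# (e.g. by pixz or pigz), as described above.
let cmd = "pixz -d < big.vcf.xz"     # parallel-capable .xz decode
# let cmd = "zstd -dc big.vcf.zst"   # Zstd-native format
# let cmd = "pigz -dc big.vcf.gz"    # .gz: decompression is still mostly serial
for line in lines(popen(cmd.cstring, "r".cstring)):
  discard line  # parse as in the examples above
```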