You're welcome.

@jyapayne - Well, there is this:
    
    
    import strutils, posix
    proc main() =
      for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
        if line.startsWith('#'):
          continue
        var i = 0
        for col in line.split('\t'):
          if i >= 9:
            for fmt in col.split(':'):
    #         do something with $fmt
              break
          i.inc
    main()
    
    

which takes about 255 seconds (4m 15s) on @jyapayne's sample file on my 
machine, and then if you are willing to have a cligen dependency you can get it 
about 4x faster (67 seconds) with this: 
    
    
    import strutils, posix, cligen/mslice
    proc main() =
      for line in lines(popen("gzip -dc < big.vcf.gz".cstring, "r".cstring)):
        if line.startsWith('#'):
          continue
        var i = 0
        let msLine = toMSlice(line)
        for col in msLine.mSlices('\t'):
          if i >= 9:
            for fmt in col.mSlices(':'):
    #         do something with $fmt
              break
          i.inc
    main()
    
    

The latter needs a `popen` analogue on Windows (I bet there's one in the stdlib, but 
I always just use `popen`), and he'll have to `nimble install cligen`, but he 
may want to do that anyway.
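
For what it's worth, a rough cross-platform sketch (untested by me on Windows) would use `osproc.startProcess` from the stdlib instead of `popen`; it assumes a `gzip` executable is on the PATH and reuses the same parsing loop:

    import osproc, streams, strutils
    proc main() =
      # startProcess + outputStream replaces posix.popen; should work wherever
      # a gzip executable can be found on the PATH.
      let p = startProcess("gzip", args = ["-dc", "big.vcf.gz"],
                           options = {poUsePath})
      for line in p.outputStream.lines:
        if line.startsWith('#'):
          continue
        var i = 0
        for col in line.split('\t'):
          if i >= 9:
            for fmt in col.split(':'):
              # do something with fmt
              break
          i.inc
      p.close
    main()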

Another perhaps non-obvious note about the `popen` approach is that the 
decompressor runs as a separate process, and so runs in parallel if you have at 
least two idle cores. Hence, run-time tends to be max(decompress time, parse 
time) rather than the sum of the two. (Bandwidth across a Unix pipe is 
usually >15 GB/s anyway.) Mark could probably run that 2nd version on 
his file. It might be a little faster or slower, but it is probably within 2x, 
which is still 5x faster than Groovy, and it might be more than 10x faster.

To answer some of @markebbert's compression questions: it's a lot to cover, but 
I'll try to be brief. The `.zst`, `.gz`, `.bz2`, and `.xz` formats are all 
different, as are the related compression algorithms, but some tools support old 
formats for compatibility. You won't necessarily get any performance boost in 
any dimension (time/space/etc.), though. To get a good parallel speed-up on 
decompression, the compressor must lay out the compressed file specially (in N 
independent file regions, so that N threads can decompress their streams 
independently). `pixz` can create such independent-region `.xz` files, 
decompressible with a parallel speed-up by `pixz` itself or without one by 
regular `xz -d`. There is a similar tool for `gzip`, called `pigz`, that can do 
the same for `.gz` files.
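
As a minimal sketch of what that buys on the reading side, suppose the file were recompressed with `pixz` into a hypothetical `big.vcf.xz` with independent regions; the only change to the Nim code above is the command handed to `popen`, and `pixz -d` can then decompress in parallel on the far side of the pipe:

    import strutils, posix
    proc main() =
      var n = 0
      # big.vcf.xz is a hypothetical pixz-made file; parse each line as in the
      # earlier examples instead of just counting them.
      for line in lines(popen("pixz -d < big.vcf.xz".cstring, "r".cstring)):
        if not line.startsWith('#'):
          n.inc
      echo n, " data lines"
    main()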

I don't know if `pzstd` can do for `.zst` what `pigz` does for `.gz` with 
independent regions. `pigz` itself has always focused only on compression (the 
slow, but do-it-only-once part) rather than decompression (the do-it-many-times, 
wouldn't-it-be-great-to-do-in-parallel part).

Even if `pzstd`, unlike `pigz`, can get parallel decompression going, you won't 
get the giant 163:1 compression ratios, though; in fact the ratios will be 
somewhat worse than regular `gzip -9` ratios. Moving to Zstd-native compression 
algorithms & formats is probably the best advice, but also the most work in 
terms of education, persuasion, etc., as I've mentioned before.
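
If the data does move to a native Zstd format, nothing on the Nim side changes beyond the command string; a sketch with a hypothetical `big.vcf.zst` (written by, say, `zstd -19`):

    import strutils, posix
    proc main() =
      # Only the decompressor in the popen command differs; the parsing
      # stays exactly as in the earlier examples.
      for line in lines(popen("zstd -dc < big.vcf.zst".cstring, "r".cstring)):
        if line.startsWith('#'): continue
        discard line.split('\t')  # split/parse columns as before
    main()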

Of course, even my `memchr`-based `mSlices` only runs about 2x faster than 
`gzip` decompresses. So max(4 seconds, 45 seconds) will be ~45s, which isn't 
that much better than ~67s. So, "costs in context" and all that may apply. (I 
think the 67s is better than my earlier 78s due to cache effects: decompressing 
to the pipe & throwing it all away while still in L2, vs. decompressing to a 
RAM filesystem; that's if anyone is keeping tabs on my numbers that closely, 
which admittedly seems unlikely.)
