FWIW, I suspect the answer to all this noise is that @markebbert was simply not 
using an optimized compile (as suggested by the very first line of the very 
first response to him).

@jyapayne - what I did was go to 
[https://vcftools.github.io/index.html](https://vcftools.github.io/index.html), 
download the distro, and find the biggest VCF file in there 
(`contrast.vcf`). Then I did something like this: 
    
    
    #!/bin/zsh -f
    head -n46 contrast.vcf > head.vcf    # the 46 header/meta lines
    tail -n+47 contrast.vcf > tail.vcf   # the data rows
    hardTab=$'\t'
    ## `cols` is from `cligen/examples`
    cols -c -d "$hardTab" {1..7} < tail.vcf > tail-more.vcf
    ## Above has 106 distinct rows. Probably diverse enough.
    ## Widen each row to ~15,000 columns, then stack 30 copies of the rows.
    paste tail.vcf $(repeat 2500 echo tail-more.vcf) > tail-wide.vcf
    cat head.vcf $(repeat 30 echo tail-wide.vcf) > big.vcf
    
    

That last file is only about 1.5 GB, likely about 60x smaller than 
@markebbert's, but it should have otherwise similar statistics (15,000 columns) 
and fit in almost anyone's RAM. (It also compresses via Zstd to under 500 
_kB_, due to the way it was synthesized, and decompresses in under 1/4 of a 
second for me...)

Then I just ran his initial Python and Nim (dropping the gzip stuff and 
adjusting path names). I reproduced the unsurprising Nim debug-build slowness 
(11.8 seconds), with his Python running at 4.81 seconds. Then with an optimized 
build (`nim c -d:danger`) I got the Nim running in 2.07 seconds, about 2.3X 
faster than the Python.

Then, just for kicks, I did a version using the libraries I alluded to, which 
reuses the same two `seq`s for column outputs, and got the Nim down to 0.984 
seconds, almost 5X faster than his Python: 
    
    
    import cligen/[mfile, mslice]
    
    proc main() =
    # var genotype: string           # unused; the Python did no work here
      var cols, subCols: seq[MSlice] # reused each line to avoid realloc churn
      for line in mSlices(mopen("big.vcf")):   # memory-mapped line iterator
        if line.len > 0 and line[0] == '#':
          continue                   # skip header/meta lines
        discard line.msplit(cols, '\t', 0)
        var i = 0                    # column index, reset per line
        for col in cols:
          if i >= 9:                 # fields 9+ are the per-sample columns
            discard col.msplit(subCols, ':', 0)
            for fmt in subCols:
    #         genotype = $fmt        # Py did not do anything here..
              break                  # only the first sub-field (genotype)
          inc(i)
    main()
    
    
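As before, that timing is for an optimized build, i.e. plain `nim c -d:danger` 
on the file.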

What further optimizations make sense, such as eliding many splits by bounding 
the number of columns split or using iterators rather than `seq`s, ultimately 
depends upon what further calculation he was intending to do with the parsed 
data.

Scaling up to his problem size, this last version would translate to about 1 
minute of run time vs. his initial 400 minutes. "In real life", likely 90+% of 
his time would be spent waiting on `gunzip`. Literally any compressor allowing 
parallel decompression (`pixz`, `pzstd`, etc.), or just fast single-threaded 
decompression like `lz4`, would probably be much less of a pain point for him. 
I realize, though, that he may have piles of giant data files already "trapped 
behind gunzip" in ways beyond his control. Converting them may still help him 
if he has many repeated calculations to do and the disk space.
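
If he does convert, even a one-time, serial `gunzip -c old.vcf.gz | zstd > 
old.vcf.zst` (hypothetical file names) per file pays the `gunzip` toll just 
once instead of on every pass over the data.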
