FWIW, I suspect the answer to all this noise is that @markebbert was simply not using an optimized compile (as suggested by the very first line of the very first response to him).
@jyapayne - what I did was go to [https://vcftools.github.io/index.html](https://vcftools.github.io/index.html), download the distro, and find the biggest VCF file in there (`contrast.vcf`). Then I did something like this:

```zsh
#!/bin/zsh -f
head -n46 contrast.vcf > head.vcf
tail -n+47 contrast.vcf > tail.vcf
hardTab=$'\t'
## `cols` is from `cligen/examples`
cols -c -d "$hardTab" {1..7} < tail.vcf > tail-more.vcf
## Above has 106 distinct rows.  Probably diverse enough.
paste tail.vcf $(repeat 2500 echo tail-more.vcf) > tail-wide.vcf
cat head.vcf $(repeat 30 echo tail-wide.vcf) > big.vcf
```

That last file is only about 1.5 GB, likely about 60x smaller than @markebbert's, but it should have otherwise similar statistics (15,000 columns) and fit in almost anyone's RAM. (It also compresses via Zstd to under 500 _kB_ due to the way it was synthesized, and decompresses in under 1/4 of a second for me...)

Then I just ran his initial Python and Nim (dropping the gzip stuff and adjusting path names). I reproduced the unsurprising Nim debug-build slowness (11.8 seconds), with his Python running at 4.81 seconds. Then with `-d:danger` I got the Nim running in 2.07 seconds, about 2.3X faster than the Python. Then, just for kicks, I did a version using the libraries I alluded to, which re-uses the same two `seq`s for column outputs, and got the Nim down to 0.984 seconds, almost 5X faster than his Python:

```nim
import cligen/[mfile, mslice]

proc main() =
  # var genotype: string
  var i = 0
  var cols, subCols: seq[MSlice]
  for line in mSlices(mopen("big.vcf")):
    if line.len > 0 and line[0] == '#':
      continue
    discard line.msplit(cols, '\t', 0)
    for col in cols:
      if i >= 9:
        discard col.msplit(subCols, ':', 0)
        for fmt in subCols:
          # genotype = $fmt  # Py did not do anything here..
          break
      inc(i)

main()
```

What further optimizations make sense (eliding many splits by bounding columns, using iterators rather than `seq`s, etc.) ultimately depends upon what further calculation he was intending to do with the parsed data. Scaling up to his problem size, this last version would translate to about 1 minute of run time vs. his initial 400 minutes.

"In real life", likely 90+% of his time would be spent waiting on `gunzip`. Literally any compressor allowing parallel decompression (`pixz`, `pzstd`, etc.), or just fast single-threaded decompression like `lz4`, would probably be much less of a pain point for him. I realize, though, that he may have piles of giant data files already "trapped behind gunzip" in ways beyond his control. Converting them may still help him if he has many repeated calculations to do and the disk space.
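To make the "trapped behind gunzip" point concrete: on the Python side, one way to keep decompression from serializing the whole pipeline is to stream the file through an external decompressor process rather than decompressing in-process with the `gzip` module. This is only a sketch of the idea; the specific commands (`pzstd -dc`, `lz4 -dc`, etc.) are suggestions, not something from the original posts:

```python
# Sketch: stream a compressed file through an external decompressor so that
# decompression (possibly multi-threaded, e.g. pzstd/pixz) runs in a separate
# process and overlaps with parsing, instead of using gzip.open in-process.
import subprocess

def lines_via(decomp_cmd, path):
    """Yield text lines of `path` piped through `decomp_cmd` (a list like
    ["pzstd", "-dc"] or ["lz4", "-dc"] -- hypothetical choices)."""
    proc = subprocess.Popen(decomp_cmd + [path],
                            stdout=subprocess.PIPE, text=True)
    try:
        yield from proc.stdout
    finally:
        proc.stdout.close()
        proc.wait()
```

Usage would look like `for line in lines_via(["pzstd", "-dc"], "big.vcf.zst"): ...`, with the parsing loop otherwise unchanged; the decompressor keeps its own cores busy while Python splits fields.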
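For completeness, the Python side of the comparison does per-line `str.split` work along roughly these lines. This is my reconstruction of the *shape* of the benchmark, not @markebbert's actual script: the file name, the gzip handling being dropped, and the "do nothing per genotype field" inner loop all mirror the description above rather than his code:

```python
# Hypothetical reconstruction of the kind of per-line parsing being timed:
# skip '#' header lines, split each data line on tabs, then split each
# genotype column (index >= 9 in a VCF) on ':' without doing further work.
def parse_vcf(path):
    n_genotype_cols = 0
    with open(path) as f:
        for line in f:
            if line.startswith('#'):          # header / meta lines
                continue
            cols = line.rstrip('\n').split('\t')
            for i, col in enumerate(cols):
                if i >= 9:                    # genotype columns in a VCF
                    for fmt in col.split(':'):
                        break                 # original did no per-field work
                    n_genotype_cols += 1
    return n_genotype_cols
```

With ~15,000 columns per line, almost all of the time goes into the tab split and the per-column `':'` splits, which is exactly what the `MSlice`-reusing Nim version above avoids re-allocating for.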