Compiled with `-d:danger -d:release` and gcc-9.2, on an i7-6700k at 4.8GHz, this runs in about 46 seconds for me against the decompressed file in a RAM filesystem:

```nim
import cligen/[mfile, mslice]

proc main() =
  for line in mSlices(mopen("big.vcf")):
    # skip header/comment lines
    if line.len > 0 and line[0] == '#':
      continue
    var i = 0                         # column counter, reset for every line
    for col in line.mSlices('\t'):
      if i >= 9:                      # skip the 9 leading fixed columns
        for fmt in col.mSlices(':'):
          # do something with $fmt
          break
      i.inc

main()
```
Note that the `i = 0` should be moved down (as it is above) for the loop to be similar. The first `[9..]` Nim slice version was not quite converted correctly to the iterator version, and that mistake propagated to @jyapayne's version. Mark may well have caught that already.

Also note that there was some kind of misunderstanding earlier about `maxSplit` helping a lot, but the code seems to want to parse all columns _except_ the 9 early header columns.

Anyway, anything beyond about 40 seconds to 1 minute of "parsing time" should just be IO. That IO could probably be reduced to about 4 seconds with Zstd, but may be more like 1.5 minutes with gzip. So, I might expect times somewhere in the 2-3 minute range for @jyapayne's sample file.

The statistics of the two files sound pretty different. @jyapayne's example file has 3.5e6 \n chars and 8.7e9 \t chars (about 2,500 tabs per line), while @markebbert's reportedly has 300e3 \n chars and 4.5e9 \t chars (300k*15k). I might expect that @markebbert's might run faster, but his data columns may be larger.
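To make the `i = 0` placement issue concrete, here is a minimal sketch using only `std/strutils` rather than `cligen`. It assumes the earlier mistake was a column counter declared once outside the per-line loop; the `firstFields` proc, its `resetPerLine` flag, and the two-line `sample` are hypothetical, made up purely for illustration.

```nim
import std/strutils

# Hypothetical sample: 9 fixed columns followed by 2 data columns per line.
const sample = [
  "c1\tp\tid\tr\ta\tq\tf\ti\tfmt\t0/1:10\t1/1:20",
  "c1\tp\tid\tr\ta\tq\tf\ti\tfmt\t0/0:30\t0/1:40",
]

proc firstFields(lines: openArray[string]; resetPerLine: bool): seq[string] =
  ## Collects the first ':'-separated field of every column whose
  ## counter value is >= 9.
  var i = 0
  for line in lines:
    if resetPerLine:
      i = 0                  # the "moved down" placement: restart on each line
    for col in line.split('\t'):
      if i >= 9:
        result.add col.split(':')[0]
      i.inc

echo firstFields(sample, resetPerLine = true)
echo firstFields(sample, resetPerLine = false)
```

With the reset, only the data columns of each line are collected. Without it, the counter is already past 9 when the second line starts, so that line's 9 leading columns get processed as if they were data.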