Compiled with `-d:danger -d:release` and gcc-9.2 on an i7-6700k at 4.8GHz, this 
runs in about 46 seconds for me against the decompressed file in a RAM filesystem: 
    
    
    import cligen/[mfile, mslice]
    proc main() =
      for line in mSlices(mopen("big.vcf")):
        if line.len > 0 and line[0] == '#':
          continue
        var i = 0
        for col in line.mSlices('\t'):
          if i >= 9:
            for fmt in col.mSlices(':'):
              # do something with $fmt
              break
          i.inc
    main()
    
    

Note that the `i = 0` initialization should be moved down, inside the per-line 
loop as above, for the loops to be equivalent. The first `[9..]` Nim slice 
version was not quite converted correctly to the iterator version. That mistake 
propagated to @jyapayne's version. Mark may well have caught that already.
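To make the counter placement concrete, here is a minimal stdlib-only sketch of the same per-line logic. The proc name `firstFmtFields` and the sample line are made up for illustration. It uses `strutils.split`, which allocates a new string per field, so it only shows the logic and is not a stand-in for the `mSlices` timing above.

```nim
import std/strutils

proc firstFmtFields(line: string): seq[string] =
  ## Hypothetical helper: for every column at index >= 9, keep only the
  ## first ':'-separated field, like `line.split('\t')[9..]` would visit.
  var i = 0                      # declared per line, so it restarts at 0 each call
  for col in line.split('\t'):
    if i >= 9:
      for fmt in col.split(':'):
        result.add fmt           # "do something with fmt"
        break                    # only the first sub-field is wanted
    i.inc

when isMainModule:
  # Made-up line: 9 fixed columns followed by 2 sample columns.
  let line = "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:30"
  doAssert firstFmtFields(line) == @["0/1", "1/1"]
```

If `i` were declared once outside the loop over lines, it would stay at 9 or above after the first data line, and every later line would have its first 9 columns processed too.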

Also note that there was some kind of misunderstanding earlier about `maxSplit` 
helping a lot, but the code seems to want to parse all columns _except_ the 9 
early header columns.

Anyway, time beyond about 40 seconds to 1 minute of "parsing time" should just 
be IO. And that IO could be reduced to probably 4 seconds with Zstd, but may be 
more like 1.5 minutes with gzip. So, I might expect times somewhere in the 2-3 
minute range for @jyapayne's sample file.

The statistics of the two files sound pretty different. @jyapayne's example file 
has 3.5e6 `\n` chars and 8.7e9 `\t` chars (about 2,500 per line), while 
@markebbert reportedly has 300e3 `\n` chars and 4.5e9 `\t` chars (300k*15k). I 
might expect that @markebbert's file would run faster, but his data columns may 
be larger.
