For what it's worth, and for completeness in case Windows portability matters 
here (as @markebbert mentioned, these science things are often one-time 
deals), the following works but is 6x slower (405 sec, i.e. 6 min 45 sec) than 
the `popen`/`mSlices` variant: 
    
    
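    import strutils, osproc, streams, cligen/mslice
    proc main() =
      # poEvalCommand hands the string to the shell, so the `<` redirect works
      let p = startProcess("gzip -dc < big.vcf.gz", options={poEvalCommand})
      let outp = p.outputStream
      var line = newStringOfCap(4096)   # line buffer reused across iterations
      while outp.readLine(line):
        if line.startsWith('#'):        # skip VCF header/meta lines
          continue
        var i = 0
        let msLine = toMSlice(line)
        for col in msLine.mSlices('\t'):
          if i >= 9:                    # per-sample columns start at index 9
            for fmt in col.mSlices(':'):
              # do something with $fmt
              break
          i.inc
      p.close()                         # release the child process handles
    main()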
    
    

That `streams` code needs some better line-buffering love, though. { Or `osproc` 
could expose a `File` instead of a `Stream`. } `system/io.nim:readLine(File,..)` 
used to have a similarly slow, almost identical implementation.
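
To make the `File` point concrete, here is a minimal Unix-only sketch of reading a child's output through a C `FILE*` rather than a `Stream`. It assumes `popen`/`pclose` from the stdlib `posix` module; `countDataLines` is a made-up name for illustration, it only counts non-header lines, and I have not timed this exact code. The per-column `mSlices` work from the loop above would go where the counter is bumped.

```nim
import posix   # popen/pclose; Unix only

proc countDataLines(cmd: string): int =
  ## Hypothetical helper: run `cmd` via the shell and count the output
  ## lines that do not start with '#'.
  let f = popen(cmd.cstring, "r")     # a File, i.e. a C FILE*
  if f == nil:
    raise newException(IOError, "popen failed: " & cmd)
  defer: discard pclose(f)            # reap the child when the proc exits
  for line in lines(f):               # File-based line reading
    if line.len > 0 and line[0] == '#':
      continue                        # skip header/meta lines
    inc result                        # per-line field splitting would go here

when isMainModule:
  echo countDataLines("gzip -dc < big.vcf.gz")
```

This is not portable to Windows as written, which is the trade-off against the `osproc` version.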

But the clear speed winners so far are either the `mopen` variant on a 
decompressed file, if you have the space/RAM, or, if you run on Unix, the 
`lines(popen())`-`mSlices` variant (with the file re-encoded with `pzstd` if 
you need to process it many times). { If `nimble install` doesn't work for you, 
in a pinch you could always `git clone https://github.com/c-blake/cligen`, 
copy `cligen/mslice.nim` into the same dir as your program, and adjust the 
`import` to its unqualified name. I get that Araq doesn't want to rely upon 
libc `memchr` being fast or to support different compile-time/run-time 
versions, but 4X slower is a pretty big hit. That's why I tossed `mslice` into 
`cligen`, so others might benefit. I'm not even sure `mSlices` is as fast as 
possible and, as I mentioned, various overheads clearly depend on 
string/substring lengths. }
