Well, I don't really have a blog. So, this is what you get. ;-) Someone else
is welcome to blog about it, though. Ideally, just give me credit by linking
back to here. Or, if you can make any of that `cligen/mslice.*split*` faster,
then a PR is welcome.
As a slight update, storing `big.vcf` in `/tmp` (a tmpfs aka `/dev/shm` RAM
filesystem) and doing profile-guided optimization with gcc (via a little
`nim-pgo vsn2 ./vsn2` script I have around), I get about 15% improvement down
to 0.85 s. Linux `perf` tells me about 58% of that time is spent in
`__memchr_avx2`, which is already hand-tuned assembly.
That is probably about as fast as you can get (at least on my Skylake
generation CPU) in any language. You might be able to eke out another 10..20%
or more if you hand-rolled everything in assembly and did just the right
prefetches/branch-predictor gaming. Or you might not. Parsing at 1.85 GB/s
isn't so bad. For reference, on that machine single-threaded RAM bandwidth is
about 32 GB/s, and as mentioned Zstd can spit out that compressed file about 3x
faster { though that is multi-threaded over 4+ cores. So, it's actually
slightly slower on a single-core basis }.
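A quick back-of-the-envelope check of those figures (assuming, as stated, Zstd is ~3x faster but spread over ~4 cores; the variable names are just mine):

```python
# Figures quoted above, as rough assumptions.
parse_gbps = 1.85            # single-threaded parse throughput
ram_gbps = 32.0              # single-threaded RAM bandwidth
zstd_gbps = 3 * parse_gbps   # Zstd decompression, ~3x the parse, multi-threaded
zstd_cores = 4               # assumed core count for Zstd

print(f"parse uses ~{parse_gbps / ram_gbps:.0%} of single-thread RAM bandwidth")
print(f"Zstd per core: {zstd_gbps / zstd_cores:.2f} GB/s")  # below 1.85 GB/s
```

So the parse only touches a few percent of memory bandwidth, and per core the parser really does edge out the decompressor.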
Anyway, I doubt there is any real Nim problem here. A better use of time might
have been to just wait for @markebbert to respond to the very first line of the
very first response.