Well, I don't really have a blog. So, this is what you get. ;-) Someone else is welcome to write it up, though. Ideally, just give me credit by linking back to here. Or, if you can make any of that `cligen/mslice.*split*` code faster, then a PR is welcome.
As a slight update: storing `big.vcf` in `/tmp` (a tmpfs, aka `/dev/shm`, RAM filesystem) and doing profile-guided optimization with gcc (via a little `nim-pgo vsn2 ./vsn2` script I have around), I get about a 15% improvement, down to 0.85 s. Linux `perf` tells me about 58% of that time is spent just in `__memchr_avx2`, which is already hand-tuned assembly. That is probably about as fast as you can get (at least on my Skylake-generation CPU) in any language. You might be able to eke out another 10..20% or more if you hand-rolled everything in assembly and did just the right prefetches/branch-predictor gaming. Or you might not.

Parsing at 1.85 GB/s isn't so bad. For reference, on that machine single-threaded RAM bandwidth is about 32 GB/s, and as mentioned Zstd can spit out that compressed file about 3x faster { though that is multi-threaded over 4+ cores, so it's actually slightly slower on a single-core basis }. Anyway, I doubt there is any real Nim problem here. A better use of time might have been to just wait for @markebbert to respond to the very first line of the very first response.