Some additional observations:

* Running on tmpfs on Linux is ~30% faster than with ImDisk on W10 for me.
* WinFSP is still unreliable for any serious work: the accumulated CSV was silently truncated on write, with no errors reported.
* Using channels for threading adds significant memory overhead. Some permutations even hit OOM on my 16GB laptop, even though there shouldn't really be much copying. It looks like memory isn't freed fast enough, but I didn't investigate. This is for my pathological data set (long tables, short seqs), of course.
* Using a global table accessed from threads with a lock is just overhead. Interestingly, it's slower than the single-threaded version on Linux, but faster on Windows. Collecting intermediate per-thread tables and then merging them could be faster than immediate access.

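To make the last observation concrete, here is a rough sketch of the two layouts: one shared table guarded by a lock, versus one private table per thread merged after the join. It is written in Python purely for illustration and is not the code I benchmarked; `count_shared`, `count_merged` and the toy `chunks` input are hypothetical names. Python's GIL also means this won't reproduce the timings above, it only shows where the lock goes away.

```python
import threading
from collections import Counter


def _run_all(threads):
    """Start every thread, then wait for all of them to finish."""
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_shared(chunks):
    """Every thread updates one global table under a lock."""
    table = Counter()
    lock = threading.Lock()

    def worker(chunk):
        for key in chunk:
            with lock:  # taken once per item: this is the contention point
                table[key] += 1

    _run_all([threading.Thread(target=worker, args=(c,)) for c in chunks])
    return table


def count_merged(chunks):
    """Each thread fills a private table; tables are merged after the join."""
    partials = [Counter() for _ in chunks]

    def worker(chunk, local):
        for key in chunk:
            local[key] += 1  # no lock: only this thread touches `local`

    _run_all([
        threading.Thread(target=worker, args=(c, p))
        for c, p in zip(chunks, partials)
    ])

    total = Counter()
    for p in partials:  # single-threaded merge, one pass per partial table
        total.update(p)
    return total


if __name__ == "__main__":
    chunks = [["a", "b", "a"], ["b", "c"], ["a"]]  # toy input, one chunk per thread
    assert count_shared(chunks) == count_merged(chunks)
```

The merge step costs one extra pass over each partial table, so whether this wins depends on how large the per-thread tables get relative to the number of updates.
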
All in all, adding threading with what's readily available turned out pretty meh. Maybe it's just _my_ code.