On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
> On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
>> It would be interesting to see how the performance compares to
>> tsv-uniq
>> (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The
>> prebuilt binaries turn on all the optimizations
>> (https://github.com/eBay/tsv-utils/releases).
>
> My program (called line-dedup below) is modestly faster than
> yours, with the gap gradually widening as files get bigger.
> Similarly, when not using a memory-mapped scratch file, my
> program is modestly less memory hungry than yours, with the gap
> gradually widening as files get bigger.
>
> In neither case is the difference very exciting though; the
> real benefit of my algorithm is that it can process files too
> large for physical memory. It might also handle frequent hash
> collisions better, and could be upgraded to handle huge numbers
> of very short lines efficiently.
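For readers unfamiliar with the larger-than-memory case discussed above, one standard way to deduplicate such a file is hash partitioning into scratch files, then deduplicating each (now small) partition in memory. This is only an illustrative sketch, not line-dedup's or tsv-uniq's actual algorithm, and `dedup_external` is a hypothetical name:

```python
import os
import tempfile

def dedup_external(in_path, out_path, n_buckets=64):
    """Deduplicate lines of a file that may not fit in memory.

    Pass 1: hash each line into one of n_buckets scratch files, so
    duplicate lines always land in the same bucket.
    Pass 2: dedup each bucket in memory (each holds roughly
    1/n_buckets of the input) and append survivors to the output.
    Note: unlike tsv-uniq, this does not preserve input line order.
    """
    with tempfile.TemporaryDirectory() as tmp:
        buckets = [open(os.path.join(tmp, "b%d" % i), "w+")
                   for i in range(n_buckets)]
        with open(in_path) as f:
            for line in f:
                buckets[hash(line) % n_buckets].write(line)
        with open(out_path, "w") as out:
            for b in buckets:
                b.seek(0)
                seen = set()        # only one bucket's lines at a time
                for line in b:
                    if line not in seen:
                        seen.add(line)
                        out.write(line)
                b.close()
```

Peak memory is bounded by the largest bucket rather than the whole file, at the cost of a second pass over the data.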
Thanks for running the comparison! I appreciate seeing how other
implementations compare.
I'd characterize the results a bit differently though. Based on
the numbers, line-dedup is materially faster than tsv-uniq, at least
on the tests run. To your point, it may not make much practical
difference on data sets that fit in memory. tsv-uniq is fast
enough for most needs. But it's still a material performance
delta. Nice job!
I agree also that the bigger pragmatic benefit is fast processing
of files much larger than will fit in memory. There are other
useful problems like this. One I often need is creating a random
weighted ordering. Easy to do for data sets that fit in memory,
but hard to do fast for data sets that do not.
--Jon