On Tuesday, 23 February 2021 at 00:08:40 UTC, tsbockman wrote:
On Friday, 19 February 2021 at 00:13:19 UTC, Jon Degenhardt wrote:
It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).

My program (called line-dedup below) is modestly faster than yours, with the gap gradually widening as files get bigger. Similarly, when not using a memory-mapped scratch file, my program is modestly less memory hungry than yours, with the gap gradually widening as files get bigger.

In neither case is the difference very exciting though; the real benefit of my algorithm is that it can process files too large for physical memory. It might also handle frequent hash collisions better, and could be upgraded to handle huge numbers of very short lines efficiently.

Thanks for running the comparison! I appreciate seeing how other implementations compare.

I'd characterize the results a bit differently, though. Based on the numbers, line-dedup is materially faster than tsv-uniq, at least on the tests run. To your point, it may not make much practical difference on data sets that fit in memory; tsv-uniq is fast enough for most needs. But it's still a material performance delta. Nice job!

I also agree that the bigger pragmatic benefit is fast processing of files much larger than will fit in memory. There are other useful problems like this. One I often need is creating a random weighted ordering: easy for data sets that fit in memory, but hard to do fast for data sets that do not.
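
For the in-memory case, one common approach is to key each item with -log(u)/w (u uniform in (0,1), w the item's weight) and sort ascending by key; heavier items then tend to land earlier. Below is a minimal D sketch of that idea. The helper name weightedOrder and the exponential-key trick are my own choices for illustration, not a description of how tsv-sample or line-dedup actually work, and it is exactly the "everything in memory" version that stops scaling once the keys no longer fit.

import std.algorithm : sort;
import std.math : log;
import std.random : uniform;
import std.stdio : writeln;

string[] weightedOrder(string[] items, double[] weights)
{
    assert(items.length == weights.length);

    // Key each item with -log(u)/w, u uniform in (0,1). Sorting ascending by
    // these keys yields a weighted random permutation: items with larger
    // weights are more likely to receive small keys and appear early.
    auto keys = new double[](items.length);
    foreach (i, w; weights)
        keys[i] = -log(uniform!"()"(0.0, 1.0)) / w;

    // Sort item indices by key, then emit the items in that order.
    auto order = new size_t[](items.length);
    foreach (i, ref ix; order) ix = i;
    order.sort!((a, b) => keys[a] < keys[b]);

    auto result = new string[](items.length);
    foreach (pos, ix; order) result[pos] = items[ix];
    return result;
}

void main()
{
    // "b" has the largest weight, so it will usually come first.
    writeln(weightedOrder(["a", "b", "c"], [1.0, 10.0, 0.1]));
}

The hard part, as you say, is doing the equivalent when the keys and data can't all be held (and sorted) in memory at once; that's where an external-memory approach like yours becomes interesting.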

--Jon

