On Wednesday, 17 February 2021 at 04:10:24 UTC, tsbockman wrote:
I spent some time experimenting with this problem, and here is the best solution I found, assuming that perfect de-duplication is required. (I'll put the code up on GitHub / dub if anyone wants to have a look.)

It would be interesting to see how the performance compares to tsv-uniq (https://github.com/eBay/tsv-utils/tree/master/tsv-uniq). The prebuilt binaries turn on all the optimizations (https://github.com/eBay/tsv-utils/releases).

tsv-uniq wasn't included in the comparative benchmarks I published, but I did run my own benchmarks and it holds up well. Still, it should not be hard to beat. The more interesting question is how large the delta is.

tsv-uniq uses the most straightforward approach of popping things into an associative array. No custom data structures. All the unique keys have to fit in memory, so it won't handle arbitrarily large data sets. It would be interesting to see how the straightforward approach compares with the more highly tuned approach.
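
For anyone who wants a picture of what "straightforward" means here, below is a rough sketch of the idea in Python. It is not tsv-uniq's actual code (tsv-uniq is written in D and supports key fields and other options). The function name `unique_lines` is made up for illustration, and the whole line is assumed to be the key.

```python
import sys
from typing import Iterable, Iterator


def unique_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, in first-seen order.

    Hypothetical helper for illustration; the whole line is the key.
    """
    seen = set()  # holds every unique key, so memory grows with the number of unique lines
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    # Filter standard input to standard output, dropping repeated lines.
    sys.stdout.writelines(unique_lines(sys.stdin))
```

The hash set plays the role of the associative array. It gives exact de-duplication in a single pass, at the cost of keeping every unique key in memory, which is the limitation mentioned above.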

--Jon
