Github user hvanhovell commented on the pull request:

https://github.com/apache/spark/pull/8362#issuecomment-133566608

Thanks. I was aiming for compatibility with the existing approxCountDistinct, but we can also implement HLL++. HLL++ introduces three (orthogonal) refinements: 64-bit hashing, better low cardinality corrections and a sparse encoding scheme. The first two refinements are easy to add. The third will require a bit more effort.

Unit testing this is a bit of a challenge. End-to-end (blackbox) testing is no problem, as long as we know what the result should be, or if we do random testing (results should be within 5% of the actual value). Testing parts of the algorithm is a bit of a PITA:

* It is hard to reason about the results (the updated registers) HLL produces.
* Register access code and HLL code are intertwined.

Both the [ClearSpring](https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLog.java) and [AggregateKnowledge](https://github.com/aggregateknowledge/java-hll/blob/master/src/main/java/net/agkn/hll/HLL.java) implementations resort to blackbox testing. I will create some blackbox tests.
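
To make the blackbox idea concrete, here is a rough Python sketch of a test that feeds a known number of distinct values into a sketch and checks that the estimate lands within 5% of the true count. It is not the code from this pull request. The `HyperLogLog` class, the `relative_error` helper, the BLAKE2b-based 64-bit hash and the choice of `p = 14` are all assumptions made for illustration. It implements classic HLL with a 64-bit hash and linear counting for small cardinalities. It does not include the HLL++ empirical bias correction or the sparse encoding.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative classic HyperLogLog: 64-bit hash, linear counting for small ranges."""

    def __init__(self, p=14):
        self.p = p                        # number of bits used to pick a register
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the original HLL paper (valid for m >= 128).
        self.alpha = 0.7213 / (1.0 + 1.079 / self.m)

    @staticmethod
    def _hash64(value):
        # Assumption: any well-mixed 64-bit hash would do; BLAKE2b is used here
        # only because it ships with the standard library and is deterministic.
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        h = self._hash64(value)
        idx = h >> (64 - self.p)                    # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def relative_error(n, p=14):
    """Blackbox check: insert n distinct values (plus duplicates), compare to n."""
    hll = HyperLogLog(p)
    for i in range(n):
        hll.add(f"item-{i}")
    for i in range(0, n, 10):             # duplicates must not change the estimate
        hll.add(f"item-{i}")
    return abs(hll.estimate() - n) / n


for n in (1_000, 100_000):
    print(n, relative_error(n))
```

With `p = 14` the theoretical standard error is about 1.04 / sqrt(2^14), roughly 0.8%, so a 5% tolerance leaves a comfortable margin. A smaller `p` would need a looser bound or would fail occasionally.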