RyanBerti commented on PR #39678:
URL: https://github.com/apache/spark/pull/39678#issuecomment-1399332706
Hi @dtenedor and @huaxingao
Thanks for the input! I agree with you both that migrating Spark's existing
HLL++ implementation to use the Apache Datasketches library would be ideal.
Unfortunately, migrating to another HLL++ implementation would require a bit
more work, when compared with exposing the sketch associated with the existing
implementation. Here are some of my initial thoughts on what it would take to
migrate Spark's existing HLL++ implementations to use the Apache Datasketches
library:
- Add [datasketches-java](https://github.com/apache/datasketches-java) as a
new dependency to Spark
- Add one or more new AggregateFunctions which forward calls from
ImperativeAggregate.update and Expression.eval to a [Datasketch HllSketch
instance](https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/hll/HllSketch.html),
and forward calls from ImperativeAggregate.merge to a [Datasketch Union
instance](https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/hll/Union.html).
I think there's technical complexity/unknowns involved in instantiating the
HllSketch and Union instances such that they store their intermediate state in
the AggregationFunction's aggregation buffer - this would likely involve
wrapping the agg buffer in an implementation of Datasketch's [WritableMemory
interface](https://datasketches.apache.org/api/memory/snapshot/apidocs/org/apache/datasketches/memory/WritableMemory.html).
- Provide a mechanism for existing users to utilize the new Datasketches
based HLL implementation for approx_count_distinct +
approx_distinct_count_for_intervals (and utilize the existing
HyperLogLogPlusPlus implementation by default?)
I'm not as familiar with Theta Sketches, but it sounds like that sketch
format would be best for cross-compatibility between
Spark/Presto+Trino/Iceberg, though I'm not sure if there's additional work to
support Theta Sketches when compared with HLL sketches.
Given the increased scope, I'm not sure I can commit to working through this
migration in the short term. My opinion is that exposing the existing HLL
sketches would still be beneficial, and would not block the migration to Apache
Datasketches in the future.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org