[GitHub] [spark] RyanBerti commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator

2023-05-03 Thread via GitHub


RyanBerti commented on PR #39678:
URL: https://github.com/apache/spark/pull/39678#issuecomment-1533691734

   Closing in favor of https://github.com/apache/spark/pull/40615


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] RyanBerti commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator

2023-01-21 Thread via GitHub


RyanBerti commented on PR #39678:
URL: https://github.com/apache/spark/pull/39678#issuecomment-1399332706

   Hi @dtenedor and @huaxingao 
   
   Thanks for the input! I agree with you both that migrating Spark's existing 
HLL++ implementation to use the Apache Datasketches library would be ideal. 
Unfortunately, migrating to another HLL++ implementation would require a bit 
more work, when compared with exposing the sketch associated with the existing 
implementation. Here are some of my initial thoughts on what it would take to 
migrate Spark's existing HLL++ implementations to use the Apache Datasketches 
library:
   
   - Add [datasketches-java](https://github.com/apache/datasketches-java) as a 
new dependency to Spark
   - Add one or more new AggregateFunctions which forward calls from 
ImperativeAggregate.update and Expression.eval to a [Datasketch HllSketch 
instance](https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/hll/HllSketch.html),
 and forward calls from ImperativeAggregate.merge to a [Datasketch Union 
instance](https://datasketches.apache.org/api/java/snapshot/apidocs/org/apache/datasketches/hll/Union.html).
 I think there's technical complexity/unknowns involved in instantiating the 
HllSketch and Union instances such that they store their intermediate state in 
the AggregationFunction's aggregation buffer - this would likely involve 
wrapping the agg buffer in an implementation of Datasketch's [WritableMemory 
interface](https://datasketches.apache.org/api/memory/snapshot/apidocs/org/apache/datasketches/memory/WritableMemory.html).
   - Provide a mechanism for existing users to utilize the new Datasketches 
based HLL implementation for approx_count_distinct + 
approx_distinct_count_for_intervals (and utilize the existing 
HyperLogLogPlusPlus implementation by default?)
   
   I'm not as familiar with Theta Sketches, but it sounds like that sketch 
format would be best for cross-compatibility between 
Spark/Presto+Trino/Iceberg, though I'm not sure if there's additional work to 
support Theta Sketches when compared with HLL sketches. 
   
   Given the increased scope, I'm not sure I can commit to working through this 
migration in the short term. My opinion is that exposing the existing HLL 
sketches would still be beneficial, and would not block the migration to Apache 
Datasketches in the future. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] RyanBerti commented on pull request #39678: [SPARK-16484][SQL] Add HyperLogLogPlusPlus sketch generator/evaluator/aggregator

2023-01-20 Thread via GitHub


RyanBerti commented on PR #39678:
URL: https://github.com/apache/spark/pull/39678#issuecomment-1398762642

   For reference, @dtenedor worked with me on a pre-review of these changes; 
relevant discussions are available in this PR in my fork: 
https://github.com/RyanBerti/spark/pull/1


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org