Hello,

I wanted to follow up with a link to the Spark PR associated with these changes <https://github.com/apache/spark/pull/39678>. I'm excited to open up the implementation for community review!
For reference, I worked with @Daniel Tenedorio <daniel.tenedo...@databricks.com> and the Databricks team on a pre-review <https://github.com/RyanBerti/spark/pull/1>, which yielded some interesting discussions about the existing HyperLogLogPlusPlus implementation. I think we're in agreement that having a cross-compatible HLL++ implementation would be super valuable, though I didn't attempt to take that work on in this PR. I've included a format identifier in this implementation's HLL++ sketches to set us up for migrating to a cross-compatible sketch format / HLL++ implementation in the future.

Thanks
Ryan Berti
Senior Data Engineer | Ads DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028

On Wed, Jan 11, 2023 at 3:23 PM Ryan Berti <rbe...@netflix.com> wrote:

> Hello!
>
> I've recently wanted to write the sketches associated with the
> approx_count_distinct function to allow for distinct count re-aggregation.
> This 2019 databricks post
> <https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html>
> proposes the use of spark-alchemy, and I've also seen other discussions
> which propose building custom UDAFs/UDFs to achieve the desired effect.
> Trino supports re-aggregating HLL sketches
> <https://trino.io/docs/current/functions/hyperloglog.html> natively, and
> I figured Spark should also provide this functionality natively.
>
> After searching the Spark JIRA and this dev list, I found a few requests
> for this functionality:
>
>    - Here's a ticket that was closed (due to inactivity?) in 2019
>    <https://issues.apache.org/jira/browse/SPARK-16484>, where there
>    seemed to be agreement that adding the requested implementation would
>    be simple
>    - Here's a discussion in this dev list from 2015
>    <https://lists.apache.org/thread/pqjopxh897wx9b39y2tg1g4bot2d86df>,
>    which focuses on implementing the functionality via legacy(?) APIs,
>    and interoperability between HLL implementations.
>
> I've implemented two new agg functions and a new misc function
> <https://github.com/RyanBerti/spark/pull/1> that handle HLL++ sketches,
> and I'd like to open my implementation up for review. Can someone help me
> re-open SPARK-16484 <https://issues.apache.org/jira/browse/SPARK-16484>,
> and then I'll move forward with opening a PR against the main spark repo?
>
> Thanks!
>
> Ryan Berti