Hello! I've recently wanted to write out the sketches that back the approx_count_distinct function, so that distinct counts can be re-aggregated later. This 2019 Databricks post <https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html> proposes using spark-alchemy, and I've also seen other discussions that propose building custom UDAFs/UDFs to achieve the same effect. Trino natively supports re-aggregating HLL sketches <https://trino.io/docs/current/functions/hyperloglog.html>, and I figured Spark should provide this functionality natively as well.
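
As a rough illustration of why persisted sketches make re-aggregation possible, here is a minimal Python sketch of a mergeable HyperLogLog structure. It is not the implementation in the PR linked below, not spark-alchemy, and not the Spark or Trino API: the class name HllSketch and its methods are hypothetical, the hash (SHA-1 truncated to 64 bits) and precision (p=12) are arbitrary choices, and it uses the classic HyperLogLog estimator rather than HLL++ (no bias correction, no sparse representation).

```python
import hashlib
import math


class HllSketch:
    """Toy HyperLogLog sketch (hypothetical name; classic estimator, not HLL++)."""

    def __init__(self, p=12):
        self.p = p                     # number of bits used to pick a register
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from the item's string form (assumed hash choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Re-aggregation: the sketch of a union is the register-wise maximum.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HllSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # constant valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two partial aggregates (for example, two days of user ids) with overlap.
day1 = HllSketch()
for user_id in range(0, 10_000):
    day1.add(user_id)

day2 = HllSketch()
for user_id in range(5_000, 15_000):
    day2.add(user_id)

# Merging the stored sketches estimates the distinct count of the union
# without rescanning the raw data.
combined = day1.merge(day2)
print(round(combined.estimate()))  # approximate count of the 15,000 distinct ids
```

The point of the example is the merge step: summing the two per-day distinct counts would double-count the overlapping ids, whereas merging the sketches yields the same registers as a single sketch built over all the data.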

After searching the Spark JIRA and this dev list, I found a few requests for this functionality:

- Here's a ticket that was closed (due to inactivity?) in 2019 <https://issues.apache.org/jira/browse/SPARK-16484>, where there seemed to be agreement that adding the requested implementation would be simple.
- Here's a discussion on this dev list from 2015 <https://lists.apache.org/thread/pqjopxh897wx9b39y2tg1g4bot2d86df>, which focuses on implementing the functionality via legacy(?) APIs and on interoperability between HLL implementations.

I've implemented two new agg functions and a new misc function <https://github.com/RyanBerti/spark/pull/1> that handle HLL++ sketches, and I'd like to open my implementation up for review. Can someone help me re-open SPARK-16484 <https://issues.apache.org/jira/browse/SPARK-16484>? Once that's done, I'll move forward with opening a PR against the main Spark repo.

Thanks!

Ryan Berti
Senior Data Engineer | Ads DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028