Implementation for approx_count_distinct_sketch and associated functions

Ryan Berti Wed, 11 Jan 2023 15:24:48 -0800

Hello!

I've recently wanted to write out the sketches that back the
approx_count_distinct function so that distinct counts can be re-aggregated.
This 2019 Databricks post
<https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html>
proposes using spark-alchemy, and I've also seen other discussions that
propose building custom UDAFs/UDFs to achieve the same effect. Trino
supports re-aggregating HLL sketches
<https://trino.io/docs/current/functions/hyperloglog.html> natively, and I
figured Spark should provide this functionality natively as well.
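
To illustrate why keeping the sketch (rather than only the final count)
matters for re-aggregation, here is a minimal, self-contained sketch of plain
HyperLogLog in Python. It is illustrative only: it is not Spark's HLL++
implementation and not the code in the linked PR, and every name in it
(new_sketch, add, merge, estimate, the precision P) is hypothetical. The point
it shows is that two sketches merge by taking the register-wise maximum, which
yields exactly the sketch of the combined data, whereas two distinct counts
cannot simply be added.

```python
import hashlib
import math

P = 12        # assumed precision: 2**P registers (illustrative choice)
M = 1 << P


def _hash64(value):
    """Deterministic 64-bit hash of a value (first 8 bytes of SHA-1)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                      # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1 bit
    sketch[idx] = max(sketch[idx], rank)


def merge(a, b):
    """Re-aggregate two sketches: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)       # small-range (linear counting)
    return raw


# Two overlapping "daily" user sets: 6000 ids each, 3000 shared, 9000 total.
day1, day2, both = new_sketch(), new_sketch(), new_sketch()
for user in range(0, 6000):
    add(day1, user)
    add(both, user)
for user in range(3000, 9000):
    add(day2, user)
    add(both, user)

merged = merge(day1, day2)
print(merged == both)           # merging equals sketching the union directly
print(round(estimate(merged)))  # approximate distinct count of the union
```

Adding the two daily estimates would report roughly 12000 users, while the
merged sketch estimates the true union of 9000. That is the re-aggregation
use case described above.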


After searching the Spark JIRA and this dev list, I found a few requests
for this functionality:

   - Here's a ticket that was closed (due to inactivity?) in 2019
   <https://issues.apache.org/jira/browse/SPARK-16484>, where there seemed
   to be agreement that adding the requested implementation would be simple
   - Here's a discussion in this dev list from 2015
   <https://lists.apache.org/thread/pqjopxh897wx9b39y2tg1g4bot2d86df>,
   which focuses on implementing the functionality via legacy(?) APIs, and
   interoperability between HLL implementations.

I've implemented two new aggregate functions and a new misc function
<https://github.com/RyanBerti/spark/pull/1> that handle HLL++ sketches, and
I'd like to open my implementation up for review. Could someone help me
re-open SPARK-16484 <https://issues.apache.org/jira/browse/SPARK-16484>?
Once it is re-opened, I'll open a PR against the main Spark repo.

Thanks!

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028
