Re: Implementation for approx_count_distinct_sketch and associated functions

Ryan Berti Fri, 20 Jan 2023 10:28:04 -0800

Hello,

I wanted to follow up and link to the Spark PR associated with these changes
<https://github.com/apache/spark/pull/39678>. I'm excited to open up the
implementation for community review!


For reference, I worked with @Daniel Tenedorio
<daniel.tenedo...@databricks.com> and the Databricks team on a pre-review
<https://github.com/RyanBerti/spark/pull/1>, which yielded some interesting
discussions about the existing HyperLogLogPlusPlus implementation. I think
we're in agreement that having a cross-compatible HLL++ implementation
would be super valuable, though I didn't attempt to take that work on in
this PR. I've included a format identifier in this implementation's HLL++
sketches to set us up for migrating to a cross-compatible sketch format /
HLL++ implementation in the future.

Thanks

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028



On Wed, Jan 11, 2023 at 3:23 PM Ryan Berti <rbe...@netflix.com> wrote:

> Hello!
>
> I've recently wanted to write the sketches associated with the
> approx_count_distinct function to allow for distinct count re-aggregation.
> This 2019 Databricks post
> <https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html>
> proposes the use of spark-alchemy, and I've also seen other discussions which
> propose building custom UDAFs/UDFs to achieve the desired effect. Trino
> supports re-aggregating HLL sketches
> <https://trino.io/docs/current/functions/hyperloglog.html> natively, and
> I figured Spark should also provide this functionality natively.
>
> After searching the Spark JIRA and this dev list, I found a few requests
> for this functionality:
>
>    - Here's a ticket that was closed (due to inactivity?) in 2019
>    <https://issues.apache.org/jira/browse/SPARK-16484>, where there
>    seemed to be agreement that adding the requested implementation would be
>    simple
>    - Here's a discussion in this dev list from 2015
>    <https://lists.apache.org/thread/pqjopxh897wx9b39y2tg1g4bot2d86df>,
>    which focuses on implementing the functionality via legacy(?) APIs, and
>    interoperability between HLL implementations.
>
> I've implemented two new agg functions and a new misc function
> <https://github.com/RyanBerti/spark/pull/1> that handle HLL++ sketches,
> and I'd like to open my implementation up for review. Can someone help me
> re-open SPARK-16484 <https://issues.apache.org/jira/browse/SPARK-16484>,
> and then I'll move forward with opening a PR against the main Spark repo?
>
> Thanks!
>
> Ryan Berti
>
> Senior Data Engineer  |  Ads DE
>
> M 7023217573
>
> 5808 W Sunset Blvd  |  Los Angeles, CA 90028
>
>
