[ https://issues.apache.org/jira/browse/SPARK-52407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17956433#comment-17956433 ]
Christopher Boumalhab commented on SPARK-52407:
-----------------------------------------------

https://datasketches.apache.org/docs/Theta/ThetaSketches.html

> Add Native Support for Apache Theta Sketches
> --------------------------------------------
>
>                 Key: SPARK-52407
>                 URL: https://issues.apache.org/jira/browse/SPARK-52407
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, Spark Core, SQL
>    Affects Versions: 3.4.4
>            Reporter: Christopher Boumalhab
>            Priority: Minor
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> *Add Theta Sketches Support to Spark SQL*
> This proposal aims to integrate Apache DataSketches' Theta Sketches into
> Spark SQL and the DataFrame APIs. While Spark already includes support for
> HLL sketches via {{hll_sketch_*}} functions, Theta Sketches provide
> additional capabilities not covered by HLL, such as set operations
> (intersection, difference) and value aggregation by key.
> *Motivation:*
> * Enable scalable and memory-efficient approximate set operations for
> large-scale analytics.
> * Support high-cardinality distinct counts, approximate aggregations over
> batch and streaming data, and set-based operations across datasets.
> * Leverage Apache DataSketches, which is already an Apache project and a
> dependency within Spark.
> *Proposed Features:*
> * New aggregate functions for Theta Sketches, including:
> ** {{theta_sketch_agg(col)}}: build Theta sketches
> ** {{theta_union(sketch1, sketch2)}} and {{theta_union_agg(sketch_col)}}:
> union operations
> ** {{theta_intersection(sketch1, sketch2)}} and
> {{theta_intersection_agg(sketch_col)}}: intersection operations
> ** {{theta_difference(sketch1, sketch2)}}: difference operation
> ** {{theta_sketch_estimate(sketch)}}: estimate cardinality
> * Similar functions to support Tuple Sketches, prioritized after Theta
> sketches are integrated.
> *Implementation Notes:*
> * Follow naming and design conventions established by existing HLL sketch
> UDFs.
> * Engage with the Apache DataSketches community for technical guidance and
> cross-project synergy.
> This enhancement will enable Spark users to perform advanced approximate
> analytics with improved performance and scalability, complementing existing
> approximate functions.
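
To make the proposed set operations concrete, here is a minimal, illustrative sketch in plain Python of how a Theta-style (k-minimum-values) sketch can support union, intersection, difference, and cardinality estimation. The function names mirror the ones proposed in the issue, but this is not Spark or Apache DataSketches code: the class layout, the BLAKE2b hashing, the integer threshold representation, and the default k = 4096 are all assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass

MAX_HASH = 2 ** 64  # hashes are integers in [0, MAX_HASH)


def _hash64(value) -> int:
    """Map a value to a 64-bit integer (BLAKE2b is an assumption, not the library's hash)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


@dataclass(frozen=True)
class ThetaSketch:
    theta: int         # sampling threshold; MAX_HASH means nothing was discarded
    hashes: frozenset  # retained hashes, all strictly below theta

    def estimate(self) -> float:
        # Each retained hash represents MAX_HASH / theta distinct items.
        return len(self.hashes) * MAX_HASH / self.theta


def _trim(theta: int, hashes, k: int) -> ThetaSketch:
    """Keep hashes below theta; if more than k remain, lower theta to the (k+1)-th smallest."""
    kept = {h for h in hashes if h < theta}
    if len(kept) > k:
        ordered = sorted(kept)
        theta = ordered[k]
        kept = set(ordered[:k])
    return ThetaSketch(theta, frozenset(kept))


def theta_sketch_agg(values, k: int = 4096) -> ThetaSketch:
    """Build a sketch from an iterable of values."""
    return _trim(MAX_HASH, {_hash64(v) for v in values}, k)


def theta_union(a: ThetaSketch, b: ThetaSketch, k: int = 4096) -> ThetaSketch:
    return _trim(min(a.theta, b.theta), a.hashes | b.hashes, k)


def theta_intersection(a: ThetaSketch, b: ThetaSketch) -> ThetaSketch:
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes & b.hashes if h < theta))


def theta_difference(a: ThetaSketch, b: ThetaSketch) -> ThetaSketch:
    """Items in a but not in b."""
    theta = min(a.theta, b.theta)
    return ThetaSketch(theta, frozenset(h for h in a.hashes - b.hashes if h < theta))


def theta_sketch_estimate(sketch: ThetaSketch) -> float:
    return sketch.estimate()


if __name__ == "__main__":
    a = theta_sketch_agg(range(0, 100))
    b = theta_sketch_agg(range(50, 150))
    print(theta_sketch_estimate(theta_union(a, b)))
    print(theta_sketch_estimate(theta_intersection(a, b)))
    print(theta_sketch_estimate(theta_difference(a, b)))
```

While a sketch holds at most k distinct hashes, nothing is discarded and the estimates are exact; beyond that, the threshold drops and the results become approximations whose error shrinks as k grows. All set operations work on the common (smaller) threshold of the two inputs, which is what lets intersections and differences be computed from sketches alone, something the existing HLL functions do not offer.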