[ https://issues.apache.org/jira/browse/SPARK-54785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Christopher Boumalhab updated SPARK-54785:
------------------------------------------
Description:
This PR introduces three new aggregate functions (`kll_merge_agg_bigint`,
`kll_merge_agg_float`, `kll_merge_agg_double`) that enable efficient merging of
multiple KLL sketch binaries across rows. While the existing scalar
`kll_sketch_merge_*` functions can only merge two sketches at a time, these new
aggregate variants can merge an arbitrary number of pre-computed sketches from
different partitions, time windows, or datasets in a single query. This is
essential for distributed analytics workflows where sketches are computed
independently and later aggregated to obtain global quantile estimates. The
implementation follows the same design patterns as the existing KLL aggregate
functions, accepting an optional k parameter and properly handling NULL values.
https://github.com/apache/spark/pull/53548
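
To illustrate the difference between a pairwise scalar merge and an aggregate merge, here is a minimal Python sketch. It is not the Spark implementation and not a real KLL sketch: `ToySketch` and `merge_agg` are hypothetical names, and the toy summary simply keeps every value sorted. The description does not spell out how NULLs are handled, so skipping NULL rows and returning `None` when every row is NULL are assumptions made for the example.

```python
from functools import reduce
from typing import Iterable, List, Optional


class ToySketch:
    """Toy mergeable quantile summary (hypothetical stand-in, NOT a KLL sketch).

    It keeps every value in sorted order, so it is exact rather than
    approximate; only the merge behaviour is of interest here.
    """

    def __init__(self, values: Iterable[int] = ()) -> None:
        self.values: List[int] = sorted(values)

    def merge(self, other: "ToySketch") -> "ToySketch":
        # Pairwise merge: combines exactly two summaries into a new one.
        return ToySketch(self.values + other.values)

    def quantile(self, rank: float) -> int:
        if not self.values:
            raise ValueError("empty sketch")
        idx = min(int(rank * len(self.values)), len(self.values) - 1)
        return self.values[idx]


def merge_agg(sketches: Iterable[Optional[ToySketch]]) -> Optional[ToySketch]:
    """Fold any number of summaries into one.

    Assumption for this example: None (NULL) rows are skipped, and the
    result is None when no non-NULL row exists.
    """
    present = [s for s in sketches if s is not None]
    if not present:
        return None
    return reduce(lambda acc, s: acc.merge(s), present)


# One pre-computed summary per partition, with one NULL row.
rows = [ToySketch([1, 5, 9]), None, ToySketch([2, 4]), ToySketch([7])]
merged = merge_agg(rows)
print(merged.values, merged.quantile(0.5))
```

The aggregate form is a fold over the pairwise merge, which is why it can combine an arbitrary number of rows in one pass instead of nesting two-argument calls.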
> Add support for binary sketch aggregations in KLL
> -------------------------------------------------
>
> Key: SPARK-54785
> URL: https://issues.apache.org/jira/browse/SPARK-54785
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.0, 4.1.1
> Reporter: Christopher Boumalhab
> Priority: Minor