[ 
https://issues.apache.org/jira/browse/SPARK-55939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bo Xiong updated SPARK-55939:
-----------------------------
    Description: 
h3. Motivation

Spark SQL already provides built-in support for several Apache DataSketches 
algorithms:
 * HyperLogLog (HLL) sketches for approximate count distinct
 * Theta sketches for count distinct with set operations
 * Tuple sketches for count distinct with aggregated summaries
 * KLL sketches for approximate quantiles
 * Top-K sketches for approximate top-K items

This PR adds built-in support for the *DataSketches ItemsSketch* (Frequent 
Items), which tracks the approximate frequency of items in a data stream. 
Unlike {{approx_top_k}} which only returns the top-K items, ItemsSketch 
provides:
 * Frequency estimates for any item (not just the top-K)
 * Configurable error guarantees ({{{}NO_FALSE_POSITIVES{}}} vs 
{{{}NO_FALSE_NEGATIVES{}}})
 * Mergeable binary representations for multi-level rollup aggregation
 * Support for a wide range of data types (boolean, numeric, string, decimal, 
date, timestamp)

h3. Use Cases
 * {*}Finding top sellers across time horizons{*}: Build daily per-category 
sketches, then merge across days to find weekly/monthly top sellers without 
re-scanning raw data.
 * {*}Frequency estimation in streaming{*}: Maintain running frequency sketches 
that can be merged across micro-batches.
 * {*}Multi-dimensional rollup{*}: Build sketches at child-level dimensions and 
merge up to parent dimensions.

  was:
## Description


### Motivation


Spark SQL already provides built-in support for several Apache DataSketches 
algorithms:
- HyperLogLog (HLL) sketches for approximate count distinct
- Theta sketches for count distinct with set operations
- Tuple sketches for count distinct with aggregated summaries
- KLL sketches for approximate quantiles
- Top-K sketches for approximate top-K items


This PR adds built-in support for the **DataSketches ItemsSketch** (Frequent 
Items), which tracks the approximate frequency of items in a data stream. 
Unlike `approx_top_k` which only returns the top-K items, ItemsSketch provides:
- Frequency estimates for any item (not just the top-K)
- Configurable error guarantees (`NO_FALSE_POSITIVES` vs `NO_FALSE_NEGATIVES`)
- Mergeable binary representations for multi-level rollup aggregation
- Support for a wide range of data types (boolean, numeric, string, decimal, 
date, timestamp)


### Use Cases


- **Finding top sellers across time horizons**: Build daily per-category 
sketches, then merge across days to find weekly/monthly top sellers without 
re-scanning raw data.
- **Frequency estimation in streaming**: Maintain running frequency sketches 
that can be merged across micro-batches.
- **Multi-dimensional rollup**: Build sketches at child-level dimensions and 
merge up to parent dimensions.


### Functions Added (6 total)


**Aggregate functions:**
1. `items_sketch_agg(expr [, maxMapSize])` — Builds an ItemsSketch from input 
values. Returns binary.
2. `items_sketch_merge_agg(sketch [, maxMapSize])` — Merges pre-built 
ItemsSketch binaries into one. Returns binary.


**Scalar functions:**
3. `items_sketch_get_frequent_items(sketch, errorType)` — Returns frequent 
items as an array of structs `{item, estimate, lowerBound, upperBound}`. 
`errorType` must be `'NO_FALSE_POSITIVES'` or `'NO_FALSE_NEGATIVES'`.
4. `items_sketch_get_estimate(sketch, item)` — Returns the estimated frequency 
of a specific item.
5. `items_sketch_merge(first, second)` — Merges two ItemsSketch binaries 
(scalar).
6. `items_sketch_to_string(sketch)` — Returns a human-readable summary string.


### Parameters


- `maxMapSize`: Controls sketch accuracy. Must be a power of 2, range [8, 
67108864]. Default: 4096. Larger values use more memory but provide more 
accurate frequency estimates. The effective capacity is `0.75 * maxMapSize`.


> Add built-in DataSketches ItemsSketch (Frequent Items) functions to Spark SQL
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-55939
>                 URL: https://issues.apache.org/jira/browse/SPARK-55939
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Bo Xiong
>            Assignee: Bo Xiong
>            Priority: Major
>              Labels: sketch,most-frequent,top-k
>             Fix For: 4.2.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> h3. Motivation
> Spark SQL already provides built-in support for several Apache DataSketches 
> algorithms:
>  * HyperLogLog (HLL) sketches for approximate count distinct
>  * Theta sketches for count distinct with set operations
>  * Tuple sketches for count distinct with aggregated summaries
>  * KLL sketches for approximate quantiles
>  * Top-K sketches for approximate top-K items
> This PR adds built-in support for the *DataSketches ItemsSketch* (Frequent 
> Items), which tracks the approximate frequency of items in a data stream. 
> Unlike {{approx_top_k}} which only returns the top-K items, ItemsSketch 
> provides:
>  * Frequency estimates for any item (not just the top-K)
>  * Configurable error guarantees ({{{}NO_FALSE_POSITIVES{}}} vs 
> {{{}NO_FALSE_NEGATIVES{}}})
>  * Mergeable binary representations for multi-level rollup aggregation
>  * Support for a wide range of data types (boolean, numeric, string, decimal, 
> date, timestamp)
> h3. Use Cases
>  * {*}Finding top sellers across time horizons{*}: Build daily per-category 
> sketches, then merge across days to find weekly/monthly top sellers without 
> re-scanning raw data.
>  * {*}Frequency estimation in streaming{*}: Maintain running frequency 
> sketches that can be merged across micro-batches.
>  * {*}Multi-dimensional rollup{*}: Build sketches at child-level dimensions 
> and merge up to parent dimensions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to