sesteves opened a new pull request, #20968:
URL: https://github.com/apache/datafusion/pull/20968

   ## Which issue does this PR close?
   
   Closes #20967.
   
   ## Rationale
   
   Finding the most frequently occurring values in a column is a very common 
analytical pattern — top error codes, most popular products, most active users, 
frequent URL paths, etc. The exact approach (`GROUP BY value ORDER BY COUNT(*) 
DESC LIMIT k`) requires materializing all distinct groups, which is 
memory-intensive and slow on high-cardinality columns.
   
   `approx_top_k` solves this with a streaming approximation algorithm that 
operates in bounded memory, making it practical for large-scale analytics.
   
   This function is already available in ClickHouse (`topK`) and via extensions 
in PostgreSQL and Druid, and is a natural addition to DataFusion's existing 
family of approximate aggregate functions (`approx_distinct`, `approx_median`, 
`approx_percentile_cont`).
   
   ## What changes are included in this PR?
   
   ### Core implementation 
(`datafusion/functions-aggregate/src/approx_top_k.rs`)
   
   - **`SpaceSavingSummary`** — Implements the Filtered Space-Saving algorithm 
(Metwally et al., 2005):
     - Fixed-size counter summary with eviction of minimum-count items when full
     - **Alpha map** for improved accuracy — tracks recently evicted items and 
filters low-frequency noise before it enters the main summary (same approach 
used by ClickHouse)
     - `CAPACITY_MULTIPLIER = 3` (matching ClickHouse's default) — the summary 
tracks `k × 3` counters internally
     - Merge support for parallel/distributed execution
     - Binary serialization/deserialization for intermediate state transfer
   
   - **`ApproxTopK`** — `AggregateUDFImpl` with `#[user_doc]` documentation, 
supporting any hashable scalar type
   
   - **`ApproxTopKAccumulator`** — Accumulator with `update_batch`, 
`merge_batch`, `evaluate` (returns `List(Struct({value: T, count: UInt64}))` 
ordered by count descending), and `state` (serialized summary + data type)
   
   - **9 unit tests** covering basic operations, eviction, alpha map, merge, 
serialization, accumulator update/evaluate, merge_batch, and a distributed 
merge simulation
   
   ### Registration (`datafusion/functions-aggregate/src/lib.rs`)
   - Added module, `expr_fn` re-export, and registration in 
`all_default_aggregate_functions()`
   
   ### SQL logic tests (`datafusion/sqllogictest/test_files/approx_top_k.slt`)
   - 8 tests: basic string/int, GROUP BY, k=1, NULL handling, error cases 
(invalid k), WHERE clause filtering
   
   ### Documentation
   - `docs/source/user-guide/sql/aggregate_functions.md` — TOC entry + full 
documentation section
   - `docs/source/user-guide/expressions.md` — quick-reference table entry
   
   ### Integration tests
   - **Proto roundtrip** 
(`datafusion/proto/tests/cases/roundtrip_logical_plan.rs`) — 
`approx_top_k(col("a"), lit(3))`
   - **DataFrame** (`datafusion/core/tests/dataframe/dataframe_functions.rs`) — 
`test_fn_approx_top_k` with `insta::assert_snapshot!`
   
   ## Are these changes tested?
   
   Yes:
   - 9 unit tests in `approx_top_k.rs` (algorithm correctness, serialization, 
accumulator lifecycle, distributed merge)
   - 8 SQL logic tests in `approx_top_k.slt` (end-to-end SQL behavior including 
edge cases and errors)
   - 1 DataFrame integration test with snapshot assertion
   - 1 proto roundtrip test
   
   ## Are there any user-facing changes?
   
   Yes — new `approx_top_k` aggregate function available in SQL and the 
DataFrame API:
   
   ```sql
   -- Top 5 most frequent values (k defaults to 5)
   SELECT approx_top_k(url_path) FROM access_logs;
   
   -- Top 3 per group
   SELECT region, approx_top_k(product_id, 3) FROM sales GROUP BY region;
   ```
   
   ```rust
   // DataFrame API
   use datafusion::functions_aggregate::approx_top_k::approx_top_k;
   df.aggregate(vec![], vec![approx_top_k(col("url_path"), lit(5))])?;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to