[jira] [Updated] (SPARK-56546) [SQL] Optimize sliding window aggregation with segment tree

Dongjoon Hyun (Jira) Fri, 15 May 2026 12:43:33 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-56546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dongjoon Hyun updated SPARK-56546:
----------------------------------
        Parent: SPARK-54137
    Issue Type: Sub-task  (was: Improvement)

> [SQL] Optimize sliding window aggregation with segment tree
> -----------------------------------------------------------
>
>                 Key: SPARK-56546
>                 URL: https://issues.apache.org/jira/browse/SPARK-56546
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.2.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.2.0
>
>
> h2. Motivation
> Apache Spark's current sliding-window aggregate evaluation recomputes the 
> full frame for every output row when no inverse exists (MIN/MAX, or when the 
> inverse is numerically unstable — see SPARK-36844). This makes W-large 
> sliding frames O(N×W) even though O(N log W) algorithms are well-known in the 
> streaming-analytics literature (Leis et al. 2015, Arasu & Widom 2004).
> Common symptoms: sliding {{MIN/MAX/SUM/AVG/COUNT/STDDEV}} queries over W ≳ 
> 1000 rows become CPU-bound, spill-sensitive (String/binary types), and scale 
> poorly with W. At W = 4001 / 2M rows, the current engine takes ~100 s per 
> sliding pass on modern hardware.
> h2. Proposal
> Introduce a block-chunked segment-tree window function frame that evaluates 
> sliding aggregates in O(log(W)) per output row, gated behind a new SQLConf 
> {{spark.sql.window.segmentTree.enabled}} (default *false* — opt-in) and 
> available from Spark 4.2.0.
> The implementation is internal-only (no public API change, no new SQL 
> syntax); a sibling {{SegmentTreeWindowFunctionFrame}} is selected by 
> {{WindowFunctionFrameFactory}} whenever (1) the frame is a non-invertible 
> sliding range/rows frame, (2) the aggregate is segment-tree-eligible 
> (MIN/MAX/SUM/COUNT/AVG/STDDEV_POP/STDDEV_SAMP/VAR_POP/VAR_SAMP), and (3) the 
> SQLConf is enabled. All other frames fall back to the existing path unchanged.
> Trade-offs are documented inline: for very small W (≤ ~50) the segment-tree's 
> block-chunked build overhead exceeds naive recomputation — those cases are a 
> Pareto-loss zone (see "Caveats" below).
> h2. Expected benefit
> 3-JDK GHA benchmark (JDK 17 / 21 / 25 on Azure EPYC 9V74 80-core, Linux 6.17, 
> OpenJDK LTS), WindowBenchmark resized to a 3–5 s per-iteration baseline:
> || Aggregate || W / N || JDK 17 || JDK 21 || JDK 25 ||
> | MIN (non-invertible) | 1001 / 256K | 8.9× | 9.0× | 9.0× |
> | MAX (non-invertible) | 1001 / 256K | 10.2× | 13.1× | 14.2× |
> | SUM (full recompute) | 1001 / 256K | 9.6× | 9.5× | 9.9× |
> | COUNT | 1001 / 256K | 9.4× | 9.3× | 9.3× |
> | AVG (multi-buffer) | 1001 / 192K | 10.6× | 10.6× | 10.2× |
> | STDDEV_SAMP (multi-buffer, stress) | 1001 / 2M | 18.7× | 19.3× | 18.5× |
> | SUM W-sweep (stress, cross-block) | 4001 / 2M | 32.4× | 32.6× | 29.0× |
> | MAX String spill guard (stress) | 1001 / 1M | 15.7× | 16.4× | 17.6× |
> N-sweep (segtree-only, W=1001, stress, JDK 17 per-row nanoseconds): N=2M → 
> 1358 ns/row, N=8M → 1374 ns/row, N=16M → 1338 ns/row — *sub-linear / 
> near-constant*, refuting any cache-resident hypothesis and matching the O(log 
> W) cost model.
> h3. Caveats (documented, not blockers)
> * W ≤ 50 *Pareto-loss zone*: naive is ~2–3× faster than segtree (expected — 
> segtree build overhead dominates). Covered by explicit {{(stress)}} benchmark 
> cases W=11 / W=51. The feature is opt-in precisely so users/tuners can stay 
> on naive for small-W workloads.
> * W=10 pathological (0.4×): same cause, called out in the design document.
> * Segtree is a secondary path only; *the naive frame is untouched* and 
> remains the default.
> h2. Scope
> * Component: SQL
> * Affects: 4.2.0 (master)
> * No public API / SQL grammar change
> * New SQLConf: {{spark.sql.window.segmentTree.enabled}} (boolean, default 
> false, since 4.2.0)
> * New SQLMetrics on the window operator: segment-tree frame construction 
> counts
> * All existing window tests remain green; a new unit suite, a property-based 
> suite, and {{WindowBenchmark}} (3-JDK GHA) cover the new path.
> h2. Design document
> Consolidated design doc will be linked from the upstream PR description once 
> it is opened.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-56546) [SQL] Optimize sliding window aggregation with segment tree

Reply via email to