[
https://issues.apache.org/jira/browse/SPARK-56546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-56546:
----------------------------------
Parent: SPARK-54137
Issue Type: Sub-task (was: Improvement)
> [SQL] Optimize sliding window aggregation with segment tree
> -----------------------------------------------------------
>
> Key: SPARK-56546
> URL: https://issues.apache.org/jira/browse/SPARK-56546
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 4.2.0
> Reporter: Kent Yao
> Assignee: Kent Yao
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.2.0
>
>
> h2. Motivation
> Apache Spark's current sliding-window aggregate evaluation recomputes the
> full frame for every output row when no inverse exists (MIN/MAX, or when the
> inverse is numerically unstable — see SPARK-36844). This makes W-large
> sliding frames O(N×W) even though O(N log W) algorithms are well-known in the
> streaming-analytics literature (Leis et al. 2015, Arasu & Widom 2004).
> Common symptoms: sliding {{MIN/MAX/SUM/AVG/COUNT/STDDEV}} queries over W ≳
> 1000 rows become CPU-bound, spill-sensitive (String/binary types), and scale
> poorly with W. At W = 4001 / 2M rows, the current engine takes ~100 s per
> sliding pass on modern hardware.
> h2. Proposal
> Introduce a block-chunked segment-tree window function frame that evaluates
> sliding aggregates in O(log(W)) per output row, gated behind a new SQLConf
> {{spark.sql.window.segmentTree.enabled}} (default *false* — opt-in) and
> available from Spark 4.2.0.
> The implementation is internal-only (no public API change, no new SQL
> syntax); a sibling {{SegmentTreeWindowFunctionFrame}} is selected by
> {{WindowFunctionFrameFactory}} whenever (1) the frame is a non-invertible
> sliding range/rows frame, (2) the aggregate is segment-tree-eligible
> (MIN/MAX/SUM/COUNT/AVG/STDDEV_POP/STDDEV_SAMP/VAR_POP/VAR_SAMP), and (3) the
> SQLConf is enabled. All other frames fall back to the existing path unchanged.
> Trade-offs are documented inline: for very small W (≤ ~50) the segment-tree's
> block-chunked build overhead exceeds naive recomputation — those cases are a
> Pareto-loss zone (see "Caveats" below).
> h2. Expected benefit
> 3-JDK GHA benchmark (JDK 17 / 21 / 25 on Azure EPYC 9V74 80-core, Linux 6.17,
> OpenJDK LTS), WindowBenchmark resized to a 3–5 s per-iteration baseline:
> || Aggregate || W / N || JDK 17 || JDK 21 || JDK 25 ||
> | MIN (non-invertible) | 1001 / 256K | 8.9× | 9.0× | 9.0× |
> | MAX (non-invertible) | 1001 / 256K | 10.2× | 13.1× | 14.2× |
> | SUM (full recompute) | 1001 / 256K | 9.6× | 9.5× | 9.9× |
> | COUNT | 1001 / 256K | 9.4× | 9.3× | 9.3× |
> | AVG (multi-buffer) | 1001 / 192K | 10.6× | 10.6× | 10.2× |
> | STDDEV_SAMP (multi-buffer, stress) | 1001 / 2M | 18.7× | 19.3× | 18.5× |
> | SUM W-sweep (stress, cross-block) | 4001 / 2M | 32.4× | 32.6× | 29.0× |
> | MAX String spill guard (stress) | 1001 / 1M | 15.7× | 16.4× | 17.6× |
> N-sweep (segtree-only, W=1001, stress, JDK 17 per-row nanoseconds): N=2M →
> 1358 ns/row, N=8M → 1374 ns/row, N=16M → 1338 ns/row — *sub-linear /
> near-constant*, refuting any cache-resident hypothesis and matching the O(log
> W) cost model.
> h3. Caveats (documented, not blockers)
> * W ≤ 50 *Pareto-loss zone*: naive is ~2–3× faster than segtree (expected —
> segtree build overhead dominates). Covered by explicit {{(stress)}} benchmark
> cases W=11 / W=51. The feature is opt-in precisely so users/tuners can stay
> on naive for small-W workloads.
> * W=10 pathological (0.4×): same cause, called out in the design document.
> * Segtree is a secondary path only; *the naive frame is untouched* and
> remains the default.
> h2. Scope
> * Component: SQL
> * Affects: 4.2.0 (master)
> * No public API / SQL grammar change
> * New SQLConf: {{spark.sql.window.segmentTree.enabled}} (boolean, default
> false, since 4.2.0)
> * New SQLMetrics on the window operator: segment-tree frame construction
> counts
> * All existing window tests remain green; a new unit suite, a property-based
> suite, and {{WindowBenchmark}} (3-JDK GHA) cover the new path.
> h2. Design document
> Consolidated design doc will be linked from the upstream PR description once
> it is opened.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]