Kent Yao created SPARK-56546:
--------------------------------
Summary: [SQL] Optimize sliding window aggregation with segment
tree
Key: SPARK-56546
URL: https://issues.apache.org/jira/browse/SPARK-56546
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.2.0
Reporter: Kent Yao
h3. Background
Spark's {{SlidingWindowFunctionFrame}} (sql/core,
{{WindowFunctionFrame.scala}}) recomputes the aggregate over the entire frame
for every output row in O(n*W) time, where n is the partition size and W is the
frame width. This is quadratic in W for sliding-window workloads (e.g., rolling
MAX, AVG, STDDEV over sensor time-series or financial tick data) and dominates
wall-clock on windows wider than a few hundred rows.
Prior art: Leis et al., "Efficient Processing of Window Functions in Analytical
SQL Queries" (VLDB 2015) -- a segment tree per partition gives O(n*log(W)) for
any mergeable (associative) aggregate, including non-invertible ones (MIN, MAX,
generic {{DeclarativeAggregate}} with {{mergeExpressions}}). DuckDB and Umbra
use the same technique.
h3. Proposal
Introduce a segment-tree-backed {{WindowFunctionFrame}} path for bounded
sliding ROWS and RANGE frames over {{DeclarativeAggregate}} functions
(non-invertible and invertible alike). The new path is gated by a default-off
SQLConf so it is opt-in and fully reversible.
Key elements:
* New {{WindowSegmentTree}} data structure (block-chunked; leaves = cached
{{UnsafeRow}} blocks; internal nodes = pre-merged aggregate buffers).
* Integration via {{SegmentTreeWindowFunctionFrame}}, selected by
{{WindowEvaluatorFactoryBase}} when the feature flag is on and the frame is
sliding / mergeable.
* Memory integration with {{TaskMemoryManager}} (block-chunked LRU cache, spill
via {{ExternalAppendOnlyUnsafeRowArray}}), so it does not regress low-memory
operators.
* Observability: two new SQLMetrics ({{numSegmentTreeFrames}},
{{numSegmentTreeFallbackFrames}}) and EXPLAIN visibility.
* Fallback: any unsupported case (ImperativeAggregate, UDAF via ScalaUDAF,
RANGE with multi-column ORDER BY, etc.) routes to the existing
{{SlidingWindowFunctionFrame}} with no behavioral change.
h3. Expected benefit
Preliminary in-tree {{WindowBenchmark}} numbers (GHA fork run, JDK 17/21/25, 2M
rows unless noted):
* MIN / MAX / SUM / AVG / COUNT: ~9-10x at W=1001
* STDDEV_SAMP: ~18x at W=1001 (stress case)
* W sweep at W=4001: ~31x (scales with W as expected: O(W/log W))
* String / spill path: ~17x
Final numbers will be refreshed from the committed
{{sql/core/benchmarks/WindowBenchmark-results.txt}} once the current GHA matrix
lands.
h3. Scope
Phase 1 (this umbrella's first wave of PRs):
* Bounded ROWS and RANGE sliding frames
* {{DeclarativeAggregate}} with {{mergeExpressions}}
* Default-off SQLConf, fallback on any unsupported shape
Phase 2 (separate JIRAs, not blocking):
* FAME prefix/suffix within-block cumulative
* Adaptive block size, fanout tuning
* Invertible fast-path for SUM/COUNT/AVG
h3. Safety / rollback
* SQLConf {{spark.sql.window.segmentTree.enabled}} defaults to {{false}}.
Setting it back to {{false}} restores the exact pre-feature execution path.
* All existing window tests must stay green with the flag on and off; a new
{{SegmentTreeWindowFunctionSuite}} covers edge cases (NULL, NaN, Decimal
overflow, Binary, UDAF fallback, RANGE with ties / NULL keys, small-partition
fallback, spill).
h3. Related work
* Leis et al., VLDB 2015: "Efficient Processing of Window Functions in
Analytical SQL Queries".
* DuckDB segment-tree window implementation.
* Velox PR #9049 (partial FAME attempt; documented pitfalls informed this
design).
h3. Subtasks
Subtasks will be filed as the implementation splits into PR-sized units (data
structure, frame integration, memory/spill, metrics, RANGE support, benchmark).
A split-strategy document will accompany the first subtask.
h3. Design doc
A public design doc summary will be linked here before the first subtask PR is
opened.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]