[ https://issues.apache.org/jira/browse/SPARK-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell tot Westerflier updated SPARK-8638: ----------------------------------------------------- Attachment: perf_test.scala Example code for performance testing the improved Window Function implementation. > Window Function Performance Improvements > ---------------------------------------- > > Key: SPARK-8638 > URL: https://issues.apache.org/jira/browse/SPARK-8638 > Project: Spark > Issue Type: Sub-task > Components: SQL > Reporter: Herman van Hovell tot Westerflier > Attachments: perf_test.scala > > > Improve the performance of Spark Window Functions in the following cases: > # Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED > PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current > implementation in spark uses a sliding window approach in these cases. This > means that an aggregate is maintained for every row, so space usage is N (N > being the number of rows). This also means that all these aggregates all need > to be updated separately, this takes N*(N-1)/2 updates. The running case > differs from the Sliding case because we are only adding data to an aggregate > function (no reset is required), we only need to maintain one aggregate (like > in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each > row, and get the aggregate value after each update. This is what the new > implementation does. This approach only uses 1 buffer, and only requires N > updates; I am currently working on data with window sizes of 500-1000 doing > running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED > FOLLOWING case also uses this approach and the fact that aggregate operations > are communitative, there is one twist though it will process the input buffer > in reverse. > #. Fewer comparisons in the sliding case. The current implementation > determines frame boundaries for every input row. The new implementation makes > more use of the fact that the window is sorted, maintains the boundaries, and > only moves them when the current row order changes. This is a minor > improvement. > # A single Window node is able to process all types of Frames for the same > Partitioning/Ordering. This saves a little time/memory spent buffering and > managing partitions. This will be enabled in a follow-up PR. > # A lot of the staging code is moved from the execution phase to the > initialization phase. Minor performance improvement, and improves readability > of the execution code. > The original work including some benchmarking code for the running case can > be here: https://github.com/hvanhovell/spark-window -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org