GitHub user hvanhovell opened a pull request: https://github.com/apache/spark/pull/7057
[SPARK-8638] [SQL] Window Function Performance Improvements ## Description Performance improvements for Spark Window functions. This PR will also serve as the basis for moving away from Hive UDAFs to Spark UDAFs. See JIRA tickets SPARK-8638 and SPARK-7712 for more information. The original work including some benchmarking code for the running case can be here: https://github.com/hvanhovell/spark-window ## Improvements * Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) and UNBOUDED FOLLOWING cases. The current implementation in spark uses a sliding window approach in these cases. This means that an aggregate is maintained for every row, so space usage is N (N being the number of rows). This also means that all these aggregates all need to be updated separately, this takes N*(N-1)/2 updates. The running case differs from the Sliding case because we are only adding data to an aggregate function (no reset is required), we only need to maintain one aggregate (like in the UNBOUNDED PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the aggregate value after each update. This is what the new implementation does. This approach only uses 1 buffer, and only requires N updates; I am currently working on data with window sizes of 500-1000 doing running sums and this saves a lot of time. The CURRENT ROW AND UNBOUNDED FOLLOWING case als o uses this approach and the fact that aggregate operations are communitative, there is one twist though it will process the input buffer in reverse. * Fewer comparisons in the sliding case. The current implementation determines frame boundaries for every input row. The new implementation makes more use of the fact that the window is sorted, maintains the boundaries, and only moves them when the current row order changes. This is a minor improvement. * A single Window node is able to process all types of Frames for the same Partitioning/Ordering. This saves a little time/memory spent buffering and managing partitions. This will be enabled in a follow-up PR. * A lot of the staging code is moved from the execution phase to the initialization phase. Minor performance improvement, and improves readability of the execution code. You can merge this pull request into a Git repository by running: $ git pull https://github.com/hvanhovell/spark SPARK-8638 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7057.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7057 ---- commit f161920218c880e4f2804b19db129936d0d34d61 Author: Herman van Hovell <hvanhov...@questtec.nl> Date: 2015-06-27T15:39:53Z Major overhaul of Window operator. commit 22e51d39c833d0e69a788ad41b8bd790688b9bd9 Author: Herman van Hovell <hvanhov...@questtec.nl> Date: 2015-06-27T15:54:00Z Fixed "aggregation and range betweens with unbounded" test - two different window frames were compared. commit ad7820c6ac04f188e8a6239d314b680c8a6b4551 Author: Herman van Hovell <hvanhov...@questtec.nl> Date: 2015-06-27T16:00:23Z Disabled Tests 42 & 43 because tiny numerical differences in answers. ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org