[ 
https://issues.apache.org/jira/browse/SPARK-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-8638:
-------------------------------------
    Attachment: perf_test2.scala

Additional Performance Test. In this case we test the performance per Frame 
type.

> Window Function Performance Improvements
> ----------------------------------------
>
>                 Key: SPARK-8638
>                 URL: https://issues.apache.org/jira/browse/SPARK-8638
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Herman van Hovell
>         Attachments: perf_test.scala, perf_test2.scala
>
>
> Improve the performance of Spark Window Functions in the following cases:
> # Much better performance (10x) in the running case (e.g. BETWEEN UNBOUNDED 
> PRECEDING AND CURRENT ROW). The current implementation in spark uses a 
> sliding window approach in these cases. This means that an aggregate is 
> maintained for every row, so space usage is N (N being the number of rows). 
> This also means that all these aggregates all need to be updated separately, 
> this takes N*(N-1)/2 updates. The running case differs from the Sliding case 
> because we are only adding data to an aggregate function (no reset is 
> required), we only need to maintain one aggregate (like in the UNBOUNDED 
> PRECEDING AND UNBOUNDED case), update the aggregate for each row, and get the 
> aggregate value after each update. This is what the new implementation does. 
> This approach only uses 1 buffer, and only requires N updates; I am currently 
> working on data with window sizes of 500-1000 doing running sums and this 
> saves a lot of time.
> #. Fewer comparisons in the sliding case. The current implementation 
> determines frame boundaries for every input row. The new implementation makes 
> more use of the fact that the window is sorted, maintains the boundaries, and 
> only moves them when the current row order changes. This is a minor 
> improvement.
> # A single Window node is able to process all types of Frames for the same 
> Partitioning/Ordering. This saves a little time/memory spent buffering and 
> managing partitions. This will be enabled in a follow-up PR.
> # A lot of the staging code is moved from the execution phase to the 
> initialization phase. Minor performance improvement, and improves readability 
> of the execution code.
> The attached perf_test.scala file contains s number of queries which can be 
> used to measure the differences between the current and the proposed window 
> function implementation. In the tests the new implementation outperforms the 
> current implementation by a factor 7x in sliding window cases, and by a 
> factor 14x in the running window cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to