[jira] [Updated] (SPARK-32096) Improve sorting performance for Spark SQL rank window function

Zikun (Jira) Sun, 20 Sep 2020 22:01:10 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-32096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zikun updated SPARK-32096:
--------------------------
    Description: 
The Spark SQL rank window function needs to sort the data in each window
partition, and it relies on the execution operator
[*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]
to do the sort. During sorting, the window partition key is placed at the front
of the sort order, which adds unnecessary comparisons on the partition key.
Instead, we can group the rows by partition key first and then sort the rows
inside each group without comparing the partition key.
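
The idea can be illustrated outside Spark with a minimal Python sketch. It is
not the SortExec implementation; the function names and the (partition_key,
order_value) row layout are hypothetical and chosen only for illustration. Both
functions produce the same ranks, but the second one never compares partition
keys while sorting.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def assign_ranks(key, sorted_values):
    """Yield (key, value, rank) with SQL RANK() semantics: ties share a rank."""
    rank = 0
    previous = object()  # sentinel that is not equal to any real value
    for position, value in enumerate(sorted_values, start=1):
        if value != previous:
            rank = position
            previous = value
        yield (key, value, rank)


def rank_with_full_sort(rows):
    # The partition key leads the sort order, so every comparison inspects it.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        result.extend(assign_ranks(key, [value for _, value in group]))
    return result


def rank_with_group_then_sort(rows):
    # Hash-group rows by partition key; no ordering across partitions is needed.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    result = []
    for key, values in groups.items():
        values.sort()  # compares order values only, never the partition key
        result.extend(assign_ranks(key, values))
    return result
```

The output order of partitions may differ between the two functions, which does
not matter for rank because ranks are defined per partition.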

 

The Jira https://issues.apache.org/jira/browse/SPARK-32947 tracks a follow-up 
to this improvement.

  was:
The Spark SQL rank window function needs to sort the data in each window
partition, and it relies on the execution operator
[*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]
to do the sort. During sorting, the window partition key is placed at the front
of the sort order, which adds unnecessary comparisons on the partition key.
Instead, we can group the rows by partition key first and then sort the rows
inside each group without comparing the partition key.

 

In Spark SQL, there are two sort operators, *_SortExec_* and
*_TakeOrderedAndProjectExec_*. *_SortExec_* is a general-purpose sort operator
and does not support top-N sort. *_TakeOrderedAndProjectExec_* is the operator
for top-N sort in Spark. The Spark SQL rank window function needs to sort the
data locally, and it relies on *_SortExec_* to sort the data in each physical
data partition. When a filter on the window rank (e.g. rank <= 100) is
specified in a user's query, the filter can be pushed down to SortExec so that
SortExec performs a top-N sort. Right now SortExec does not support top-N sort,
so we need to extend SortExec to support it.
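
A minimal Python sketch of what a pushed-down rank filter could compute is
shown here. It is an illustration under stated assumptions, not Spark code: the
function name, the (partition_key, order_value) row layout, and the use of
heapq.nsmallest in place of a full sort are all hypothetical. Because RANK()
gives tied rows the same rank, rows tied with the cutoff value are kept as
well, so a partition may return more than `limit` rows.

```python
import heapq
from collections import defaultdict


def top_n_by_rank(rows, limit):
    """Return (key, value, rank) for rows whose per-partition rank is <= limit."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    result = []
    for key, values in groups.items():
        # Select the `limit` smallest values without sorting the whole partition.
        smallest = heapq.nsmallest(limit, values)
        if not smallest:
            continue
        cutoff = smallest[-1]
        # Rows tied with the cutoff share its rank, so they also pass the filter.
        extra_ties = values.count(cutoff) - smallest.count(cutoff)
        kept = smallest + [cutoff] * extra_ties

        # Assign RANK() values: ties share a rank, the next distinct value skips ahead.
        rank = 0
        previous = object()  # sentinel that is not equal to any real value
        for position, value in enumerate(kept, start=1):
            if value != previous:
                rank = position
                previous = value
            result.append((key, value, rank))
    return result
```

For example, with limit = 2 a partition holding the values 5, 1, 3, 3, 9 keeps
1 (rank 1) and both 3s (rank 2) and discards 5 and 9.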


> Improve sorting performance for Spark SQL rank window function
> ---------------------------------------------------------------
>
>                 Key: SPARK-32096
>                 URL: https://issues.apache.org/jira/browse/SPARK-32096
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: Any environment that supports Spark.
>            Reporter: Zikun
>            Priority: Major
>         Attachments: windowSortPerf (1).docx
>
>
> The Spark SQL rank window function needs to sort the data in each window
> partition, and it relies on the execution operator
> [*_SortExec_*|https://sqlhelsinki.visualstudio.com/oss/_git/spark?path=%2Fsql%2Fcore%2Fsrc%2Fmain%2Fscala%2Forg%2Fapache%2Fspark%2Fsql%2Fexecution%2FSortExec.scala&version=GBsql-2.4&line=37&lineEnd=43&lineStartColumn=1&lineEndColumn=1&lineStyle=plain]
> to do the sort. During sorting, the window partition key is placed at the
> front of the sort order, which adds unnecessary comparisons on the partition
> key. Instead, we can group the rows by partition key first and then sort the
> rows inside each group without comparing the partition key.
>
> The Jira https://issues.apache.org/jira/browse/SPARK-32947 tracks a follow-up
> to this improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
