[ https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-19222:
---------------------------------
    Labels: bulk-closed  (was: )

> Limit Query Performance issue
> -----------------------------
>
>                 Key: SPARK-19222
>                 URL: https://issues.apache.org/jira/browse/SPARK-19222
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: Linux/Windows
>            Reporter: Sujith Chacko
>            Priority: Minor
>              Labels: bulk-closed
>
> When a limit is applied in the middle of the physical plan, there is a 
> possibility of a memory bottleneck: if the limit value is too large, the 
> system will try to aggregate all the per-partition limit results into a 
> single partition. 
> Description: 
> Eg: 
> create table src_temp as select * from src limit n;    (n=10000000) 
> == Physical Plan == 
> ExecutedCommand 
>    +- CreateHiveTableAsSelectCommand [Database:spark, TableName: t2, InsertIntoHiveTable] 
>          +- GlobalLimit 10000000 
>             +- LocalLimit 10000000 
>                +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108] 
>                   +- SubqueryAlias hive 
>                      +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv 
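> A plan like the one above can be reproduced and inspected from a 
> Hive-enabled session. The sketch below is minimal and illustrative only; 
> the object name, app name, and the pre-existing `src` table are 
> assumptions, not part of this report: 
>
>     // Minimal sketch to reproduce a plan like the one above.
>     // Assumes a Hive-enabled SparkSession and an existing table `src`
>     // (both assumptions, not part of the original report).
>     import org.apache.spark.sql.SparkSession
>
>     object LimitPlanRepro {
>       def main(args: Array[String]): Unit = {
>         val spark = SparkSession.builder()
>           .appName("limit-plan-repro")   // hypothetical app name
>           .enableHiveSupport()
>           .getOrCreate()
>
>         // The CTAS places GlobalLimit/LocalLimit in the middle of the
>         // plan, below the ExecutedCommand node.
>         spark.sql("select * from src limit 10000000").explain()
>         spark.sql("create table src_temp as select * from src limit 10000000")
>         spark.stop()
>       }
>     }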
> As shown in the plan above, when the limit comes in the middle, there can 
> be two types of performance bottleneck: 
> scenario 1: when the partition count is very high and the limit value is small 
> scenario 2: when the limit value is very large 
> Eg: the current scenario, based on sample data with a limit count of 10000000 
> and a partition count of 5: 
> Local Limit ------->  |partition 1| |partition 2| |partition 3| |partition 4| |partition 5|
>                        <<take n>>   <<take n>>   <<take n>>   <<take n>>   <<take n>>
>                                                 |
>                              Shuffle Exchange (into single partition)
>                                                 |
> Global Limit ------->                      <<take n>>
>                        (all the partition data will be grouped in a single partition)
>   
> When the above scenario occurs, the system will shuffle and try to group 
> the limit data from all partitions into a single partition, which will 
> induce a performance bottleneck. 
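> The per-partition take followed by a single-partition take can be 
> illustrated at the RDD level. The sketch below is a toy illustration of 
> the described behaviour, not Spark's actual GlobalLimitExec code; the 
> object name, the local master, and the generated data are assumptions: 
>
>     // Rough illustration of the LocalLimit -> Shuffle Exchange ->
>     // GlobalLimit pattern described above, expressed with the RDD API.
>     import org.apache.spark.sql.SparkSession
>
>     object LimitShuffleIllustration {
>       def main(args: Array[String]): Unit = {
>         val spark = SparkSession.builder()
>           .appName("limit-shuffle-illustration")
>           .master("local[*]")       // assumption: local demo run
>           .getOrCreate()
>
>         val n = 1000000             // the limit value
>         val data = spark.sparkContext.parallelize(1 to 5000000, numSlices = 5)
>
>         // LocalLimit: each of the 5 partitions takes up to n rows.
>         val localLimited = data.mapPartitions(_.take(n))
>
>         // Shuffle Exchange + GlobalLimit: all surviving rows (up to
>         // 5 * n) are moved into one partition before the final take(n),
>         // which is where the memory bottleneck arises.
>         val globalLimited = localLimited
>           .coalesce(1, shuffle = true)
>           .mapPartitions(_.take(n))
>
>         println(globalLimited.count())
>         spark.stop()
>       }
>     }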


