[jira] [Updated] (SPARK-19222) Limit Query Performance issue

Sujith (JIRA) Wed, 18 Jan 2017 00:20:08 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sujith updated SPARK-19222:
---------------------------
    Description: 
When limit is being added in the middle of the physical plan there will 
be possibility of memory bottleneck 
if the limit value is too large and system will try to aggregate all the 
partition limit values as part of single partition. 
Description: 
Eg: 
create table src_temp as select * from src limit n;    (n=10000000) 

== Physical Plan  == 
ExecutedCommand 
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
InsertIntoHiveTable] 
         +- GlobalLimit 10000000 
            +- LocalLimit 10000000 
               +- Project [imei#101, age#102, task#103L, num#104, level#105, 
productdate#106, name#107, point#108] 
                  +- SubqueryAlias hive 
                     +- 
Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
 csv  |

As shown in above plan when the limit comes in middle,there can be two 
types of performance bottlenecks. 
scenario 1: when the partition count is very high and limit value is small 
scenario 2: when the limit value is very large 


Eg,current scenario based on following sample data of limit count is 10000000 
and partition count  5 

Local Limit -------- > |partition 1||partition 2||partition 3||partition 
4||partition 5|
  ---------------------->  <<take n>><<take n>><<take n>><<take n>><<take n>>
                                                                    |
                                         Shuffle Exchange(into single 
partition)     
                                                                    |
     Global Limit -------- >                    << take n>>  (all the partition 
data will be grouped in single partition)                               
  
as the above scenario occurs where system will shuffle and try to group the 
limit data from all partition 
to single partition which will induce performance bottleneck. 

  was:
When limit is being added in the middle of the physical plan there will 
be possibility of memory bottleneck 
if the limit value is too large and system will try to aggregate all the 
partition limit values as part of single partition. 
Description: 
Eg: 
create table src_temp as select * from src limit n;    (n=10000000) 

== Physical Plan  == 
ExecutedCommand 
   +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
InsertIntoHiveTable] 
         +- GlobalLimit 10000000 
            +- LocalLimit 10000000 
               +- Project [imei#101, age#102, task#103L, num#104, level#105, 
productdate#106, name#107, point#108] 
                  +- SubqueryAlias hive 
                     +- 
Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
 csv  |

As shown in above plan when the limit comes in middle,there can be two 
types of performance bottlenecks. 
scenario 1: when the partition count is very high and limit value is small 
scenario 2: when the limit value is very large 


Eg,current scenario based on following sample data of limit count is 10000000 
and partition count  5 

Local Limit -------- > |partition 1|   |partition 2|   |partition 3|   
|partition 4|   |partition 5|
                                   take n           take n           take n     
        take n          take n 
                                          
                                         Shuffle Exchange(single partition)     
                        
Global Limit -------- >                      take n  (all the partition data 
will be grouped in single partition)                               
  
as the above scenario occurs where system will shuffle and try to group the 
limit data from all partition 
to single partition which will induce performance bottleneck. 


> Limit Query Performance issue
> -----------------------------
>
>                 Key: SPARK-19222
>                 URL: https://issues.apache.org/jira/browse/SPARK-19222
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: Linux/Windows
>            Reporter: Sujith
>            Priority: Minor
>
> When limit is being added in the middle of the physical plan there will 
> be possibility of memory bottleneck 
> if the limit value is too large and system will try to aggregate all the 
> partition limit values as part of single partition. 
> Description: 
> Eg: 
> create table src_temp as select * from src limit n;    (n=10000000) 
> == Physical Plan  == 
> ExecutedCommand 
>    +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, 
> InsertIntoHiveTable] 
>          +- GlobalLimit 10000000 
>             +- LocalLimit 10000000 
>                +- Project [imei#101, age#102, task#103L, num#104, level#105, 
> productdate#106, name#107, point#108] 
>                   +- SubqueryAlias hive 
>                      +- 
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
>  csv  |
> As shown in above plan when the limit comes in middle,there can be two 
> types of performance bottlenecks. 
> scenario 1: when the partition count is very high and limit value is small 
> scenario 2: when the limit value is very large 
> Eg,current scenario based on following sample data of limit count is 10000000 
> and partition count  5 
> Local Limit -------- > |partition 1||partition 2||partition 3||partition 
> 4||partition 5|
>   ---------------------->  <<take n>><<take n>><<take n>><<take n>><<take n>>
>                                                                     |
>                                          Shuffle Exchange(into single 
> partition)     
>                                                                     |
>      Global Limit -------- >                    << take n>>  (all the 
> partition data will be grouped in single partition)                           
>     
>   
> as the above scenario occurs where system will shuffle and try to group the 
> limit data from all partition 
> to single partition which will induce performance bottleneck. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-19222) Limit Query Performance issue

Reply via email to