[ https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-19222:
---------------------------------
    Labels: bulk-closed  (was: )

> Limit Query Performance issue
> -----------------------------
>
>                 Key: SPARK-19222
>                 URL: https://issues.apache.org/jira/browse/SPARK-19222
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: Linux/Windows
>            Reporter: Sujith Chacko
>            Priority: Minor
>              Labels: bulk-closed
>
> When a limit is added in the middle of the physical plan, a memory
> bottleneck is possible if the limit value is too large, because the system
> tries to aggregate the limited rows of every partition into a single
> partition.
> Description:
> Eg:
> create table src_temp as select * from src limit n;    (n=10000000)
> == Physical Plan ==
> ExecutedCommand
>    +- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2, InsertIntoHiveTable]
>          +- GlobalLimit 10000000
>             +- LocalLimit 10000000
>                +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
>                   +- SubqueryAlias hive
>                      +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv    |
>
> As the plan above shows, when the limit comes in the middle of the plan
> there can be two types of performance bottleneck:
> scenario 1: when the partition count is very high and the limit value is small
> scenario 2: when the limit value is very large
>
> Eg, the current behaviour for a limit count of 10000000 and a partition
> count of 5:
>
>                         |partition 1| |partition 2| |partition 3| |partition 4| |partition 5|
> Local Limit  -------->   <<take n>>    <<take n>>    <<take n>>    <<take n>>    <<take n>>
>                                                           |
>                                        Shuffle Exchange (into single partition)
>                                                           |
> Global Limit -------->                               <<take n>>
>                          (all the partition data will be grouped in single partition)
>
> In this scenario the system shuffles the limited rows from all partitions
> into a single partition, which induces the performance bottleneck.
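
The two-stage limit described in the report can be modelled in a few lines of plain Python. This is only an illustrative sketch, not Spark's implementation: the function names and the small sample sizes (5 partitions of 1000 rows, n = 800) are assumptions chosen so the example runs quickly. It shows that up to n rows per partition are gathered into one partition before the final take.

```python
from itertools import islice


def local_limit(partition, n):
    """Take at most n rows from one partition (models the LocalLimit step)."""
    return list(islice(partition, n))


def global_limit(partitions, n):
    """Model LocalLimit -> shuffle into a single partition -> GlobalLimit.

    Returns the final rows and the number of rows that were gathered
    into the single partition before the final take.
    """
    limited = [local_limit(p, n) for p in partitions]
    # Models the shuffle exchange: every limited partition lands in one place.
    single_partition = [row for part in limited for row in part]
    return single_partition[:n], len(single_partition)


# Hypothetical sample data: 5 partitions of 1000 rows each, limit n = 800.
partitions = [range(i * 1000, (i + 1) * 1000) for i in range(5)]
rows, shuffled = global_limit(partitions, 800)
print(len(rows), shuffled)  # prints: 800 4000
```

In this model the single partition holds min(n, partition size) rows from each input partition, so the gathered volume grows with both the limit value and the partition count even though only n rows are kept.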