[ https://issues.apache.org/jira/browse/SPARK-19222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823389#comment-15823389 ]
Yadong Qi edited comment on SPARK-19222 at 1/16/17 2:56 AM:
------------------------------------------------------------

Hi [~maropu], "sample" here means `TABLESAMPLE(x ROWS)` or `TABLESAMPLE(x PERCENT)`. The physical plan of `TABLESAMPLE(x ROWS)` is the same as LIMIT, so I think you mean `TABLESAMPLE(x PERCENT)`. A user's query may look like `create table t1 as select * from dest1 where phoneNum = 'xxx' limit 10000000`: the user wants to get as many as 10000000 records, and table t1 will be analyzed later. We don't know the number of records returned by the subquery `select * from dest1 where phoneNum = 'xxx'`, so we can't derive the percentage.


> Limit Query Performance issue
> -----------------------------
>
>                 Key: SPARK-19222
>                 URL: https://issues.apache.org/jira/browse/SPARK-19222
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>         Environment: Linux/Windows
>            Reporter: Sujith
>            Priority: Minor
>
> A performance/memory bottleneck occurs in the queries below.
>
> Case 1:
> create table t1 as select * from dest1 limit 10000000;
>
> Case 2:
> create table t1 as select * from dest1 limit 1000;
>
> Pre-condition: partition count >= 10000
>
> In both cases the limit is added at the end of the physical plan:
>
> == Physical Plan ==
> ExecutedCommand
> +- CreateHiveTableAsSelectCommand [Database: spark, TableName: t2, InsertIntoHiveTable]
>    +- GlobalLimit 10000000
>       +- LocalLimit 10000000
>          +- Project [imei#101, age#102, task#103L, num#104, level#105, productdate#106, name#107, point#108]
>             +- SubqueryAlias hive
>                +- Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108] csv
>
> Issue hints:
> A possible bottleneck is the following snippet in limit.scala (spark-sql package):
>
> protected override def doExecute(): RDD[InternalRow] = {
>   val locallyLimited = child.execute().mapPartitionsInternal(_.take(limit))
>   val shuffled = new ShuffledRowRDD(
>     ShuffleExchange.prepareShuffleDependency(
>       locallyLimited, child.output, SinglePartition, serializer))
>   shuffled.mapPartitionsInternal(_.take(limit))
> }
>
> In case 1 (limit value of 10000000 and partition count > 10000) as well as in case 2 (small limit value, around 1000), the ShuffledRowRDD created by the snippet above groups the locally limited data from all partitions into a single partition on one executor. Because every partition's limited rows are collected and processed in that single partition, the row count can become very large in either case, creating the memory bottleneck.
> Proposed solution for case 2:
> An accumulator can be sent to all partitions, and every executor updates it based on the number of rows it has fetched.
> E.g. number of partitions = 100, number of cores = 10.
> Tasks would be launched in waves of 10 tasks (one per core). Once the first wave finishes, the driver checks whether the accumulator has reached the limit value; if it has, no further tasks are launched to the executors and the result, after applying the limit, is returned.
>
> Please let me know of any suggestions or solutions for the above mentioned problems.
>
> Thanks,
> Sujith
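
To make the bottleneck in the quoted doExecute snippet concrete, here is a minimal, self-contained Scala sketch. It is not CollectLimitExec itself (the object name, row counts, and partition counts are made up for illustration); it only mimics the two-phase limit: LocalLimit truncates every partition, then GlobalLimit shuffles all surviving rows into a single partition and truncates again. With a large limit and thousands of partitions, that single partition can receive up to numPartitions * limit rows, which is the memory bottleneck described above.

import org.apache.spark.sql.SparkSession

object LimitBottleneckSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("limit-bottleneck-sketch")
      .master("local[4]")            // assumption: local run, for illustration only
      .getOrCreate()

    val limit = 1000
    val numPartitions = 100          // the reported issue involves >= 10000 partitions

    val rows = spark.sparkContext.parallelize(1 to 1000000, numPartitions)

    // Phase 1 (LocalLimit): every partition keeps at most `limit` rows.
    val locallyLimited = rows.mapPartitions(_.take(limit))

    // Phase 2 (GlobalLimit): all locally limited rows are shuffled into ONE
    // partition before the final take; repartition(1) stands in for the
    // SinglePartition shuffle performed by ShuffleExchange in the snippet above.
    val globallyLimited = locallyLimited.repartition(1).mapPartitions(_.take(limit))

    // The single post-shuffle partition held up to numPartitions * limit rows.
    println(s"rows after global limit: ${globallyLimited.count()}")
    spark.stop()
  }
}

Increasing `numPartitions` toward the reported >= 10000 and raising `limit` reproduces the growth of the single post-shuffle partition in both case 1 and case 2.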
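The accumulator-based early termination proposed for case 2 could be sketched as follows. This is an assumption-heavy illustration, not Spark's implementation: the driver submits tasks in waves (here via sc.runJob on explicit partition ranges, with hypothetical names such as waveSize and rowsFetched), each task reports the number of rows it took through a LongAccumulator, and the driver stops submitting new waves once the accumulator reaches the limit, so the remaining partitions are never scanned.

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

object IncrementalLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("incremental-limit-sketch")
      .master("local[4]")
      .getOrCreate()
    val sc = spark.sparkContext

    val limit = 1000
    val rows = sc.parallelize(1 to 1000000, 100)   // 100 partitions, as in the example above

    val fetched: LongAccumulator = sc.longAccumulator("rowsFetched")
    val collected = ArrayBuffer[Int]()

    val waveSize = 10                 // e.g. one wave of tasks per 10 cores
    var nextPartition = 0

    // Submit one wave of partitions at a time and stop as soon as the accumulator
    // reports that enough rows have been fetched, so later partitions are skipped.
    while (fetched.sum < limit && nextPartition < rows.getNumPartitions) {
      val wave = nextPartition until math.min(nextPartition + waveSize, rows.getNumPartitions)
      val results = sc.runJob(rows, (it: Iterator[Int]) => {
        val taken = it.take(limit).toArray   // each task takes at most `limit` rows
        fetched.add(taken.length)            // report progress back to the driver
        taken
      }, wave)
      collected ++= results.flatten
      nextPartition += waveSize
    }

    println(s"collected ${math.min(collected.size, limit)} of $limit requested rows")
    spark.stop()
  }
}

The trade-off in this sketch is extra job-submission latency per wave, in exchange for not shuffling every partition's locally limited output into a single task.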