[ https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liyunzhang_intel updated HIVE-16600:
------------------------------------
    Attachment: HIVE-16600.1.patch

[~lirui]: I have uploaded HIVE-16600.1.patch. Please help review it.
{noformat}
For the case given in the description, the parallelism of RS[2] (Sort) is 1 without HIVE-16600.1.patch and 46 with it.

before HIVE-16600.1.patch the parallelism of RS[2] is 1
#grep SetSparkReducerParallelism logs/hive.log
2017-05-08T10:31:10,820 INFO [63ddd225-f012-4b14-9141-38597f94c85b main] spark.SetSparkReducerParallelism: Number of reducers determined to be: 1

after HIVE-16600.1.patch the parallelism of RS[2] is 46
#grep SetSparkReducerParallelism logs/hive.log
2017-05-08T10:22:49,432 DEBUG [42c701ac-380e-43e3-a3ab-f5aa7c2b55ee main] spark.SetSparkReducerParallelism: Sibling RS[2] has stats: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
2017-05-08T10:23:10,403 INFO [42c701ac-380e-43e3-a3ab-f5aa7c2b55ee main] spark.SetSparkReducerParallelism: Set parallelism for reduce sink RS[2] to: 46 (calculated)
{noformat}

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code}
> The parallelism of Sort is 1 even though parallel order by is enabled ("hive.optimize.sampling.orderby" is set to "true"). This is not reasonable, because the parallelism should be calculated by [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This happens because SetSparkReducerParallelism#needSetParallelism returns false when the [children size of RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207] is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
> TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                   -SEL[6]-FS[7]
> {code}
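
The multi-insert plan in the description can be modelled with a small toy tree to show why the check bails out for RS[2]. This is only a sketch under stated assumptions: the {{Op}} class, {{buildMultiInsertPlan}} and this simplified {{needSetParallelism}} are hypothetical stand-ins, not Hive's operator classes or the real method in SetSparkReducerParallelism. The only rule it encodes is the one stated in the description, namely that the check returns false when the reduce sink has more than one child.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: Op and needSetParallelism are hypothetical stand-ins,
// not Hive's Operator classes or the real SetSparkReducerParallelism code.
public class MultiInsertPlanSketch {

    static final class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name) {
            this.name = name;
        }

        // Adds a child and returns it, so linear chains read left to right.
        Op child(String childName) {
            Op c = new Op(childName);
            children.add(c);
            return c;
        }
    }

    // Simplified form of the rule described in the issue:
    // refuse to set parallelism when the reduce sink has more than one child.
    static boolean needSetParallelism(Op reduceSink) {
        return reduceSink.children.size() <= 1;
    }

    // Builds TS[0]-SEL[1]-RS[2] with the two insert branches and returns RS[2].
    static Op buildMultiInsertPlan() {
        Op ts0 = new Op("TS[0]");
        Op rs2 = ts0.child("SEL[1]").child("RS[2]");
        rs2.child("SEL[3]").child("SEL[4]").child("FS[5]");
        rs2.child("SEL[6]").child("FS[7]");
        return rs2;
    }

    public static void main(String[] args) {
        Op rs2 = buildMultiInsertPlan();
        System.out.println(rs2.name + " children: " + rs2.children.size());
        System.out.println("needSetParallelism: " + needSetParallelism(rs2));
    }
}
```

In this model RS[2] has two children (SEL[3] and SEL[6]), so the simplified check returns false. That corresponds to the "Number of reducers determined to be: 1" log line seen without the patch.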