[ https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032067#comment-16032067 ]
liyunzhang_intel commented on HIVE-16600:
-----------------------------------------

[~lirui]: thanks for the review. I think the algorithm in my v9 patch will treat the above case as a multi-insert case, because there is more than one path from the RS to an FS:
{code}
RS - ... - FS
   - ... - FS
   - ... - Non FS
{code}
{code}
  // The multi-insert case looks like:
  // TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
  //                   -SEL[6]-LIM[7]-RS[8]-SEL[9]-LIM[10]-FS[11]
  // Verify the multi-insert case: if there is more than one path from the RS (RS[2])
  // to an FS in the operator tree, it is a multi-insert case.
  private boolean isMultiInsert(ReduceSinkOperator rs) {
    int pathToFSNum = 0;
    Deque<Operator<?>> childQueue = new LinkedList<>();
    childQueue.addAll(rs.getChildOperators());
    while (!childQueue.isEmpty()) {
      Operator<?> child = childQueue.pop();
      if (child instanceof FileSinkOperator) {
        // Each FileSinkOperator reached marks one path from the RS to an FS.
        pathToFSNum++;
      } else {
        childQueue.addAll(child.getChildOperators());
      }
    }

    boolean isMultiInsert = pathToFSNum > 1;
    LOG.debug("reducesink:" + rs + " isMultiInsert:" + isMultiInsert);
    return isMultiInsert;
  }
{code}
What confuses me is this: is the above case not a multi-insert case? Is it a case related to Spark dynamic partition pruning?

bq. Besides, I'm wondering whether it's better to avoid such order by in sub-queries in the first place, as it is essentially pointless.
Agree.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch,
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch,
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch,
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code}
> The parallelism of the sort is 1 even though parallel order by is enabled
> ("hive.optimize.sampling.orderby" is set to "true"). This is not
> reasonable, because the parallelism should be calculated by
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> The cause is that SetSparkReducerParallelism#needSetParallelism returns false
> when the [children size of
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                      -SEL[6]-FS[7]
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
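
For illustration only, here is a minimal, self-contained sketch of the path-counting idea applied to the logical plan from the issue description. It assumes a toy {{Op}} class in place of Hive's real operator classes; {{Op}}, {{countFileSinks}}, {{planFromDescription}} and {{singleInsertPlan}} are hypothetical names made up for this sketch and are not part of the patch or of Hive.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class MultiInsertSketch {

    // Toy stand-in for an operator node (assumption: NOT Hive's Operator class).
    public static final class Op {
        final String name;
        final boolean isFileSink;
        final List<Op> children = new ArrayList<>();

        Op(String name, boolean isFileSink) {
            this.name = name;
            this.isFileSink = isFileSink;
        }

        // Attach a child and return it, so that a branch can be built by chaining.
        Op to(Op child) {
            children.add(child);
            return child;
        }
    }

    // Breadth-first walk below the reduce sink, counting the file sinks reached.
    // The walk does not descend past a file sink.
    public static int countFileSinks(Op rs) {
        int count = 0;
        Deque<Op> queue = new ArrayDeque<>(rs.children);
        while (!queue.isEmpty()) {
            Op child = queue.poll();
            if (child.isFileSink) {
                count++;
            } else {
                queue.addAll(child.children);
            }
        }
        return count;
    }

    // More than one file sink below the reduce sink is treated as multi-insert.
    public static boolean isMultiInsert(Op rs) {
        return countFileSinks(rs) > 1;
    }

    // TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
    //                   -SEL[6]-FS[7]
    // Only the subtree rooted at RS[2] is modelled here.
    public static Op planFromDescription() {
        Op rs2 = new Op("RS[2]", false);
        rs2.to(new Op("SEL[3]", false)).to(new Op("SEL[4]", false)).to(new Op("FS[5]", true));
        rs2.to(new Op("SEL[6]", false)).to(new Op("FS[7]", true));
        return rs2;
    }

    // A plain single-insert shape for comparison: RS-SEL-FS.
    public static Op singleInsertPlan() {
        Op rs = new Op("RS", false);
        rs.to(new Op("SEL", false)).to(new Op("FS", true));
        return rs;
    }

    public static void main(String[] args) {
        Op rs2 = planFromDescription();
        System.out.println("children of RS[2]: " + rs2.children.size());
        System.out.println("multi-insert (description plan): " + isMultiInsert(rs2));
        System.out.println("multi-insert (single insert): " + isMultiInsert(singleInsertPlan()));
    }
}
```

In this toy model, {{RS[2]}} has two children and two file sinks below it, so the sketch reports it as multi-insert, while the plain RS-SEL-FS shape is not. Whether the real operator tree behaves the same way depends on Hive's operator classes, which this sketch does not use.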