[jira] [Comment Edited] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

liyunzhang_intel (JIRA) Tue, 13 Jun 2017 21:29:23 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048690#comment-16048690
 ]


liyunzhang_intel edited comment on HIVE-16600 at 6/14/17 4:28 AM:
------------------------------------------------------------------

[~lirui]: 
bq. yeah we can update it when HIVE-6348 is fixed. Right now we need it 
otherwise the patch is not really tested. Please update the RB accordingly.

sorry for forgetting to update code on review board. now have updated code on 
review board. 



was (Author: kellyzly):
[~lirui]: 
bq. yeah we can update it when HIVE-6348 is fixed. Right now we need it 
otherwise the patch is not really tested. Please update the RB accordingly.
actually this case is included in HIVE-16600.12.patch and the code has been 
updated in the RB.


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Comment Edited] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

Reply via email to