[jira] [Comment Edited] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

Xuefu Zhang (JIRA) Wed, 17 May 2017 09:58:43 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014389#comment-16014389
 ]


Xuefu Zhang edited comment on HIVE-16600 at 5/17/17 4:57 PM:
-------------------------------------------------------------

Hi [~lirui], For your example, my understanding is that the target table isn't 
expected to contain the top 5 ordered results from {{A}}. Assuming {{A}} is a 
regular table that happens to be ordered, {{select A.k from A limit 5}} can 
choose any 5 records from {{A}}. Both the order by and limit clauses control 
the output of the query they belong to. In your example, the order by clause 
controls the output of the subquery, which is {{A}}. That means that {{A}} is 
expected to contain all rows of {{src}}, ordered. The limit clause controls 
the output of the second select statement, which is unambiguous. Even if {{A}} 
is ordered, the second {{select ... from}} can completely ignore that order 
and return an unordered set of rows from {{A}}, which is further limited to 
any 5 rows by the limit clause. I don't think {{select}} is equivalent to 
{{fetch first}} introduced in SQL:2008 or {{select first}} supported by some 
DB vendors. 

If a user wants the top-5 records from src, your example can be rewritten as 
{{FROM src insert overwrite table target select src.k order by src.k limit 5}}, 
which may further clarify why in your example the target table is not expected 
to have top-5 records from src.
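
To make the contrast concrete, here are the two forms side by side. The first 
statement is only my assumed reconstruction of the example under discussion 
(it is not quoted from the earlier comment); the second is the rewrite 
suggested above. The names {{src}}, {{target}}, {{A}} and {{k}} are the ones 
already used in this thread.

{code:sql}
-- Assumed shape of the original example: the order by belongs to the
-- subquery A, and the limit belongs to the outer select. The outer select
-- is free to return any 5 rows of A.
FROM (SELECT k FROM src ORDER BY k) A
INSERT OVERWRITE TABLE target
    SELECT A.k LIMIT 5;

-- Rewritten form: order by and limit belong to the same select, so the
-- limit is applied to the ordered output and target gets the top-5 rows.
FROM src
INSERT OVERWRITE TABLE target
    SELECT src.k ORDER BY src.k LIMIT 5;
{code}

In the first form an engine may happen to return the first 5 ordered rows, 
but as far as I can tell nothing in the query requires it to.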


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16600
>                 URL: https://issues.apache.org/jira/browse/HIVE-16600
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
>     SELECT key, value
> INSERT OVERWRITE TABLE e2
>     SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> The parallelism of Sort is 1 even when parallel order by is enabled 
> ("hive.optimize.sampling.orderby" is set to "true"). This is not 
> reasonable because the parallelism should be calculated by 
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170].
> This happens because SetSparkReducerParallelism#needSetParallelism returns 
> false when the 
> [children size of RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
> is greater than 1.
> In this case, the children size of {{RS[2]}} is two.
> The logical plan of the case:
> {code}
>    TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
>                             -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
