[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-14 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048850#comment-16048850
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: really thanks for review.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.13.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-14 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048841#comment-16048841
 ] 

Rui Li commented on HIVE-16600:
---

+1

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.13.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048827#comment-16048827
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12872933/HIVE-16600.13.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 14 failed/errored test(s), 10817 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=140)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_2] 
(batchId=99)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=232)
org.apache.hadoop.hive.cli.TestSparkCliDriver.org.apache.hadoop.hive.cli.TestSparkCliDriver
 (batchId=103)
org.apache.hadoop.hive.metastore.TestMetaStoreAuthorization.testMetaStoreAuthorization
 (batchId=205)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testBootstrapFunctionReplication
 (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionIncrementalReplication
 (batchId=216)
org.apache.hadoop.hive.ql.parse.TestReplicationScenariosAcrossInstances.testCreateFunctionWithFunctionBinaryJarsOnHDFS
 (batchId=216)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema
 (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation 
(batchId=177)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5640/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5640/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5640/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 14 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12872933 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.13.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048690#comment-16048690
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: 
bq. yeah we can update it when HIVE-6348 is fixed. Right now we need it 
otherwise the patch is not really tested. Please update the RB accordingly.
actually this case is included in HIVE-16600.12.patch and the code has been 
updated in the RB.


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-13 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048643#comment-16048643
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], yeah we can update it when HIVE-6348 is fixed. Right now we need 
it otherwise the patch is not really tested. Please update the RB accordingly.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047392#comment-16047392
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12872796/HIVE-16600.12.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 10830 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1]
 (batchId=237)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[materialized_view_create_rewrite]
 (batchId=237)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=145)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=232)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5632/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5632/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5632/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12872796 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.12.patch, HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046413#comment-16046413
 ] 

Rui Li commented on HIVE-16600:
---

Thanks for updating [~kellyzly]. Patch looks good overall. I've left some 
comments on RB.
[~xuefuz] do you also want to take a look? Thanks.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.1.patch, HIVE-16600.2.patch, HIVE-16600.3.patch, 
> HIVE-16600.4.patch, HIVE-16600.5.patch, HIVE-16600.6.patch, 
> HIVE-16600.7.patch, HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16046372#comment-16046372
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12872663/HIVE-16600.11.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10830 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=140)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query16] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query94] 
(batchId=232)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5625/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5625/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5625/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12872663 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.11.patch, 
> HIVE-16600.1.patch, HIVE-16600.2.patch, HIVE-16600.3.patch, 
> HIVE-16600.4.patch, HIVE-16600.5.patch, HIVE-16600.6.patch, 
> HIVE-16600.7.patch, HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16042426#comment-16042426
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], both RS and FS are terminal operators. Several others are 
HashTableSinkOperator, vectorized RS/FS, etc. You can check implementations of 
TerminalOperator.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-07 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16041764#comment-16041764
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: after HIVE-6384,   there will not be order  but be order by limit in 
multi-insert.  All order by limit case will be judged in same way whether in 
multi-insert or non multi-insert case.
bq.basically I meant we can fall back to v1 patch, combined with the new tests. 
Besides, instead of checking reaching RS/FS, we can check reaching terminal 
operators.
  agree with falling back to v1 patch.  But not very understanding why instead 
of checking reaching RS/FS, we can check reaching terminal operators.
{code}
RS-LIMIT-RS-SEL-FS
{code}
 in above case, you can judge this is an order by limit case when second RS is 
encountered. why need reach to FS?





> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-07 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040540#comment-16040540
 ] 

Rui Li commented on HIVE-16600:
---

If the order by in sub query is followed by limit, it won't be removed. But in 
that case, parallel order by will be disabled anyway because of the limit, 
right?

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-07 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040535#comment-16040535
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: what i am very confused is order by will be removed in the sub query 
of multi-insert unless you use limit in HIVE-6348
{code}
select * from (select * from foo order by c asc limit 10) bar order by c desc;

{code}
in these case first order by will be removed?

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-07 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040512#comment-16040512
 ] 

Rui Li commented on HIVE-16600:
---

With HIVE-6348, we won't have such an order by in sub queries - multi insert 
can only have order by in the outermost select. So I don't think the complexity 
of v10 is necessary. Makes sense?

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-07 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040438#comment-16040438
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: v1 patch( with tranversing to the final FS) can not deal with the 
following case
{code}

RS - LIMIT. - FS
   -  FS
{code}
this is a multi-insert but should enable parallel order by because LIMIT is in 
multi-insert  query.  
If my understanding is wrong, please tell me, besides, is there any wrong with 
algorithms in v10 patch although it is complex?

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-07 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040423#comment-16040423
 ] 

Rui Li commented on HIVE-16600:
---

Hi [~kellyzly], basically I meant we can fall back to v1 patch, combined with 
the new tests. Besides, instead of checking reaching RS/FS, we can check 
reaching terminal operators.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038321#comment-16038321
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12871498/HIVE-16600.10.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10820 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] 
(batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=145)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query78] 
(batchId=232)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5543/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5543/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5543/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12871498 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.10.patch, HIVE-16600.1.patch, 
> HIVE-16600.2.patch, HIVE-16600.3.patch, HIVE-16600.4.patch, 
> HIVE-16600.5.patch, HIVE-16600.6.patch, HIVE-16600.7.patch, 
> HIVE-16600.8.patch, HIVE-16600.9.patch, mr.explain, 
> mr.explain.log.HIVE-16600, Node.java, 
> TestSetSparkReduceParallelism_MultiInsertCase.java
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-05 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16038276#comment-16038276
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: sorry for reply late.
bq.I prefer to fall back to the initial method here: to check LIM before RS/FS 
in all branches, which is simpler and safer.
in HIVE-16600.10.patch, it checks all branches of multi-insert case.  
multi-insert=false
check LIMIT appears before next RS/FS
multi-insert= true
1. if current branch ends with FS, check whether LIMIT appears before 
jointOperator(means from where branch starts)
2. if current branch ends with nonFS, check LIMIT appears before next RS/FS

Although currently i can not find the case you mentioned above in hive queries.
{noformat}
RS - ... - FS
   - ... - FS
   - ... - Non FS
{noformat}

I create some unit test 
files(TestSetSparkReduceParallelism_MultiInsertCase.java, Node.java) which 
mocks the different cases(include the case you mentioned)

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-02 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034519#comment-16034519
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], sorry for the delay.
bq. what i am confused is above case is not a multi insert case?
They're multi insert. But we don't know whether LIM will be handled differently 
in those non-FS branches. So they might need be treated differently from the FS 
branches right?
If HIVE-6348 can be fixed, it'll solve the problem here. If not, I prefer to 
fall back to the initial method here: to check LIM before RS/FS in all 
branches, which is simpler and safer.
[~xuefuz], could you also share your thoughts? Thanks.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-06-01 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034074#comment-16034074
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: appreciate to get some comments about algorithms or v9 patch.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-31 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16032067#comment-16032067
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: thanks for review. I think my algorithm in v9 patch will consider 
above case a multi-insert case  as there is more than 1 path to RS to FS
{code}
RS - ... - FS
   - ... - FS
   - ... - Non FS
{code}

{code}
// the multi insert case is like
  // TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
  //  -SEL[6]-LIM[7]-RS[8]-SEL[9]-LIM[10]-FS[11]
  // verify Multi Insert case: if there are more than 1 path from RS(RS[2]) to 
FS in the operator tree, it is a multi-insert
  // case
  private boolean isMultiInsert(ReduceSinkOperator rs) {
int pathToFSNum = 0;
Deque> childQueue = new LinkedList<>();
childQueue.addAll(rs.getChildOperators());
while (!childQueue.isEmpty()) {
  Operator child = childQueue.pop();
  if (child instanceof FileSinkOperator) {
pathToFSNum = pathToFSNum + 1;
  } else {
childQueue.addAll(child.getChildOperators());
  }
}
boolean isMultiInsert = pathToFSNum > 1 ? true : false;
LOG.debug("reducesink:" + rs + " isMultiInsert:" + isMultiInsert);
return isMultiInsert;
  }
{code}
what i am confused is above case is not a multi insert case?  Is it a case 
about spark dynamic partition pruning?
bq.Besides, I'm wondering whether it's better to avoid such order by in 
sub-queries in the first place, as it is essentially pointless.
Agree


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-31 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031007#comment-16031007
 ] 

Rui Li commented on HIVE-16600:
---

Hi [~kellyzly], thanks for the example. As I said, I agree disabling parallel 
orderBy for non multi insert is conservative. I just wanted to limit the scope 
of this JIRA. And another thing I'm not sure about your patch is how it handles 
a case like following. Will it be treated as multi insert?
{noformat}
RS - ... - FS
   - ... - FS
   - ... - Non FS
{noformat}

Besides, I'm wondering whether it's better to avoid such order by in 
sub-queries in the first place, as it is essentially pointless. Other DBs like 
SQL Server does this. I'm investigating it in HIVE-6348.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027207#comment-16027207
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: actually the algorithms in HIVE-16600.9.patch is similar as yours. 
The safe way to judge an order by limit case is 
1. verify whether there is a limit between current RS to next RS/FS in non 
multi-insert case.
2. verify whether there is a limit between current RS to jointOperator in 
multi-insert case. Here jointOperator is the operator where the branches start.


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027197#comment-16027197
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]:
i found that following example does not suit your algorithms
{code}
set hive.execution.engine=spark; 
set hive.auto.convert.join.noconditionaltask.size=20; 
set hive.spark.dynamic.partition.pruning=true;
set hive.exec.reducers.bytes.per.reducer=256;
set hive.optimize.sampling.orderby=true;

explain select * from srcpart join (select * from srcpart_date_hour order by 
hour) srcpart_date_hour1 on (srcpart.ds = srcpart_date_hour1.ds and srcpart.hr 
= srcpart_date_hour1.hr) where srcpart_date_hour1.`date` = '2008-04-08' and 
srcpart_date_hour1.hour = 11;
{code}

the logical plan 
{code}
TS[0]-FIL[15]-SEL[1]-RS[2]-SEL[3]-RS[7]-JOIN[8]-SEL[10]-FS[11]
 -SEL[17]-GBY[18]-SPARKPRUNINGSINK[19]
 -SEL[20]-GBY[21]-SPARKPRUNINGSINK[22]
TS[4]-RS[5]-JOIN[8]

{code}
{noformat}
the RS[2] is order which should enable parallel order by. But because SEL[3] 
has 3 children (RS[7],SEL[17],SEL[20]) and isMultiInsert is false, this cause 
RS[2] will not be enabled parallel order by.
{noformat}


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027194#comment-16027194
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], one example of non multi insert having branches is dynamic 
partition pruning. I agree such cases may also be eligible for parallel order 
by, but that needs further investigation. So let's limit the scope to multi 
insert here. Does that make sense?

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027139#comment-16027139
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: the algorithms you provided seems ok except 1 point
here op has branches means op.getChildOperators().size()>1? if there are more 
than 1 child, parallel order is not enabled if it is a non multi-insert case?  
I don't know whether there is a non multi insert order case which contains  an 
operator which has more than 1 child  or not.

{code}
RS rs;
Operator op=rs;
while(op!=null){
  if(op instanceof LIM){
return false;
  }
  if((op instanceof RS && op!=rs) || op instanceof FS){
return true;
  }
  if(op has branches){  
return isMultiInsert;
  }
  op=op.child;
}
return true;
{code}


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026089#comment-16026089
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12870025/HIVE-16600.9.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 10788 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[create_merge_compressed]
 (batchId=237)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar]
 (batchId=152)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5439/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5439/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5439/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12870025 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026055#comment-16026055
 ] 

Rui Li commented on HIVE-16600:
---

Hi [~kellyzly], quick question, is the v9 patch logically different from the 
algorithm I proposed? You code seems more complicated.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026027#comment-16026027
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: help review HIVE-16600.9.patch

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-26 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16026023#comment-16026023
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]:  thanks for your algorithm.  I change the algorithm by following
{noformat}
if MultiInsert
   jointOperator=getJointOperator  #jointOperator is the operator where the 
branches start.

orderByLimit(isMultiInsert, jointOperator)

/**Judge orderByLimit case:
 *  non multi-insert: If there is a Limit in the path which is from 
reduceSink to next RS/FS, return true otherwise false
 *  multi-insert: If there is a Limit in the path which is from reduceSink 
to joinOperator, return true otherwise false
   */
isOrderByLimit(isMultiInsert, jointOperator){}
{noformat}

the bug you found in HIVE-16600.8.patch is like
{noformat}
TS[0]-SEL[1]-RS[2]-SEL[3]-LIM[4]-RS[6]-FOR[7]-GBY[8]-SEL[9]-FS[11]
  -GBY[12]-SEL[13]-FS[15]

it is a multi-insert case and an order by case(LIMIT[4]  after RS[2]) and 
should not enable parallel order by. Let me explain it in above algorithm. 
first calculate whether it is a multi-insert case, if yes, calculate the 
jointOperator(FOR[7]). In orderByLimit, return true because there is LIMIT from 
the path from RS(RS[2]) to jointOperator(FOR[7]).
{noformat}



> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, 
> HIVE-16600.9.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-24 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022736#comment-16022736
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], I think there's a bug in latest patch. Suppose we have OP tree 
like this: {{RS -> LIM -> MultiInsert}}. The RS shouldn't be a parallel order 
by because it's followed by a LIM. But since this is a multi insert, we'll end 
up with parallel RS.
How about we use the following algorithm:
{code}
RS rs;
Operator op=rs;
while(op!=null){
  if(op instanceof LIM){
return false;
  }
  if((op instanceof RS && op!=rs) || op instanceof FS){
return true;
  }
  if(op has branches){
return isMultiInsert;
  }
  op=op.child;
}
return true;
{code}

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-24 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022458#comment-16022458
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12869588/HIVE-16600.8.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 10749 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[subquery_scalar]
 (batchId=151)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5413/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5413/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5413/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12869588 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, HIVE-16600.8.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-23 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021627#comment-16021627
 ] 

Xuefu Zhang commented on HIVE-16600:


Yeah, I agree with [~lirui] that multi-insert operator tree doesn't necessarily 
branch at SEL operator. It does if the selected columns in all insert branch 
are purely projections and require no shuffles. Nevertheless, it's safer not to 
make such an assumption.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020960#comment-16020960
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], I tried some multi cases and they don't necessarily branch at SEL. 
I suppose this happens if the selected columns are identical in the inserts. 
E.g.
{noformat}
from (select key from src) A 
  insert into table target1 select key
  insert into table target2 select key;

from (select key,value from src order by key) A
  insert into table target1 select key,sum(value) group by key
  insert into table target2 select key,avg(value) group by key;
{noformat}

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-22 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020655#comment-16020655
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: any comment about HIVE-16600.7.patch?

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017161#comment-16017161
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12868904/HIVE-16600.7.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10738 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=144)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] 
(batchId=97)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] 
(batchId=97)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query24] 
(batchId=231)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5345/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5345/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5345/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12868904 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-19 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16017081#comment-16017081
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: one thing you proposed in review board
bq.Does multi insert always branch at SEL like this? Please try more cases to 
verify.
I think it is yes, [~xuefuz]: can you give us you option about this? multi 
insert case
{code}
  // TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
  //  
-SEL[6]-LIM[7]-RS[8]-SEL[9]-LIM[10]-FS[11]


there is definitely a SEL(SEL[3]) after RS( multi -insert RS , here RS[2]), if 
the children size of SEL(SEL[3]) is more than 1, this is how we judge a 
multi-insert case?
{code}


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, HIVE-16600.7.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016859#comment-16016859
 ] 

Rui Li commented on HIVE-16600:
---

Thanks [~xuefuz] for your input. That's inline with my understanding. Ideally 
we should skip the order by in the sub-query because no order is guaranteed 
when we select from it.

[~kellyzly], seems we have a bug here. We can fix it in another JIRA. For now 
let's remove the test because the query plan in incorrect. Please also try the 
following:
# Add limit with same number to both inserts.
# Add limit to the sub query, i.e. {{from (select key,value from src order by 
key limit 10) ...}}

I also left some comments on RB

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015616#comment-16015616
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12868680/HIVE-16600.6.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 10730 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[materialized_view_create_rewrite]
 (batchId=236)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=144)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] 
(batchId=97)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] 
(batchId=97)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5324/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5324/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5324/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12868680 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, 
> HIVE-16600.6.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-17 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16014389#comment-16014389
 ] 

Xuefu Zhang commented on HIVE-16600:


Hi [~lirui], For your example, my understanding is that the target table isn't 
expected to contained top 5, ordered result from {{A}}. Assuming {{A}} is a 
regular table that happens to be ordered,{{select A.k from A limit 5}} can 
choose any 5 records from A. Both orderby and limit clause is to control the 
query output. In your example, the orderby clause is to control the output of 
the subquery, which is {{A}}. That means that {{A}} is expected to contain all 
ordered rows in {{src}}. The limit clause is to control the output of the 
second select statement, which has no confusion. Even if {{A}} is ordered, the 
second {{select ... from}} can completely ignore that order and return an 
unordered set of A, which is further limited to any 5 rows by the limit clause. 
I don't think {{select}} is equivalent to {{fetch first}} introduced in SQL2008 
or {{select first}} supported by some DB vendors. 

If a user wants the top-5 records from src, your example can be rewritten as {{ 
FROM src insert overwrite table target select src.k order by src.k limit 5}}, 
which may further clarify why in your example the target table is not expected 
to have top-5 records from src.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013480#comment-16013480
 ] 

Rui Li commented on HIVE-16600:
---

Hi [~xuefuz], I'd like to confirm what the standard says about this case. For a 
query {{from (select k from src order by k) A insert overwrite table target 
select A.k limit 5}}. Does the data in target has to be ordered or top 5 from 
A? Could you share your thoughts? Thanks.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013478#comment-16013478
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], thanks for the update. Just curious, have you tried adding limit 
with same number to both the inserts? Will that eliminate the extra shuffle?
bq. the result in table e2 may be different in different runtime
Then you should remove the {{select * from e2}} from your qtest. The query plan 
for such a case is still useful because it shows whether parallel order by is 
enabled for multi insert + limit.

bq. is it really better to use 1 task to execute orderby+limit
I think it's better because the data doesn't have to be shuffled again for the 
limit. But I agree it's worth some benchmark to verify.

Please add more tests. E.g. add limit to the sub-query, add order by to 
inserts, etc.
And please also create an RB for your patch.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011923#comment-16011923
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12868230/HIVE-16600.5.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 10717 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[table_nonprintable]
 (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=144)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testEnableThriftSerializeInTasks 
(batchId=225)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5277/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5277/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5277/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12868230 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-16 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011854#comment-16011854
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]:
1 question for:
bq.For orderBy + Limit query, we have two choices: use multiple reducers to get 
global order, and then shuffle to get global limit. Or we can use single 
reducer to do the order and limit together. I think the latter is better 
because we don't actually need to get a global order.

for orderby+limit case(except order by + mulit_insert +limit), is it really 
better to use 1 task to execute orderby+limit other than using parallel order 
by+ extra shuffle to collect the the result of sort +limit ? Have not run 
benchmark to test it, but if you know the detail reason, please tell me.


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, HIVE-16600.5.patch, mr.explain, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-15 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011763#comment-16011763
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]:
bq. For orderBy + Limit query, we have two choices: use multiple reducers to 
get global order, and then shuffle to get global limit. Or we can use single 
reducer to do the order and limit together. I think the latter is better 
because we don't actually need to get a global order.
agree
bq.In our multi insert example, it seems we cannot choose the latter. I suspect 
that's because there's only one limit. liyunzhang_intel, could you try adding 
limit to both inserts and see what happens?
 It seems that MR also uses extra shuffle in two limit case:
{code}
set hive.mapred.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=256;
set hive.optimize.sampling.orderby=true;


drop table if exists e1;
drop table if exists e2;


create table e1 (key string, value string);
create table e2 (key string);
FROM (select key,value from src order by key) a
INSERT OVERWRITE TABLE e1
SELECT key, value limit 5
INSERT OVERWRITE TABLE e2
SELECT key limit 10; 

select * from e1;
select * from e2;
{code}

explain 
{code}
   Stage-2 is a root stage [MAPRED]
  Stage-3 depends on stages: Stage-2 [MAPRED]
  Stage-0 depends on stages: Stage-3 [MOVE]
  Stage-4 depends on stages: Stage-0 [STATS]
  Stage-5 depends on stages: Stage-2 [MAPRED]
  Stage-1 depends on stages: Stage-5 [MOVE]
  Stage-6 depends on stages: Stage-1 [STATS]

STAGE PLANS:
  Stage: Stage-2
Map Reduce
  Map Operator Tree:
  TableScan
alias: src
GatherStats: false
Select Operator
  expressions: key (type: string), value (type: string)
  outputColumnNames: _col0, _col1
  Reduce Output Operator
key expressions: _col0 (type: string)
null sort order: a
sort order: +
tag: -1
value expressions: _col1 (type: string)
auto parallelism: false
  Path -> Alias:
hdfs://bdpe41:8020/user/hive/warehouse/src [a:src]
  Path -> Partition:
hdfs://bdpe41:8020/user/hive/warehouse/src 
  Partition
base file name: src
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
properties:
  bucket_count -1
  column.name.delimiter ,
  columns key,value
  columns.comments 'default','default'
  columns.types string:string
  file.inputformat org.apache.hadoop.mapred.TextInputFormat
  file.outputformat 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  location hdfs://bdpe41:8020/user/hive/warehouse/src
  name default.src
  numFiles 1
  numRows 0
  rawDataSize 0
  serialization.ddl struct src { string key, string value}
  serialization.format 1
  serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  totalSize 5812
  transient_lastDdlTime 1493960133
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  properties:
bucket_count -1
column.name.delimiter ,
columns key,value
columns.comments 'default','default'
columns.types string:string
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://bdpe41:8020/user/hive/warehouse/src
name default.src
numFiles 1
numRows 0
rawDataSize 0
serialization.ddl struct src { string key, string value}
serialization.format 1
serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 5812
transient_lastDdlTime 1493960133
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  name: default.src
name: default.src
  Truncated Path -> Alias:
/src [a:src]
  Needs Tagging: false
  Reduce Operator Tree:
Select Operator
  expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 (type: 
string)
  outputColumnNames: _col0, _col1
  Limit
Number of rows: 5
File Output Operator
  compressed: false
  GlobalTableId: 0
  d

[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009970#comment-16009970
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12868002/HIVE-16600.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 10698 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.llap.security.TestLlapSignerImpl.testSigning 
(batchId=287)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5256/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5256/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5256/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12868002 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-14 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009958#comment-16009958
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], my point is the behavior is different between multi insert and 
simple case, regarding whether there's extra shuffle. We disabled parallel 
order by to avoid the extra shuffle in simple case. But in multi insert, it 
seems the extra shuffle can't be avoided. So there's no point in disabling. 
Looking at the MR plan you uploaded, you can see MR doesn't disable parallel 
order by in this case.
Let me summarise my understanding and suggestions.
# For {{orderBy + Limit}} query, we have two choices: use multiple reducers to 
get global order, and then shuffle to get global limit. Or we can use single 
reducer to do the order and limit together. I think the latter is better 
because we don't actually need to get a global order.
# In our multi insert example, it seems we cannot choose the latter. I suspect 
that's because there's only one limit. [~kellyzly], could you try adding limit 
to both inserts and see what happens?
# Ideally, we should be able to use single reducer if all the inserts have 
limit (with same or similar number perhaps). If not, we shouldn't disable 
parallel order by for multi insert.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, HIVE-16600.4.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16009144#comment-16009144
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12867802/mr.explain

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5240/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5240/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5240/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-05-13 05:32:36.913
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-5240/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-05-13 05:32:36.916
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at f2fa83c HIVE-16658: TestTimestampTZ.java has missed the ASF 
header (Saijin Huang via Rui)
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at f2fa83c HIVE-16658: TestTimestampTZ.java has missed the ASF 
header (Saijin Huang via Rui)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-05-13 05:32:37.843
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12867802 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, mr.explain, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-11 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007027#comment-16007027
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12867515/HIVE-16600.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 15 failed/errored test(s), 10686 tests 
executed
*Failed tests:*
{noformat}
TestHiveServer2SessionTimeout - did not produce a TEST-*.xml file (likely timed 
out) (batchId=226)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[mapjoin2] 
(batchId=236)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=156)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=144)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainanalyze_3] 
(batchId=97)
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3] 
(batchId=97)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testAddJarDataNucleusUnCaching 
(batchId=225)
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testEnableThriftSerializeInTasks 
(batchId=225)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testConnection (batchId=237)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testIsValid (batchId=237)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testIsValidNeg (batchId=237)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testNegativeProxyAuth 
(batchId=237)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testNegativeTokenAuth 
(batchId=237)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testProxyAuth (batchId=237)
org.apache.hive.minikdc.TestJdbcWithDBTokenStore.testTokenAuth (batchId=237)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5198/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5198/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5198/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 15 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12867515 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-11 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006945#comment-16006945
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12867515/HIVE-16600.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 10684 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[columnstats_part_coltype]
 (batchId=156)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5197/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5197/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5197/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12867515 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006117#comment-16006117
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], it's not possible for Stage-2 to have extra reduce stage because 
each MR job can have at most one reduce stage. But that doesn't mean MR doesn't 
involve extra stage for the multi insert. I'm suspecting Stage-4 is the extra 
stage for MR. Please check what it does to confirm. If I'm correct, it's only 
intended to get the global limit.
bq. no, the data in the target table is ordered by alphabet not by number
The very last output of the qtest is:
{noformat}
0
100
10
100
0
103
104
103
104
0
{noformat}
which is not ordered. I guess the source table being ordered doesn't mean the 
inserted data is ordered.
bq. but i guess before the patch, there are extra reduce stage when there is 
limit in multi insert.
That's probably true. But why we disable parallel order by when there's a limit 
is to avoid an extra stage (see this 
[comment|https://issues.apache.org/jira/browse/HIVE-10458?focusedCommentId=14539299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14539299]).
 If the extra stage is needed anyway, then it makes no sense to disable 
parallel order by.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-11 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006077#comment-16006077
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui] :  thanks for review.
bq.Why do we need cast(key as double) as keyD? It's not needed in the target 
table right?
yes, cast(key as double) as keyD is unnecessary

bq.The data in the target tables is not ordered. You should add -- 
SORT_QUERY_RESULTS to the test.
no, the data in the target table is ordered by alphabet not by number as the 
type of key is string. so we need not add {{SORT_QUERY_RESULTS}}

bq.It's quite interesting that we have an extra reduce stage when there's limit 
in multi insert. 
need time to investigate. but i guess before the patch, there are extra reduce 
stage when there is limit in multi insert.

Upload HIVE-16600.3.patch, changes in HIVE-16600.3.patch is
1. modify multi_insert_parallel_orderby.q by removing cast(key as double) as 
keyD
2. update testconfiguration.properties for the test framework to pick up the 
new test


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> HIVE-16600.3.patch, mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-11 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006069#comment-16006069
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: i did not copy all the plan of mr. but in the attachment 
mr.explain.log.HIVE-16600.  But we can see that in the reduce operator tree of 
Stage-2, it contains two select operators which contains multi insert.  This 
shows that there is no extra stage in mr mode.
{code}
  Reduce Operator Tree:
Select Operator
  expressions: KEY.reducesinkkey0 (type: string), VALUE._col1 (type: 
string)
  outputColumnNames: _col0, _col2
  Select Operator
expressions: _col0 (type: string), _col2 (type: string)
outputColumnNames: _col0, _col1
File Output Operator
  compressed: false
  GlobalTableId: 1
  directory: 
hdfs://bdpe41:8020/user/hive/warehouse/e1/.hive-staging_hive_2017-05-11_15-03-16_656_7584605594727686973-1/-ext-1
  NumFilesPerFileSink: 1
  Stats Publishing Key Prefix: 
hdfs://bdpe41:8020/user/hive/warehouse/e1/.hive-staging_hive_2017-05-11_15-03-16_656_7584605594727686973-1/-ext-1/
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  properties:
COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"}
bucket_count -1
column.name.delimiter ,
columns key,value
columns.comments 
columns.types string:string
file.inputformat org.apache.hadoop.mapred.TextInputFormat
file.outputformat 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
location hdfs://bdpe41:8020/user/hive/warehouse/e1
name default.e1
numFiles 0
numRows 0
rawDataSize 0
serialization.ddl struct e1 { string key, string value}
serialization.format 1
serialization.lib 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
totalSize 0
transient_lastDdlTime 1494486196
  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  name: default.e1
  TotalFiles: 1
  GatherStats: true
  MultiFileSpray: false
  Select Operator
expressions: _col0 (type: string)
outputColumnNames: _col0
Limit
  Number of rows: 10
  File Output Operator
compressed: false
GlobalTableId: 0
directory: 
hdfs://bdpe41:8020/tmp/hive/root/7f5afded-5f75-46f9-a588-e7317a8decca/hive_2017-05-11_15-03-16_656_7584605594727686973-1/-mr-10004
NumFilesPerFileSink: 1
table:
input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
properties:
  column.name.delimiter ,
  columns _col0
  columns.types string
  escape.delim \
  serialization.lib 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
serde: 
org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
TotalFiles: 1
GatherStats: false
MultiFileSpray: false

{code}


> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> rea

[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16006065#comment-16006065
 ] 

Rui Li commented on HIVE-16600:
---

[~kellyzly], what does the {{Stage: Stage-4}} do in the file you uploaded? 
Seems it's a map only job but ends with RS, which is strange.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch, 
> mr.explain.log.HIVE-16600
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-10 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004328#comment-16004328
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]:  the build fail is not caused by my patch.
error info is
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.6.1:compile (default-compile) 
on project hive-exec: Compilation failure: Compilation failure:
[ERROR] 
/data/hiveptest/working/apache-github-source-source/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HivePlannerContext.java:[33,11]
 cannot find symbol
[ERROR] symbol:   class SubqueryConf
[ERROR] location: class 
org.apache.hadoop.hive.ql.optimizer.calcite.HivePlannerContext
[ERROR] 
/data/hiveptest/working/apache-github-source-source/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveSubQueryRemoveRule.java:[59,51]
 cannot find symbol
[ERROR] symbol:   class SubqueryConf
[ERROR] location: package org.apache.hadoop.hive.ql.optimizer.calcite
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to en
{code}

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-10 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004325#comment-16004325
 ] 

Rui Li commented on HIVE-16600:
---

Hi [~kellyzly], please check your patch, seems it doesn't build.
Meanwhile, you have to update {{testconfiguration.properties}} for the test 
framework to pick up the new test. Since the test is only intended for spark, 
you can add it in the {{spark.only.query.files}} section.
Some comments about the new qtest:
# Why do we need {{cast(key as double) as keyD}}? It's not needed in the target 
table right?
# The data in the target tables is not ordered. You should add {{-- 
SORT_QUERY_RESULTS}} to the test.
# It's quite interesting that we have an extra reduce stage when there's limit 
in multi insert. The extra stage only does the limit. And this isn't necessary 
because we're sorting with only one reducer. Per my experience, this won't 
happen with single insert. So please do some investigation on this. I think the 
extra stage can be removed, otherwise, there's no point in disabling parallel 
order by in this case.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-09 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004081#comment-16004081
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12867250/HIVE-16600.2.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5154/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5154/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5154/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit 
status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2017-05-10 05:16:14.711
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ 
PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'MAVEN_OPTS=-Xmx1g '
+ MAVEN_OPTS='-Xmx1g '
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-5154/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2017-05-10 05:16:14.713
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at a113ede HIVE-16330 : Improve plans for scalar subquery with 
aggregates (Vineet Garg via Ashutosh Chauhan)
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at a113ede HIVE-16330 : Improve plans for scalar subquery with 
aggregates (Vineet Garg via Ashutosh Chauhan)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2017-05-10 05:16:15.272
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh 
/data/hiveptest/working/scratch/build.patch
Going to apply patch with: patch -p1
patching file 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java
patching file ql/src/test/queries/clientpositive/multi_insert_parallel_order.q
patching file 
ql/src/test/results/clientpositive/spark/multi_insert_parallel_order.q.out
+ [[ maven == \m\a\v\e\n ]]
+ rm -rf /data/hiveptest/working/maven/org/apache/hive
+ mvn -B clean install -DskipTests -T 4 -q 
-Dmaven.repo.local=/data/hiveptest/working/maven
ANTLR Parser Generator  Version 3.5.2
Output file 
/data/hiveptest/working/apache-github-source-source/metastore/target/generated-sources/antlr3/org/apache/hadoop/hive/metastore/parser/FilterParser.java
 does not exist: must build 
/data/hiveptest/working/apache-github-source-source/metastore/src/java/org/apache/hadoop/hive/metastore/parser/Filter.g
org/apache/hadoop/hive/metastore/parser/Filter.g
DataNucleus Enhancer (version 4.1.17) for API "JDO"
DataNucleus Enhancer : Classpath
>>  /usr/share/maven/boot/plexus-classworlds-2.x.jar
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MDatabase
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MFieldSchema
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MType
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MTable
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MConstraint
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MSerDeInfo
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MOrder
ENHANCED (Persistable) : 
org.apache.hadoop.hive.metastore.model.MColumnDescriptor
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MStringList
ENHANCED (Persistable) : 
org.apache.hadoop.hive.metastore.model.MStorageDescriptor
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MPartition
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MIndex
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MRole
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MRoleMap
ENHANCED (Persistable) : org.apache.hadoop.hive.metastore.model.MGlobalPrivilege
ENHANCED (Persistable) : org.apache.hadoop.hive.metas

[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004074#comment-16004074
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: the object of multi_insert_parallel_order.q is to test 2 points
1. parallel order by + multi_insert
2.parallel order by +multi_insert( contain limit)

in #1, we should set parallelism  of sort more than 1 while in #2 the 
parallelism of sort is 1 because limit operator is its children.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch, HIVE-16600.2.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-09 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002260#comment-16002260
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]: thanks for review.
bq.I remember you mentioned there was something wrong when you first enable 
parallel order by for multi insert. Have you figured out what was the cause?
{code}
set hive.exec.reducers.bytes.per.reducer=256; 
set hive.optimize.sampling.orderby=true;
set hive.execution.engine=spark;
drop table if exists e1; 
drop table if exists e2; 
create table e1 (key string, keyD double);
create table e2 (key string, keyD double, value string);
FROM (select key, cast(key as double) as keyD, value from src order by key) a
INSERT OVERWRITE TABLE e1
SELECT key, COUNT(distinct value) group by key 
INSERT OVERWRITE TABLE e2
SELECT key, sum(keyD), value group by key, value;
select * from e1; 
select * from e2; 
{code}
what i was confused last week is 
the result of e1 and e2 is not order by the key when we set  
hive.exec.reducers.bytes.per.reducer=256( this makes the parallel order by).
But forget my confuse because the result of e1 and e2 should not be in order 
because this is a orderby+groupby case, the result of groupby may not be in 
order.




> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-08 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001991#comment-16001991
 ] 

Rui Li commented on HIVE-16600:
---

Thanks [~kellyzly] for working on this. I don't think test failures are related.
I remember you mentioned there was something wrong when you first enable 
parallel order by for multi insert. Have you figured out what was the cause?
Besides, please add a qtest for this. I think it should cover simple multi 
insert, as well as multi insert + limit.

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-08 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000392#comment-16000392
 ] 

liyunzhang_intel commented on HIVE-16600:
-

[~lirui]:
in my local environment(based on 54dbca6), the failed unit tests pass. I don't 
know whether the test failures are  caused by other problem.
And It shows 36% performance improvement in 
[TPCx-BB/query28.sql|https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench/blob/master/engines/hive/queries/q28/q28.sql]
 when parallel order by is enabled.



> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-07 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16000237#comment-16000237
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12866818/HIVE-16600.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 10652 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[partition_wise_fileformat6]
 (batchId=7)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=143)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5102/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5102/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5102/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12866818 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.1.patch
>
>
> multi_insert_gby.case.q
> {code}
> set hive.exec.reducers.bytes.per.reducer=256;
> set hive.optimize.sampling.orderby=true;
> drop table if exists e1;
> drop table if exists e2;
> create table e1 (key string, value string);
> create table e2 (key string);
> FROM (select key, cast(key as double) as keyD, value from src order by key) a
> INSERT OVERWRITE TABLE e1
> SELECT key, value
> INSERT OVERWRITE TABLE e2
> SELECT key;
> select * from e1;
> select * from e2;
> {code} 
> the parallelism of Sort is 1 even we enable parallel order 
> by("hive.optimize.sampling.orderby" is set as "true").  This is not 
> reasonable because the parallelism  should be calcuated by  
> [Utilities.estimateReducers|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L170]
> this is because SetSparkReducerParallelism#needSetParallelism returns false 
> when [children size of 
> RS|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L207]
>  is greater than 1.
> in this case, the children size of {{RS[2]}} is two.
> the logical plan of the case
> {code}
>TS[0]-SEL[1]-RS[2]-SEL[3]-SEL[4]-FS[5]
> -SEL[6]-FS[7]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-16600) Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel order by in multi_insert cases

2017-05-06 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-16600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999582#comment-15999582
 ] 

Hive QA commented on HIVE-16600:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12866687/HIVE-16600.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 10652 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr]
 (batchId=143)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5083/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5083/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5083/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12866687 - PreCommit-HIVE-Build

> Refactor SetSparkReducerParallelism#needSetParallelism to enable parallel 
> order by in multi_insert cases
> 
>
> Key: HIVE-16600
> URL: https://issues.apache.org/jira/browse/HIVE-16600
> Project: Hive
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: HIVE-16600.patch
>
>
> in multi_insert cases multi_insert_gby2.q, the parallelism of SORT operator 
> is 1 even we set "hive.optimize.sampling.orderby" = true.  This is because 
> the logic of SetSparkReducerParallelism#needSetParallelism does not support 
> this case.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)