[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703102#comment-14703102
 ] 

Xuefu Zhang commented on HIVE-11502:


For my own understanding, I'd like to know why changing ListKeyWrapper's 
hashcode solves the problem. I originally thought the problem was with the 
hashcode for DoubleWritable. Any explanation would be appreciated.

 Map side aggregation is extremely slow
 --

 Key: HIVE-11502
 URL: https://issues.apache.org/jira/browse/HIVE-11502
 Project: Hive
  Issue Type: Bug
  Components: Logical Optimizer, Physical Optimizer
Affects Versions: 1.2.0
Reporter: Yongzhi Chen
Assignee: Yongzhi Chen
 Attachments: HIVE-11502.1.patch, HIVE-11502.2.patch, 
 HIVE-11502.3.patch


 For a query such as the following:
 {noformat}
 create table tbl2 as 
 select col1, max(col2) as col2 
 from tbl1 group by col1;
 {noformat}
 If the group-by column has many distinct values (for example 40) and 
 is of type double, map-side aggregation is very slow. I ran the query 
 for more than 3 hours and then had to kill it.
 The same query finishes in 7 seconds if I turn off map-side aggregation with:
 {noformat}
 set hive.map.aggr = false;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-19 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704081#comment-14704081
 ] 

Yongzhi Chen commented on HIVE-11502:
-

[~xuefuz], GroupBy's aggregation hashmap uses ListKeyWrapper as its key, so it 
uses ListKeyWrapper's hashcode. The HashMap does not use DoubleWritable's 
hashcode directly, so we can intervene in between. It is also safe: 
ListKeyWrapper is only used by group-by, so it is internal to Hive.
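
The point that the HashMap only sees the wrapper's hashcode can be checked with a small experiment. This is a minimal sketch, not Hive code: Inner and Wrapper are hypothetical stand-ins for the wrapped value and the key wrapper, and the counters exist only to show which hashCode() the map actually calls.

```java
import java.util.HashMap;
import java.util.Map;

public class WrapperHashDemo {
    public static int innerHashCalls = 0;
    public static int wrapperHashCalls = 0;

    // Hypothetical stand-in for a value type whose own hashCode() is poor.
    static final class Inner {
        final double value;
        Inner(double value) { this.value = value; }
        @Override public int hashCode() { innerHashCalls++; return 0; }
        @Override public boolean equals(Object o) {
            return o instanceof Inner && Double.compare(((Inner) o).value, value) == 0;
        }
    }

    // Hypothetical stand-in for the key wrapper: it hashes the raw double itself
    // and never delegates to Inner.hashCode().
    static final class Wrapper {
        final Inner inner;
        Wrapper(Inner inner) { this.inner = inner; }
        @Override public int hashCode() {
            wrapperHashCalls++;
            return Double.hashCode(inner.value);
        }
        @Override public boolean equals(Object o) {
            return o instanceof Wrapper && ((Wrapper) o).inner.equals(inner);
        }
    }

    // Resets the counters, inserts n distinct keys, and returns the map size.
    public static int fill(int n) {
        innerHashCalls = 0;
        wrapperHashCalls = 0;
        Map<Wrapper, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new Wrapper(new Inner(i)), i);
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("size=" + fill(1000));
        System.out.println("inner hashCode calls=" + innerHashCalls);
        System.out.println("wrapper hashCode calls=" + wrapperHashCalls);
    }
}
```

The inner type's hashCode() is never invoked, so its quality does not affect how keys spread over the map's buckets.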



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-18 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701657#comment-14701657
 ] 

Yongzhi Chen commented on HIVE-11502:
-

Thanks [~csun] for reviewing the change. I have attached patch 3, which removes 
the extra line. 



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-18 Thread Chao Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701640#comment-14701640
 ] 

Chao Sun commented on HIVE-11502:
-

LGTM +1.
Can you remove the extra line in the else branch? We don't need to rerun the 
test for this change.



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-18 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702409#comment-14702409
 ] 

Yongzhi Chen commented on HIVE-11502:
-

The vectorized_parquet_types failure is not related.
It is just a random failure caused by floating-point precision differences:

 1 121 1   8   1.174970197678  2.062159062730128   90.33
---
 1 121 1   8   1.174970197678  2.0621590627301285  90.33
378c378
 3 120 1   7   1.171428578240531   1.8 90.21
---
 3 120 1   7   1.171428578240531   1.7996  90.21

Same test passed on my local machine.






[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702269#comment-14702269
 ] 

Hive QA commented on HIVE-11502:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12751070/HIVE-11502.3.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 9370 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_parquet_types
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5002/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5002/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5002/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12751070 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696862#comment-14696862
 ] 

Hive QA commented on HIVE-11502:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12750428/HIVE-11502.2.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 9357 tests executed
*Failed tests:*
{noformat}
TestDummy - did not produce a TEST-*.xml file
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4962/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4962/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4962/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12750428 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-14 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696920#comment-14696920
 ] 

Yongzhi Chen commented on HIVE-11502:
-

The TestDummy failure is not related.

It failed because of a FileNotFoundException:
 [exec] + javac -cp /home/hiveptest/54.147.251.176-hiveptest-2/maven/org/apache/hive/hive-exec/2.0.0-SNAPSHOT/hive-exec-2.0.0-SNAPSHOT.jar /tmp/UDFExampleAdd.java -d /tmp
 [exec] + jar -cf /tmp/udfexampleadd-1.0.jar -C /tmp UDFExampleAdd.class
 [exec] java.io.FileNotFoundException: /tmp/UDFExampleAdd.class (No such file or directory)
 [exec] at java.io.FileInputStream.open(Native Method)
 [exec] at java.io.FileInputStream.<init>(FileInputStream.java:146)
 [exec] at sun.tools.jar.Main.copy(Main.java:791)
 [exec] at sun.tools.jar.Main.addFile(Main.java:740)
 [exec] at sun.tools.jar.Main.create(Main.java:491)
 [exec] at sun.tools.jar.Main.run(Main.java:201)
 [exec] at sun.tools.jar.Main.main(Main.java:1177)




[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-13 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695787#comment-14695787
 ] 

Gopal V commented on HIVE-11502:


bq. I do not see risk here. You can assume the LazyDouble as a vectorized 
object which only has one element in it.

Changing the hashcode of an actual type breaks bucketed joins - the lhs & rhs of 
the join have to use the exact same hashcode.

Vectorization overrides that inside VectorHashKeyWrapper, which serves as a 
model for this fix.

The new hash computation needs to only go in place of the Arrays.hashCode() in 
the ListKeyWrapper - so that the only exposure to the uniform hashCode is 
within map-side aggregation.
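
A rough sketch of that idea, assuming the wrapper keeps its group-by keys in an Object[]: the wrapper folds double-valued keys itself and falls back to each key's own hashCode() otherwise. GroupKey and SkewedDouble are hypothetical stand-ins, not Hive's ListKeyWrapper or DoubleWritable, and the attached patch may well differ.

```java
import java.util.Arrays;

public final class GroupKey {
    /** Stand-in for a value type whose own hashCode() is skewed (illustration only). */
    public static final class SkewedDouble {
        final double value;
        public SkewedDouble(double value) { this.value = value; }
        @Override public int hashCode() {
            // Keeps only the low 32 bits of the bit pattern.
            return (int) Double.doubleToLongBits(value);
        }
        @Override public boolean equals(Object o) {
            return o instanceof SkewedDouble
                && Double.doubleToLongBits(((SkewedDouble) o).value)
                   == Double.doubleToLongBits(value);
        }
    }

    private final Object[] keys;

    public GroupKey(Object... keys) {
        this.keys = keys.clone();
    }

    // Per-element hash: fold doubles explicitly instead of trusting the element's hashCode().
    private static int elementHash(Object key) {
        if (key == null) {
            return 0;
        }
        if (key instanceof SkewedDouble) {
            long bits = Double.doubleToLongBits(((SkewedDouble) key).value);
            return (int) (bits ^ (bits >>> 32));
        }
        return key.hashCode();
    }

    // Same 31-based combination as Arrays.hashCode, but with the custom element hash.
    @Override
    public int hashCode() {
        int result = 1;
        for (Object key : keys) {
            result = 31 * result + elementHash(key);
        }
        return result;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof GroupKey && Arrays.equals(keys, ((GroupKey) other).keys);
    }
}
```

Because the override lives only in the wrapper, the element type keeps its original hashCode() for every other caller.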



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-13 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696341#comment-14696341
 ] 

Yongzhi Chen commented on HIVE-11502:
-

[~gopalv], thanks for your advice. I have attached a second patch. 



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-12 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693556#comment-14693556
 ] 

Yongzhi Chen commented on HIVE-11502:
-

The basic idea of the fix is in LazyDouble: I separate the hashcode for internal 
use from the hashcode for HDFS.
The aggregation hashmap uses LazyDouble, and I think Hadoop uses the value through 
the DoubleWritable object that is an instance variable (data) in LazyDouble. 
The change lets LazyDouble calculate its own hashcode instead of blindly using 
data's. 
I do not see a risk here. You can think of LazyDouble as a vectorized object 
with only one element in it.
Please correct me if you find anything that is not right. 
Thanks



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-11 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692440#comment-14692440
 ] 

Hive QA commented on HIVE-11502:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12749876/HIVE-11502.1.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 9348 tests executed
*Failed tests:*
{noformat}
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4925/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4925/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4925/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12749876 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-11 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692640#comment-14692640
 ] 

Yongzhi Chen commented on HIVE-11502:
-

The failure is not related:
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
Table/View 'TXNS' already exists in Schema 'APP'

[~gopalv], could you review the patch and check if the change is safe to use? 





[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-10 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680637#comment-14680637
 ] 

Gopal V commented on HIVE-11502:


[~ychena]: I've linked this issue to the known issue HADOOP-12217.

Is it possible that you're testing Hive 0.13 and 1.2 against different 
versions of Hadoop?



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-10 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680740#comment-14680740
 ] 

Gopal V commented on HIVE-11502:


A custom hashcode can be used internally in Hive (e.g. group-by), but not 
externally to Hive (bucketing into HDFS, results of hash() functions), 
because that would break external assumptions in a non-backwards-compatible way.
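
To illustrate why an externally visible hash cannot change, the sketch below maps a value to a bucket as (hash & Integer.MAX_VALUE) % numBuckets. That formula is my understanding of how Hive derives bucket numbers and should be treated as an assumption, as should every name in this class; it is not Hive code.

```java
public class BucketStabilityDemo {
    // Assumed "old" hash: keep only the low 32 bits of the bit pattern.
    public static int truncatedHash(double d) {
        return (int) Double.doubleToLongBits(d);
    }

    // Folded hash: XOR the high and low 32-bit halves.
    public static int foldedHash(double d) {
        long v = Double.doubleToLongBits(d);
        return (int) (v ^ (v >>> 32));
    }

    // Assumed bucket assignment: non-negative hash modulo the bucket count.
    public static int bucketFor(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        double key = 3.0;
        int numBuckets = 7;
        System.out.println("bucket with truncated hash: "
            + bucketFor(truncatedHash(key), numBuckets));
        System.out.println("bucket with folded hash:    "
            + bucketFor(foldedHash(key), numBuckets));
    }
}
```

The same key lands in different buckets under the two hashes, so files bucketed with one hash would no longer line up with a join side bucketed with the other.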

The reason shuffle + merge is more uniform is because it starts using [murmur 
hashes|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L366]
 for UNIFORM trait RS instead of the builtin writable hash funcs (which are 
skewed).

You will probably notice that using a vectorized input format like ORC would 
not have the issue you're hitting, since the vector transform inside the 
operator pipeline gives hive the opportunity to use per-operator specific 
optimizations.



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-10 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680783#comment-14680783
 ] 

Yongzhi Chen commented on HIVE-11502:
-

[~gopalv], I have confirmed that HIVE-7041 caused the regression. The Hadoop 
bug has been there for a long time; once Hive switched to Hadoop's hashcode, 
we inherited it. Thanks for finding the root cause by pointing out the 
Hadoop bug.

After I added the following code to 
serde/src/java/org/apache/hadoop/hive/serde2/io/DoubleWritable.java
{noformat}
  @Override
  public int hashCode() {
    long v = Double.doubleToLongBits(super.get());
    return (int) (v ^ (v >>> 32));
  }
{noformat}
the group by query finishes in 15 seconds. 
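
To show why folding the two halves matters, here is a minimal sketch comparing two ways of hashing a double. It assumes, based on my reading of HADOOP-12217, that the skewed hash simply truncates the 64-bit pattern to an int; the class and method names are made up for illustration and are not Hive or Hadoop code.

```java
import java.util.HashSet;
import java.util.Set;

public class DoubleHashSkewDemo {
    // Assumed shape of the skewed hash: keep only the low 32 bits.
    public static int truncatedHash(double d) {
        return (int) Double.doubleToLongBits(d);
    }

    // Folded hash: XOR of the high and low halves, as java.lang.Double does.
    public static int foldedHash(double d) {
        long v = Double.doubleToLongBits(d);
        return (int) (v ^ (v >>> 32));
    }

    // Counts distinct hash values over the whole-number doubles 1.0 .. n.
    public static int distinctHashes(int n, boolean folded) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            seen.add(folded ? foldedHash(i) : truncatedHash(i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 400000;
        System.out.println("truncated: " + distinctHashes(n, false));
        System.out.println("folded:    " + distinctHashes(n, true));
    }
}
```

Whole-number doubles of this size have all-zero low 32 bits, so the truncated hash collapses to a single value and every key lands in the same HashMap bucket, while the folded hash stays distinct per key.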

So next step is, how do we fix the issue now? 




[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-10 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680870#comment-14680870
 ] 

Gopal V commented on HIVE-11502:


bq. So next step is, how do we fix the issue now?

Easiest would be to use vectorization, which doesn't need any Writables in the 
inner loop.

The vector hashcode for doubles would automatically be very similar to your 
impl (from Arrays.hashCode(double[]))

{code}
int result = 1;
for (double element : a) {
    long bits = Double.doubleToLongBits(element);
    result = 31 * result + (int)(bits ^ (bits >>> 32));
}
return result;
{code}
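
As a quick check of the "very similar" claim, the sketch below compares Arrays.hashCode(double[]) on a one-element array with the folded per-value hash. The class name is made up for illustration; both JDK calls are standard.

```java
import java.util.Arrays;

public class VectorHashCheck {
    // Folded hash of a single double (the same fold java.lang.Double.hashCode uses).
    public static int foldedHash(double d) {
        long bits = Double.doubleToLongBits(d);
        return (int) (bits ^ (bits >>> 32));
    }

    // Arrays.hashCode starts from 1, so a one-element array hashes to 31 * 1 + foldedHash.
    public static boolean matchesSingleElementArray(double d) {
        return Arrays.hashCode(new double[] {d}) == 31 + foldedHash(d);
    }

    public static void main(String[] args) {
        System.out.println(matchesSingleElementArray(1.5));
        System.out.println(matchesSingleElementArray(400000.0));
    }
}
```

For a single group-by column the two hashes differ only by the constant 31, so they spread keys equally well.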



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-10 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14680720#comment-14680720
 ] 

Yongzhi Chen commented on HIVE-11502:
-

[~gopalv], I compared the relevant Hadoop code in the two versions used by 
Hive 0.13 and 1.2; there is no change on the Hadoop side for DoubleWritable. 
I think the regression may be related to HIVE-7041, which switched from Hive's 
own DoubleWritable to Hadoop's. But simply reverting that change causes 
exceptions; I am still looking into it. 



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-10 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681015#comment-14681015
 ] 

Yongzhi Chen commented on HIVE-11502:
-

[~gopalv], thanks for the workaround. But I am afraid some users do not want to 
change their input format, and this HashMap may affect mapjoin too. We helped a 
user work around this map-side aggregation issue with set hive.map.aggr = false; 
After that, the simple group-by test case performed very well, but a more 
complicated join query with a group-by subquery got stuck on mapjoin, so we had 
to have the user turn off mapjoin with set hive.auto.convert.join=false; The 
performance hit from this bug is severe: without the workarounds, none of 
the queries finish within several hours. So I think we have to fix it. 



[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-09 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14679402#comment-14679402
 ] 

Zheng Shao commented on HIVE-11502:
---

It seems that the new version of Hive introduced KeyWrapperFactory, which 
wraps keys for HashMap so that all kinds of objects can be used as HashMap 
keys.  This should not be necessary if the key objects are already capable of 
being HashMap keys (like Java primitive objects and Writable objects) where 
hashCode() and equals() are well 
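
To make the key-wrapping idea concrete, here is a minimal sketch of a 
list-style wrapper that lets an Object[] of group by values serve as a HashMap 
key. ListKeySketch and countByKey are hypothetical names for illustration; 
this is not Hive's KeyWrapperFactory code, and it assumes each element already 
has usable hashCode() and equals() methods.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a list-style key wrapper (not Hive's actual class).
final class ListKeySketch {
    private final Object[] keys;
    private final int hash;

    ListKeySketch(Object... keys) {
        this.keys = keys.clone();
        // Combines the hashCode() of every element; computed once and reused.
        this.hash = Arrays.hashCode(this.keys);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object other) {
        // Element-wise equals(); this runs for every entry that shares a bucket.
        return other instanceof ListKeySketch
                && Arrays.equals(keys, ((ListKeySketch) other).keys);
    }

    // Counts rows per composite key, the way a hash aggregation would.
    static Map<ListKeySketch, Integer> countByKey(Object[][] rows) {
        Map<ListKeySketch, Integer> counts = new HashMap<>();
        for (Object[] row : rows) {
            counts.merge(new ListKeySketch(row), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Object[][] rows = { {1.0, "a"}, {2.0, "b"}, {1.0, "a"} };
        // prints 2: the two {1.0, "a"} rows share one map entry
        System.out.println(countByKey(rows).get(new ListKeySketch(1.0, "a")));
    }
}
```

The wrapper is only as good as the hash codes of the elements it combines: if 
those collide, every lookup falls back to the equals() path.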


[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-09 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14679434#comment-14679434
 ] 

Yongzhi Chen commented on HIVE-11502:
-

Confirmed that only the double type shows the regression.  For other types 
(such as int, bigint, float) used as the group by column, there is no 
performance regression in map side aggregation.  
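
One way a double-only slowdown could arise is a weak hash for double keys (a 
suspect raised elsewhere in this thread). The sketch below assumes, purely for 
illustration, a hash that keeps only the low 32 bits of the IEEE-754 bit 
pattern, and compares it with the folding used by java.lang.Double.hashCode. 
DoubleHashSketch and its methods are hypothetical names, not Hive or Hadoop 
code.

```java
import java.util.HashSet;
import java.util.Set;

final class DoubleHashSketch {
    // Assumed weak hash: keep only the low 32 bits of the bit pattern.
    static int truncatedHash(double v) {
        return (int) Double.doubleToLongBits(v);
    }

    // Folds the high half into the low half, as java.lang.Double.hashCode does.
    static int foldedHash(double v) {
        long bits = Double.doubleToLongBits(v);
        return (int) (bits ^ (bits >>> 32));
    }

    // Number of distinct hash values produced for the doubles 1.0 .. n.
    static int distinctHashes(int n, boolean folded) {
        Set<Integer> seen = new HashSet<>();
        for (int i = 1; i <= n; i++) {
            seen.add(folded ? foldedHash(i) : truncatedHash(i));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctHashes(100_000, false)); // prints 1
        System.out.println(distinctHashes(100_000, true));  // prints 100000
    }
}
```

Small whole-number doubles have all-zero low mantissa bits, so under the 
truncating hash they all land in one bucket and each HashMap lookup degrades 
to a chain of equals() calls.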


[jira] [Commented] (HIVE-11502) Map side aggregation is extremely slow

2015-08-08 Thread Yongzhi Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14663199#comment-14663199
 ] 

Yongzhi Chen commented on HIVE-11502:
-

It is a regression: I ran the query on version 0.13.1 with hive.map.aggr set to 
true, and it finished in 20 seconds.
In the master branch, most of the time is spent in the following stack:
{noformat}
HashMap<K,V>.getEntry(Object) line: 465
HashMap<K,V>.get(Object) line: 417
PrimitiveObjectInspectorUtils.getTypeEntryFromTypeName(String) line: 373
PrimitiveTypeInfo.getPrimitiveTypeEntry() line: 85
PrimitiveTypeInfo.getPrimitiveCategory() line: 63
WritableDoubleObjectInspector(AbstractPrimitiveObjectInspector).getPrimitiveCategory() line: 58
ObjectInspectorUtils.compare(Object, ObjectInspector, Object, ObjectInspector, MapEqualComparer) line: 694
ObjectInspectorUtils.compare(Object, ObjectInspector, Object, ObjectInspector) line: 668
ListObjectsEqualComparer$FieldComparer.areEqual(Object, Object) line: 127
ListObjectsEqualComparer.areEqual(Object[], Object[]) line: 172
KeyWrapperFactory$ListKeyWrapper.equals(Object) line: 101
HashMap<K,V>.getEntry(Object) line: 467
HashMap<K,V>.get(Object) line: 417
GroupByOperator.processHashAggr(Object, ObjectInspector, KeyWrapper) line: 777
GroupByOperator.processKey(Object, ObjectInspector) line: 693
GroupByOperator.process(Object, int) line: 761
SelectOperator(Operator<T>).forward(Object, ObjectInspector) line: 837
SelectOperator.process(Object, int) line: 88
TableScanOperator(Operator<T>).forward(Object, ObjectInspector) line: 837
TableScanOperator.process(Object, int) line: 97
MapOperator$MapOpCtx.forward(Object) line: 162
MapOperator.process(Writable) line: 508
ExecMapper.map(Object, Object, OutputCollector, Reporter) line: 163
{noformat}
It seems that the heavily used 
PrimitiveObjectInspectorUtils.getTypeEntryFromTypeName(String) slows down the 
query, so I changed the code to store the PrimitiveTypeEntry as an instance 
variable in PrimitiveTypeInfo. This improves the performance a lot: the query 
can now finish in 1 hour. But it is still very slow.
I checked the 0.13.1 code; it uses a HashMap too, but is much faster. 
I do not know why the HashMap search is so slow in the master branch (and in 
1.1 or later versions). 
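
The caching change described above can be sketched roughly as follows. This is 
a simplified illustration, not the actual patch: TypeEntryCacheSketch, its 
registry, and the lookup counter are hypothetical stand-ins for 
PrimitiveTypeInfo and the name-based lookup it performs.

```java
import java.util.HashMap;
import java.util.Map;

final class TypeEntryCacheSketch {
    // Hypothetical registry standing in for a type-name -> type-entry table.
    private static final Map<String, String> REGISTRY = new HashMap<>();
    static int lookups = 0; // counts how often the registry is consulted

    static {
        REGISTRY.put("double", "DOUBLE");
        REGISTRY.put("int", "INT");
    }

    static String lookup(String typeName) {
        lookups++;
        return REGISTRY.get(typeName);
    }

    private final String typeName;
    private String cachedEntry; // resolved on first use, then reused

    TypeEntryCacheSketch(String typeName) {
        this.typeName = typeName;
    }

    // Hot-path accessor: pays for the map lookup only on the first call.
    String category() {
        if (cachedEntry == null) {
            cachedEntry = lookup(typeName);
        }
        return cachedEntry;
    }

    public static void main(String[] args) {
        TypeEntryCacheSketch info = new TypeEntryCacheSketch("double");
        for (int i = 0; i < 1_000_000; i++) {
            info.category();
        }
        System.out.println(lookups); // prints 1
    }
}
```

Caching removes the per-comparison string lookup, but it does not reduce how 
many comparisons happen, which fits the observation that the query is still 
slow afterwards.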

