[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-21 Thread Matt McCline (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901605#comment-14901605
 ] 

Matt McCline commented on HIVE-11794:
-

+1

> GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly
> -
>
> Key: HIVE-11794
> URL: https://issues.apache.org/jira/browse/HIVE-11794
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
> Assignee: Sergey Shelukhin
> Attachments: HIVE-11794.01.patch, HIVE-11794.patch
>
>
> The code in Vectorizer is as such:
> {noformat}
> boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
> {noformat}
> then, if it's reduce side:
> {noformat}
> if (isMergePartial) {
>   // Reduce Merge-Partial GROUP BY.
>   // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle.
>   // It is the first (or root) operator for its reduce task.
>
> } else {
>   // Reduce Hash GROUP BY or global aggregation.
>   ...
> {noformat}
> In fact, this logic is missing the COMPLETE mode. Both from the comment:
> {noformat}
> COMPLETE: complete 1-phase aggregation: iterate, terminate
> ...
> HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
> ...
> PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
> {noformat}
> and from the explain plan like this (the query has multiple stages of 
> aggregations over a union; the mapper does a partial hash aggregation for 
> each side of the union, which is then followed by mergepartial, and 2nd stage 
> as complete):
> {noformat}
> Map Operator Tree:
> ...
> Group By Operator
>   keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), 
> _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: 
> bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), 
> _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
>   mode: hash
>   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
> _col7, _col8, _col9, _col10, _col11, _col12
>   Reduce Output Operator
> ...
> feeding into
> Reduce Operator Tree:
>   Group By Operator
> keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: 
> int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), 
> KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), 
> KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: 
> bigint), KEY._col12 (type: bigint)
> mode: mergepartial
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
> _col7, _col8, _col9, _col10, _col11, _col12
> Group By Operator
>   aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), 
> sum(_col9), sum(_col10), sum(_col11), sum(_col12)
>   keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 
> (type: int), _col4 (type: int)
>   mode: complete
>   outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
> _col7, _col8, _col9, _col10, _col11, _col12
> {noformat}
> it seems like COMPLETE is actually the global aggregation, and HASH isn't (or 
> may not be).
> So it seems that reduce-side COMPLETE should be handled on the else-path of 
> the above if. On the map side, the code doesn't check the mode at all, as far 
> as I can see. I'm not sure whether additional code changes are necessary after 
> that; it may just work.
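
The classification problem described above can be sketched in isolation. This is a minimal illustration, not the attached patch: the `Mode` enum only mirrors the constant names of Hive's `GroupByDesc.Mode` so the example compiles on its own, and `ReducePath`, `quotedPath`, and `suggestedPath` are hypothetical names introduced here. `suggestedPath` encodes the issue's suggestion (send reduce-side COMPLETE down the else-path) as an assumption about what a corrected check might look like.

```java
final class GroupByModeSketch {
    // Mirrors the constant names of Hive's GroupByDesc.Mode (redeclared for a standalone sketch).
    enum Mode { COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL }

    // Hypothetical labels for the two branches of the reduce-side if/else quoted in the issue.
    enum ReducePath { MERGE_PARTIAL, HASH_OR_GLOBAL }

    // The check as quoted: every mode except HASH is treated as merge-partial,
    // so COMPLETE ends up on the merge-partial branch.
    static ReducePath quotedPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH);
        return isMergePartial ? ReducePath.MERGE_PARTIAL : ReducePath.HASH_OR_GLOBAL;
    }

    // Assumption: what the issue suggests, with COMPLETE handled on the else-path alongside HASH.
    static ReducePath suggestedPath(Mode mode) {
        boolean takesElsePath = (mode == Mode.HASH || mode == Mode.COMPLETE);
        return takesElsePath ? ReducePath.HASH_OR_GLOBAL : ReducePath.MERGE_PARTIAL;
    }

    public static void main(String[] args) {
        // Print both classifications side by side for every mode.
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": quoted=" + quotedPath(mode)
                    + ", suggested=" + suggestedPath(mode));
        }
    }
}
```

The two methods differ only for COMPLETE, which is the case the explain plan above exercises with the second reduce-side Group By Operator.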



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-15 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746580#comment-14746580
 ] 

Hive QA commented on HIVE-11794:




{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12756107/HIVE-11794.01.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 9444 tests executed
*Failed tests:*
{noformat}
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation
org.apache.hive.hcatalog.streaming.TestStreaming.testInterleavedTransactionBatchCommits
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchEmptyCommit
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5287/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5287/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5287/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12756107 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-15 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746615#comment-14746615
 ] 

Sergey Shelukhin commented on HIVE-11794:
-

[~mmccline] can you take a look? It adds special handling for "complete" GBY 
which basically just reuses unordered-streaming. Technically, the streaming is 
ordered but it doesn't matter. It seems to work for existing test cases and 
some more cases.
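
As a rough illustration of why ordered input "doesn't matter" to an aggregation path that makes no ordering assumption, here is a small sketch. It is not Hive's vectorized group-by code; `UnorderedSumSketch` and `sumByKey` are hypothetical names, and rows are modeled as `int[]{key, value}` pairs purely for brevity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UnorderedSumSketch {
    // Sums values per key without relying on the order in which rows arrive.
    // Each row is a two-element array: {key, value}.
    static Map<Integer, Long> sumByKey(List<int[]> rows) {
        Map<Integer, Long> sums = new HashMap<>();
        for (int[] row : rows) {
            sums.merge(row[0], (long) row[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Same rows, once grouped by key (as after a reduce shuffle) and once interleaved.
        List<int[]> ordered = Arrays.asList(
                new int[] {1, 10}, new int[] {1, 5}, new int[] {2, 7});
        List<int[]> interleaved = Arrays.asList(
                new int[] {2, 7}, new int[] {1, 10}, new int[] {1, 5});
        // Both inputs yield the same per-key sums.
        System.out.println(sumByKey(ordered).equals(sumByKey(interleaved)));
    }
}
```

An aggregation that tolerates arbitrary order is correct on sorted input as well; it simply does not exploit the ordering.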



[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-15 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14746379#comment-14746379
 ] 

Sergey Shelukhin commented on HIVE-11794:
-

Hmm, I wonder where I attached the patch?



[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-14 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744009#comment-14744009
 ] 

Sergey Shelukhin commented on HIVE-11794:
-

I'll take a look



[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-12 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742226#comment-14742226
 ] 

Hive QA commented on HIVE-11794:




{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12755514/HIVE-11794.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 9416 tests executed
*Failed tests:*
{noformat}
TestSparkClient - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vectorization_limit
org.apache.hadoop.hive.ql.optimizer.physical.TestVectorizer.testValidateNestedExpressions
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5259/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/5259/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-5259/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12755514 - PreCommit-HIVE-TRUNK-Build



[jira] [Commented] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

2015-09-11 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14741719#comment-14741719
 ] 

Sergey Shelukhin commented on HIVE-11794:
-

[~mmccline] can you take a look?
