[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14052987#comment-14052987 ]

Lefty Leverenz commented on HIVE-4002:

*hive.fetch.task.aggr* is documented in the wiki here:

* [Configuration Properties -- hive.fetch.task.aggr | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.fetch.task.aggr]

Also see doc comments on HIVE-5793 (Update hive-default.xml.template for HIVE-4002).

Fetch task aggregation for simple group by query

Key: HIVE-4002
URL: https://issues.apache.org/jira/browse/HIVE-4002
Project: Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Navis
Assignee: Navis
Priority: Minor
Fix For: 0.12.0
Attachments: HIVE-4002.D8739.1.patch, HIVE-4002.D8739.2.patch, HIVE-4002.D8739.3.patch, HIVE-4002.D8739.4.patch, HIVE-4002.patch

Aggregation queries with no group-by clause (for example, select count(*) from src) execute the final aggregation in a single reduce task. But that work is too small even for a single reducer, because most UDAFs generate just a single row per map task in map-side aggregation. If the final fetch task can aggregate the outputs from the map tasks, the shuffling time can be removed.

This optimization transforms an operator tree such as
TS-FIL-SEL-GBY1-RS-GBY2-SEL-FS + FETCH-TASK
into
TS-FIL-SEL-GBY1-FS + FETCH-TASK(GBY2-SEL-LS)

With the patch, the time taken for the auto_join_filters.q test was reduced to 6 min (from 10 min before).

--
This message was sent by Atlassian JIRA (v6.2#6252)
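The idea in the issue description, letting the fetch task perform the final aggregation over the one-row partial results written by the map tasks, can be illustrated with a small sketch. This is not Hive code: the class FetchAggregationSketch and the methods partialCount and mergePartialCounts are hypothetical names used only for illustration, and count(*) stands in for any UDAF whose map-side output is a single row.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; none of these names are Hive classes. */
class FetchAggregationSketch {

    /** Map side (the GBY1 role): each map task counts its own split and emits one row. */
    static long partialCount(List<String> split) {
        return split.size();
    }

    /** Fetch side (the GBY2 role): merge the one-row partial results, with no shuffle in between. */
    static long mergePartialCounts(List<Long> partials) {
        long total = 0L;
        for (long partial : partials) {
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two made-up input splits standing in for the inputs of two map tasks.
        List<String> split1 = Arrays.asList("a", "b", "c");
        List<String> split2 = Arrays.asList("d", "e");

        // Each map task writes a single partial row; the fetch task merges them.
        List<Long> partials = Arrays.asList(partialCount(split1), partialCount(split2));
        System.out.println(mergePartialCounts(partials));
    }
}
```

Because each map task contributes only one row, the merge step is cheap enough to run in the client-side fetch task, which is the motivation the description gives for dropping the reduce stage.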
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756015#comment-13756015 ]

Hudson commented on HIVE-4002:

FAILURE: Integrated in Hive-trunk-h0.21 #2303 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2303/])
HIVE-4002 Fetch task aggregation for simple group by query (Navis Ryu and Yin Huai via egc) (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1519306)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/UDTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchAggregation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/MapReduceCompiler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/test/queries/clientpositive/fetch_aggregation.q
* /hive/trunk/ql/src/test/results/clientpositive/fetch_aggregation.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755785#comment-13755785 ]

Hive QA commented on HIVE-4002:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12600993/HIVE-4002.patch

{color:green}SUCCESS:{color} +1 2903 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/588/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/588/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755790#comment-13755790 ]

Edward Capriolo commented on HIVE-4002:

+1. With all tests passing, we should commit. If we get caught up in another rebase, it could be weeks before we get it all settled again. This feature is off by default, so if there is an issue with it we can tackle it in a follow-up.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755821#comment-13755821 ]

Hudson commented on HIVE-4002:

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #80 (See [https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/80/])
HIVE-4002 Fetch task aggregation for simple group by query (Navis Ryu and Yin Huai via egc) (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1519306)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/UDTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchAggregation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/MapReduceCompiler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/test/queries/clientpositive/fetch_aggregation.q
* /hive/trunk/ql/src/test/results/clientpositive/fetch_aggregation.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755850#comment-13755850 ]

Hudson commented on HIVE-4002:

FAILURE: Integrated in Hive-trunk-hadoop1-ptest #147 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/147/])
HIVE-4002 Fetch task aggregation for simple group by query (Navis Ryu and Yin Huai via egc) (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1519306)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/UDTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchAggregation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/MapReduceCompiler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/test/queries/clientpositive/fetch_aggregation.q
* /hive/trunk/ql/src/test/results/clientpositive/fetch_aggregation.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755860#comment-13755860 ]

Hudson commented on HIVE-4002:

FAILURE: Integrated in Hive-trunk-hadoop2 #395 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/395/])
HIVE-4002 Fetch task aggregation for simple group by query (Navis Ryu and Yin Huai via egc) (ecapriolo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1519306)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/UDTFOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchAggregation.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/MapReduceCompiler.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/test/queries/clientpositive/fetch_aggregation.q
* /hive/trunk/ql/src/test/results/clientpositive/fetch_aggregation.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby1.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby5.q.xml
* /hive/trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755530#comment-13755530 ]

Edward Capriolo commented on HIVE-4002:

[~yhuai] [~navis] Are you two discussing possible revisions or is this patch ready to be committed?
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751330#comment-13751330 ]

Phabricator commented on HIVE-4002:

yhuai has commented on the revision HIVE-4002 [jira] Fetch task aggregation for simple group by query.

INLINE COMMENTS

ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java:493
I think that flush is only needed for blocking operators. With this optimization, the operator tree in the fetch task seems to have only a single blocking operator, which is GBY. Since GBY is the first operator in the fetch task (the operator shown in flush() in this class), I do not think we need to call flush on all operators in the operator tree. Is it possible that GBY is not the first operator?

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6985
There are other places where we are using colInfo.getInternalName(). I think it is better to also change those places if we want to use field.

ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java:582
Let's say we have a chain of operators OP1-OP2-OP3. With this change, when flush in OP1 is called, it will call its own flushOp and then call flushOp in OP2. It seems that neither flush nor flushOp in OP3 will ever be called. Also, when I introduced flush with the Correlation Optimizer, this method was not designed to propagate the signal to its children.

REVISION DETAIL
https://reviews.facebook.net/D8739

To: JIRA, navis
Cc: yhuai
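The OP1-OP2-OP3 concern in the review comment above can be made concrete with a small sketch. It is not Hive's Operator class: FlushSketchOp, flushOneLevel, and chain are hypothetical names, and the two methods only contrast a flush that recurses through flush() with one that calls flushOp() on its direct children, which is the behaviour the comment describes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an operator node; not Hive's Operator class. */
class FlushSketchOp {
    final String name;
    final List<FlushSketchOp> children = new ArrayList<>();

    FlushSketchOp(String name) {
        this.name = name;
    }

    /** Attaches a child and returns it, so chains can be built fluently. */
    FlushSketchOp addChild(FlushSketchOp child) {
        children.add(child);
        return child;
    }

    /** Per-operator hook; a blocking operator would emit its buffered rows here. */
    void flushOp(List<String> log) {
        log.add(name);
    }

    /** Recursive variant: run the local hook, then recurse through flush() on each child. */
    void flush(List<String> log) {
        flushOp(log);
        for (FlushSketchOp child : children) {
            child.flush(log);
        }
    }

    /** Variant described in the comment: only the direct children's flushOp() is called. */
    void flushOneLevel(List<String> log) {
        flushOp(log);
        for (FlushSketchOp child : children) {
            child.flushOp(log);
        }
    }

    /** Builds the chain OP1-OP2-OP3 and returns its root. */
    static FlushSketchOp chain() {
        FlushSketchOp op1 = new FlushSketchOp("OP1");
        op1.addChild(new FlushSketchOp("OP2")).addChild(new FlushSketchOp("OP3"));
        return op1;
    }

    public static void main(String[] args) {
        List<String> recursive = new ArrayList<>();
        chain().flush(recursive);
        System.out.println(recursive);

        List<String> oneLevel = new ArrayList<>();
        chain().flushOneLevel(oneLevel);
        System.out.println(oneLevel);
    }
}
```

In this sketch the recursive variant reaches all three operators, while the one-level variant stops at OP2. Whether a real operator should propagate the signal at all is a separate design question, which is the point raised about operators introduced with the Correlation Optimizer.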
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751333#comment-13751333 ]

Phabricator commented on HIVE-4002:

yhuai has commented on the revision HIVE-4002 [jira] Fetch task aggregation for simple group by query.

INLINE COMMENTS

ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java:582
I did not mean we cannot have a recursive flush method. I meant that Demux and Mux operators should not use a recursive flush method.

REVISION DETAIL
https://reviews.facebook.net/D8739

To: JIRA, navis
Cc: yhuai
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750638#comment-13750638 ]

Phabricator commented on HIVE-4002:

yhuai has commented on the revision HIVE-4002 [jira] Fetch task aggregation for simple group by query.

INLINE COMMENTS

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3631
It seems that this line is the same as line 3633.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6985
Why do we need to change getInternalName to field? If we want to use field instead of getInternalName, can you also make this change in the other places in this class?

ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java:582
Why do we need flushOp? I think it is not necessary to have flushOp. Also, can you change "an blocking operator" to "a blocking operator"? I am sorry about the typo I made...

ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java:493
I think we can just use operator.flush() to tell GBY to process its buffer.

REVISION DETAIL
https://reviews.facebook.net/D8739

To: JIRA, navis
Cc: yhuai
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13750945#comment-13750945 ]

Phabricator commented on HIVE-4002:

navis has commented on the revision HIVE-4002 [jira] Fetch task aggregation for simple group by query.

INLINE COMMENTS

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3631
Right. I'll fix that.

ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6985
It's the same thing. I just want to be more consistent.

ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java:582
I need a recursive flush method to implement this, like what the init or close method does. I think I've broken something while rebasing the patch. Can I ask which query was not working with this patch? The test framework seems not to have been working recently.

ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java:493
For this patch, flush should be called on all operators in the execution tree.

REVISION DETAIL
https://reviews.facebook.net/D8739

To: JIRA, navis
Cc: yhuai
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749714#comment-13749714 ]

Edward Capriolo commented on HIVE-4002:

{quote}
[edward@jackintosh hive-trunk]$ patch -p0 < D8739\?download\=true
patching file common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/FetchOperator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/JoinOperator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/PartitionKeySampler.java
patching file ql/src/java/org/apache/hadoop/hive/ql/exec/UDTFOperator.java
patching file ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
patching file ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchAggregation.java
patching file ql/src/java/org/apache/hadoop/hive/ql/optimizer/SimpleFetchOptimizer.java
patching file ql/src/java/org/apache/hadoop/hive/ql/parse/MapReduceCompiler.java
patching file ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
Hunk #3 succeeded at 119 (offset 9 lines).
Hunk #4 succeeded at 679 (offset 26 lines).
patching file ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java
patching file ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
Hunk #1 succeeded at 3503 (offset -19 lines).
Hunk #2 succeeded at 3609 (offset -19 lines).
Hunk #3 succeeded at 3622 (offset -19 lines).
Hunk #4 succeeded at 3634 (offset -19 lines).
Hunk #5 succeeded at 3684 (offset -19 lines).
Hunk #6 succeeded at 3713 (offset -19 lines).
Hunk #7 succeeded at 3820 (offset -19 lines).
Hunk #8 succeeded at 6964 (offset -18 lines).
Hunk #9 succeeded at 6990 (offset -18 lines).
patching file ql/src/test/queries/clientpositive/fetch_aggregation.q
patching file ql/src/test/results/clientpositive/fetch_aggregation.q.out
patching file ql/src/test/results/compiler/plan/groupby1.q.xml
Hunk #5 succeeded at 1312 (offset -10 lines).
Hunk #6 succeeded at 1326 (offset -10 lines).
Hunk #7 succeeded at 1345 (offset -10 lines).
Hunk #8 succeeded at 1426 (offset -10 lines).
Hunk #9 succeeded at 1478 (offset -10 lines).
patching file ql/src/test/results/compiler/plan/groupby2.q.xml
Hunk #10 succeeded at 1087 (offset -10 lines).
Hunk #11 succeeded at 1428 (offset -10 lines).
Hunk #12 succeeded at 1482 (offset -10 lines).
Hunk #13 succeeded at 1508 (offset -10 lines).
Hunk #14 succeeded at 1541 (offset -10 lines).
Hunk #15 succeeded at 1618 (offset -10 lines).
Hunk #16 succeeded at 1647 (offset -10 lines).
Hunk #17 succeeded at 1715 (offset -10 lines).
Hunk #18 succeeded at 1734 (offset -10 lines).
Hunk #19 succeeded at 1819 (offset -10 lines).
Hunk #20 succeeded at 1832 (offset -10 lines).
patching file ql/src/test/results/compiler/plan/groupby3.q.xml
Hunk #8 succeeded at 1299 (offset -7 lines).
Hunk #9 succeeded at 1627 (offset -7 lines).
Hunk #10 succeeded at 1640 (offset -7 lines).
Hunk #11 succeeded at 1653 (offset -7 lines).
Hunk #12 succeeded at 1695 (offset -7 lines).
Hunk #13 succeeded at 1709 (offset -7 lines).
Hunk #14 succeeded at 1723 (offset -7 lines).
Hunk #15 succeeded at 1770 (offset -7 lines).
Hunk #16 succeeded at 1846 (offset -7 lines).
Hunk #17 succeeded at 1859 (offset -7 lines).
Hunk #18 succeeded at 1872 (offset -7 lines).
Hunk #19 succeeded at 1938 (offset -7 lines).
Hunk #20 succeeded at 2144 (offset -7 lines).
Hunk #21 succeeded at 2157 (offset -7 lines).
Hunk #22 succeeded at 2170 (offset -7 lines).
patching file ql/src/test/results/compiler/plan/groupby5.q.xml
Hunk #5 succeeded at 1175 (offset -10 lines).
Hunk #6 succeeded at 1189 (offset -10 lines).
Hunk #7 succeeded at 1208 (offset -10 lines).
Hunk #8 succeeded at 1295 (offset -10 lines).
Hunk #9 succeeded at 1347 (offset -10 lines).
patching file serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java
{quote}

This did not patch perfectly cleanly. Running the tests manually now.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13749766#comment-13749766 ]

Yin Huai commented on HIVE-4002:
--------------------------------

[~appodictic] Sorry for jumping in late. It seems the changes in DemuxOperator and MuxOperator will break plans optimized by the Correlation Optimizer. Let me take a look and leave my comments on Phabricator.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13748576#comment-13748576 ]

Edward Capriolo commented on HIVE-4002:
---------------------------------------

+1 this is a very exciting feature. Will commit when tests pass.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723315#comment-13723315 ]

Edward Capriolo commented on HIVE-4002:
---------------------------------------

[~navis] Sorry I dropped the ball on this review. Can you rebase?
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702865#comment-13702865 ]

Navis commented on HIVE-4002:
-----------------------------

Yes, some threshold might be more useful.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702867#comment-13702867 ]

Edward Capriolo commented on HIVE-4002:
---------------------------------------

Testing now. The threshold can be a follow on. I will do a more critical review in the next couple of days.
[jira] [Commented] (HIVE-4002) Fetch task aggregation for simple group by query
[ https://issues.apache.org/jira/browse/HIVE-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701661#comment-13701661 ]

Edward Capriolo commented on HIVE-4002:
---------------------------------------

This is a nice feature. There are times when I know that count(distinct(col)) and other operations like the one you have requested produce small result sets, and the shuffle is the bottleneck. I do like this feature, but turning it on manually is cumbersome for the end user. I wonder if we can convert the last step at runtime somehow. (Probably not easily, but that would be nice.)
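
For readers following the thread, here is a rough sketch of the idea behind the issue: each map task emits a single partial-aggregate row, and the client-side fetch step merges those rows itself instead of shuffling them to a reducer. This is plain Python and not Hive code; the names PartialAgg, map_side_aggregate and fetch_side_merge are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class PartialAgg:
    """Hypothetical partial state for count(*) and sum(col)."""
    count: int = 0
    total: float = 0.0


def map_side_aggregate(rows: Iterable[float]) -> PartialAgg:
    """Stand-in for the map-side group-by (GBY1): one partial row per split."""
    partial = PartialAgg()
    for value in rows:
        partial.count += 1
        partial.total += value
    return partial


def fetch_side_merge(partials: Iterable[PartialAgg]) -> PartialAgg:
    """Stand-in for the final group-by (GBY2) run inside the fetch step.

    Only one small row per map task reaches this point, so merging here
    avoids the shuffle to a single reducer.
    """
    merged = PartialAgg()
    for partial in partials:
        merged.count += partial.count
        merged.total += partial.total
    return merged


# Three input splits, each handled by its own "map task".
splits: List[List[float]] = [[1, 2, 3], [4, 5], [6]]
partials = [map_side_aggregate(split) for split in splits]
result = fetch_side_merge(partials)
print(result.count, result.total)
```

The sketch only covers aggregates whose partial states merge by simple addition. It says nothing about how Hive decides when the rewrite applies, which is what the threshold discussion in this thread is about.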