[jira] [Commented] (HIVE-4809) ReduceSinkOperator of PTFOperator can have redundant key columns
[ https://issues.apache.org/jira/browse/HIVE-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282246#comment-14282246 ]

Ashutosh Chauhan commented on HIVE-4809:

+1

ReduceSinkOperator of PTFOperator can have redundant key columns

Key: HIVE-4809
URL: https://issues.apache.org/jira/browse/HIVE-4809
Project: Hive
Issue Type: Improvement
Components: PTF-Windowing
Affects Versions: 0.11.0
Reporter: Yin Huai
Assignee: Navis
Attachments: HIVE-4809.1.patch.txt

For example, we have a simple query like this ...
{code:sql}
SELECT x.a, x.b, count(x.b) OVER (PARTITION BY x.a) FROM src x;
{code}
The plan of it is ...
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        x
          TableScan
            alias: x
            Reduce Output Operator
              key expressions:
                    expr: a
                    type: int
                    expr: a
                    type: int
              sort order: ++
              Map-reduce partition columns:
                    expr: a
                    type: int
              tag: -1
              value expressions:
                    expr: a
                    type: int
                    expr: b
                    type: string
      Reduce Operator Tree:
        Extract
          PTF Operator
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col1
                    type: string
                    expr: _wcol0
                    type: bigint
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}
The ReduceSinkOperator has a twice in its key columns. This redundancy can increase the size of the map output.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-4809) ReduceSinkOperator of PTFOperator can have redundant key columns
[ https://issues.apache.org/jira/browse/HIVE-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281680#comment-14281680 ]

Ashutosh Chauhan commented on HIVE-4809:

Can you create an RB for this?
[jira] [Commented] (HIVE-4809) ReduceSinkOperator of PTFOperator can have redundant key columns
[ https://issues.apache.org/jira/browse/HIVE-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14281613#comment-14281613 ]

Hive QA commented on HIVE-4809:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12692892/HIVE-4809.1.patch.txt

{color:red}ERROR:{color} -1 due to 13 failed/errored test(s), 7231 tests executed

*Failed tests:*
{noformat}
TestMiniTezCliDriver-script_pipe.q-insert_values_non_partitioned.q-insert_update_delete.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-scriptfile1.q-union2.q-vectorized_bucketmapjoin1.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_decimal_10_0.q-vector_decimal_trailing.q-lvj_mapjoin.q-and-12-more - did not produce a TEST-*.xml file
TestMiniTezCliDriver-vector_partitioned_date_time.q-vector_non_string_partition.q-tez_dml.q-and-12-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_map_operators.q-join1.q-bucketmapjoin7.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-infer_bucket_sort_num_buckets.q-disable_merge_for_bucketing.q-uber_reduce.q-and-1-more - did not produce a TEST-*.xml file
TestMinimrCliDriver-leftsemijoin_mr.q-bucket5.q-root_dir_external_table.q-and-1-more - did not produce a TEST-*.xml file
TestNegativeMinimrCliDriver-mapreduce_stack_trace_hadoop20.q - did not produce a TEST-*.xml file
TestNegativeMinimrCliDriver-udf_local_resource.q-mapreduce_stack_trace_turnoff_hadoop20.q-mapreduce_stack_trace.q-and-5-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udaf_histogram_numeric
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.hcatalog.streaming.TestStreaming.testTransactionBatchCommit_Json
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2409/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2409/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2409/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 13 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12692892 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-4809) ReduceSinkOperator of PTFOperator can have redundant key columns
[ https://issues.apache.org/jira/browse/HIVE-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699294#comment-13699294 ]

Yin Huai commented on HIVE-4809:

For an OVER clause, we can have partitioning columns (specified by PARTITION BY) and ordering columns (specified by ORDER BY). In the current implementation, the key columns of the ReduceSinkOperator (RS) take care of both grouping (for the partitioning columns) and ordering (for the ordering columns). So we first add all partitioning columns and then all ordering columns to the key columns of the RS. If no ordering columns are specified, we use the partitioning columns as the ordering columns, which is why a appears twice in the key expressions of the plan in the issue description.

It seems we cannot completely remove the duplicate key columns right now, because the key columns of the RS have to take care of both grouping and ordering. But we can optimize certain cases. For example, if ordering columns are not specified, we could skip assigning the partitioning columns as ordering columns.

ReduceSinkOperator of PTFOperator can have redundant key columns

Key: HIVE-4809
URL: https://issues.apache.org/jira/browse/HIVE-4809
Project: Hive
Issue Type: Improvement
Reporter: Yin Huai
Assignee: Yin Huai

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
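
The key-column construction described in Yin Huai's comment can be sketched roughly as follows. This is an illustrative Python model only, not Hive's actual Java code: the function names build_key_columns and build_key_columns_optimized are hypothetical, and the second function is just one reading of the suggested optimization for the case where no ORDER BY is given.

```python
from typing import List, Sequence


def build_key_columns(partition_cols: Sequence[str],
                      order_cols: Sequence[str]) -> List[str]:
    """Model of the behavior described in the comment (hypothetical helper).

    Partitioning columns are added first, then ordering columns. When no
    ordering columns are given, the partitioning columns are reused as
    ordering columns, which duplicates them in the key.
    """
    effective_order = list(order_cols) if order_cols else list(partition_cols)
    return list(partition_cols) + effective_order


def build_key_columns_optimized(partition_cols: Sequence[str],
                                order_cols: Sequence[str]) -> List[str]:
    """One possible reading of the suggested optimization (an assumption).

    When no ordering columns are given, the partitioning columns are not
    added a second time. Explicit ordering columns are still appended.
    """
    return list(partition_cols) + list(order_cols)


# The query from the issue: OVER (PARTITION BY x.a), no ORDER BY.
current = build_key_columns(["a"], [])
optimized = build_key_columns_optimized(["a"], [])

print(current)    # ['a', 'a']  matches the two "expr: a" key expressions
print(optimized)  # ['a']
```

With an explicit ORDER BY, both functions produce the same key list in this model, so the sketch only changes the case the comment singles out.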