[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018563#comment-14018563 ]

Hive QA commented on HIVE-4867:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12648427/HIVE-4867.5.patch.txt

{color:red}ERROR:{color} -1 due to 43 failed/errored test(s), 5510 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_explain
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join17
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join21
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join22
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join28
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join29
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_filters
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_nulls
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_without_localtask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cross_product_check_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_explain_rearrange
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_filters_overlap
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_reorder4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_filter_on_outerjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_subquery2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_test_outer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_multiMapJoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_multi_join_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_reduce_deduplicate_exclude_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_nested_mapjoin
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_sortmerge_mapjoin_mismatch_1
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimal
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalX
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalXY
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/390/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/390/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-390/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 43 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12648427

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, HIVE-4867.3.patch.txt, HIVE-4867.4.patch.txt, HIVE-4867.5.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now,
> a column may appear in both the k
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018050#comment-14018050 ]

Hive QA commented on HIVE-4867:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12648263/HIVE-4867.4.patch.txt

{color:red}ERROR:{color} -1 due to 16 failed/errored test(s), 5585 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_ppr
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_load_dyn_part1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_metadata_only_queries
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_ptf
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_scriptfile1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_schema_evolution
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimal
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalX
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalXY
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/388/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/388/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-388/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 16 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12648263

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, HIVE-4867.3.patch.txt, HIVE-4867.4.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now,
> a column may appear in both the key list and value list, which results in
> unnecessary overhead for shuffling.
> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and value list of the
> ReduceSinkOperator. The type of ss_ticket_number is int. For this case,
> BinarySortableSerDe will introduce 1 byte more for every int in the key.
> LazyBinarySerDe will also introduce overhead when recording the length of an
> int. For every int, 10 bytes should be a rough estimate of the size of data
> emitted from the Map phase.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
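The size estimate in the description above can be worked through with a small sketch. It is only illustrative arithmetic, not Hive code: the class name, constants, and row count are hypothetical, the key cost follows the "1 byte more for every int" remark, and the value cost is an assumption chosen so that the total matches the rough 10-byte figure quoted in the description.

```java
/** Rough per-row shuffle size for one int column, with and without the duplicate value copy. */
final class ShuffleSizeSketch {
    // From the description: the key serializer adds 1 byte to each 4-byte int.
    static final int KEY_INT_BYTES = 4 + 1;
    // Assumption: the value copy costs about the same once length bookkeeping is
    // included, so key + value lands on the ~10 bytes quoted in the description.
    static final int VALUE_INT_BYTES = 5;

    /** Bytes emitted per row when the column is (or is not) repeated in the value list. */
    static int bytesPerRow(boolean columnAlsoInValue) {
        return KEY_INT_BYTES + (columnAlsoInValue ? VALUE_INT_BYTES : 0);
    }

    /** Total bytes emitted from the map phase for the given number of rows. */
    static long totalBytes(long rows, boolean columnAlsoInValue) {
        return rows * bytesPerRow(columnAlsoInValue);
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical row count, for illustration only
        System.out.println("duplicated:   " + totalBytes(rows, true) + " bytes");
        System.out.println("deduplicated: " + totalBytes(rows, false) + " bytes");
    }
}
```

Under these assumptions, dropping the value copy of a column that is already in the key roughly halves the shuffled bytes for the single-column query in the example.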
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017443#comment-14017443 ]

Ashutosh Chauhan commented on HIVE-4867:

+1

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, HIVE-4867.3.patch.txt, HIVE-4867.4.patch.txt, source_only.txt
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016972#comment-14016972 ]

Hive QA commented on HIVE-4867:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12648073/HIVE-4867.3.patch.txt

{color:red}ERROR:{color} -1 due to 58 failed/errored test(s), 5510 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucket2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_like_view
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_display_colstats_tbllvl
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_map_ppr_multi_distinct
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input_part7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join17
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join32
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join32_lessSize
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join33
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_filters_overlap
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_fs_overwrite
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_louter_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_outer_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_pcr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_join_filter
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_union_view
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_vc
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_push_or
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_regexp_extract
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_router_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_create_table_serde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_skewjoin_union_remove_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_skewjoin_union_remove_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udtf_explode
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union24
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_25
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket5
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketmapjoin6
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_disable_merge_for_bucketing
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_reduce_deduplicate
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input20
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input5
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimal
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalX
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalXY
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/379/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/379/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-379/

Messages:
{noformat}
Executing org.apache.hive.ptest.exe
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013156#comment-14013156 ]

Navis commented on HIVE-4867:
-

Yes, there is a problem with mapjoin on Tez. The MR compiler replaces the RS with a HashSink built from the value exprs of the Join, but the Tez compiler uses the RS as-is, assuming it has the same columns as the value exprs of the Join, which is not true with this patch. I need some more time to fix it.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, source_only.txt
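The mismatch described in the comment above (a consumer assuming the RS value list still equals the Join's value exprs) can be illustrated with a minimal sketch. This is not the patch's code or any Hive API: the class, method names, and the `KEY[i]` / `VALUE[i]` notation are hypothetical, and columns are modelled as plain strings. It shows that once value columns already present in the key are dropped, the reduce side needs an explicit mapping to find each original column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: deduplicate value columns against key columns and track where each one went. */
final class ReduceSinkDedupSketch {
    /** Value columns that still need to be shuffled, i.e. those not already carried by the key. */
    static List<String> dedupValues(List<String> keyCols, List<String> valueCols) {
        List<String> kept = new ArrayList<>();
        for (String col : valueCols) {
            if (!keyCols.contains(col) && !kept.contains(col)) {
                kept.add(col);
            }
        }
        return kept;
    }

    /** For each original value column, the place the reduce side has to read it from. */
    static Map<String, String> resolve(List<String> keyCols, List<String> valueCols) {
        List<String> kept = dedupValues(keyCols, valueCols);
        Map<String, String> source = new LinkedHashMap<>();
        for (String col : valueCols) {
            int keyIndex = keyCols.indexOf(col);
            source.put(col, keyIndex >= 0
                    ? "KEY[" + keyIndex + "]"
                    : "VALUE[" + kept.indexOf(col) + "]");
        }
        return source;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("_col0");
        List<String> values = List.of("_col0", "_col1"); // _col0 duplicates the key
        System.out.println("shuffled values: " + dedupValues(keys, values));
        System.out.println("read locations:  " + resolve(keys, values));
    }
}
```

Any code path that indexes directly into the shuffled value list, as the comment says the Tez compiler did, would read the wrong column after deduplication unless it goes through a mapping like `resolve`.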
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013059#comment-14013059 ]

Ashutosh Chauhan commented on HIVE-4867:

The query output for the Tez test mrr.q has changed in the .2 patch. Not sure if it was wrong before or after.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, source_only.txt
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012082#comment-14012082 ]

Navis commented on HIVE-4867:
-

Sure. And this is not related to HIVE-2597. I'll check that, too.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, source_only.txt
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012064#comment-14012064 ]

Ashutosh Chauhan commented on HIVE-4867:

Can you create an RB entry for this? Also, will this fix HIVE-2597?

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, source_only.txt
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005560#comment-14005560 ]

Navis commented on HIVE-4867:
-

Waiting on HIVE-7087 to be committed first.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, source_only.txt
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001416#comment-14001416 ]

Navis commented on HIVE-4867:
-

I think the patch is almost ready, but the diff file cannot be attached here (it is bigger than 10MB). Most of the change comes from removing duplicated lineage information, so I'm thinking of fixing that first.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, source_only.txt
[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711619#comment-13711619 ]

Yin Huai commented on HIVE-4867:

Assign to me first. If anyone wants to work on it, feel free to take it.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Yin Huai

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira