[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-06-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018563#comment-14018563
 ] 

Hive QA commented on HIVE-4867:
---



{color:red}Overall{color}: -1, at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12648427/HIVE-4867.5.patch.txt

{color:red}ERROR:{color} -1 due to 43 failed/errored test(s), 5510 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_explain
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join17
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join21
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join22
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join28
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join29
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_filters
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_nulls
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_without_localtask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cross_product_check_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_explain_rearrange
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_filters_overlap
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_reorder4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_filter_on_outerjoin
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_subquery2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_test_outer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_multiMapJoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_multi_join_union
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_reduce_deduplicate_exclude_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorization_part
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_vectorized_nested_mapjoin
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_sortmerge_mapjoin_mismatch_1
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimal
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalX
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalXY
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/390/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/390/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-390/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 43 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12648427

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> HIVE-4867.3.patch.txt, HIVE-4867.4.patch.txt, HIVE-4867.5.patch.txt, 
> source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now, 
> a column may appear in both the k

[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-06-04 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018050#comment-14018050
 ] 

Hive QA commented on HIVE-4867:
---



{color:red}Overall{color}: -1, at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12648263/HIVE-4867.4.patch.txt

{color:red}ERROR:{color} -1 due to 16 failed/errored test(s), 5585 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_ppr
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_load_dyn_part1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_metadata_only_queries
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_ptf
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_scriptfile1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_schema_evolution
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimal
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalX
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalXY
org.apache.hive.jdbc.miniHS2.TestHiveServer2.testConnection
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/388/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/388/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-388/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 16 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12648263

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> HIVE-4867.3.patch.txt, HIVE-4867.4.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now, 
> a column may appear in both the key list and value list, which results in 
> unnecessary overhead for shuffling. 
> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales 
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and value list of the 
> ReduceSinkOperator. The type of ss_ticket_number is int. In this case, 
> BinarySortableSerDe adds 1 extra byte for every int in the key. 
> LazyBinarySerDe also adds overhead when recording the length of an 
> int. For every int, 10 bytes is a rough estimate of the size of the data 
> emitted from the Map phase. 
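
To make the overhead concrete, here is a minimal Java sketch of the idea in the description above. It is not Hive's implementation: the class ReduceSinkDedupSketch and both helper methods are hypothetical names, and the 5-byte per-column sizes are an assumption chosen only so that the duplicated int adds up to the roughly 10 bytes per row estimated in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only; not the actual ReduceSinkOperator code. */
public class ReduceSinkDedupSketch {

    /** Returns the value columns that are not already carried in the key list. */
    public static List<String> dedupValueColumns(List<String> keyCols, List<String> valueCols) {
        Set<String> keys = new HashSet<>(keyCols);
        List<String> remaining = new ArrayList<>();
        for (String col : valueCols) {
            if (!keys.contains(col)) {
                remaining.add(col); // only columns missing from the key stay in the value
            }
        }
        return remaining;
    }

    /** Rough bytes per emitted row; the per-column sizes are caller-supplied assumptions. */
    public static long estimateBytesPerRow(int keyColCount, int valueColCount,
                                           int bytesPerKeyCol, int bytesPerValueCol) {
        return (long) keyColCount * bytesPerKeyCol + (long) valueColCount * bytesPerValueCol;
    }

    public static void main(String[] args) {
        // Key and value lists taken from the plan above: both contain only _col0.
        List<String> keyCols = Arrays.asList("_col0");
        List<String> valueCols = Arrays.asList("_col0");
        List<String> deduped = dedupValueColumns(keyCols, valueCols);

        // ASSUMPTION: 5 bytes per int on each side, so the duplicate costs about 10 bytes.
        long before = estimateBytesPerRow(keyCols.size(), valueCols.size(), 5, 5);
        long after = estimateBytesPerRow(keyCols.size(), deduped.size(), 5, 5);

        System.out.println("value columns after dedup: " + deduped);
        System.out.println("estimated bytes per row: " + before + " -> " + after);
    }
}
```

With the key and value lists from the plan (both just _col0), the value list is empty after deduplication, so under these assumed sizes only the key bytes would be shuffled. A real change would also have to let the reduce side read the dropped columns back from the key.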



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-06-04 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017443#comment-14017443
 ] 

Ashutosh Chauhan commented on HIVE-4867:


+1

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> HIVE-4867.3.patch.txt, HIVE-4867.4.patch.txt, source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-06-03 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016972#comment-14016972
 ] 

Hive QA commented on HIVE-4867:
---



{color:red}Overall{color}: -1, at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12648073/HIVE-4867.3.patch.txt

{color:red}ERROR:{color} -1 due to 58 failed/errored test(s), 5510 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucket2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_like_view
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_display_colstats_tbllvl
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_map_ppr_multi_distinct
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_input_part7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join17
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join32
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join32_lessSize
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join33
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_filters_overlap
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_load_fs_overwrite
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_louter_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_merge3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_outer_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_pcr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_join_filter
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_union_view
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_vc
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_push_or
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_regexp_extract
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_router_join_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_sample8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_create_table_serde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_skewjoin_union_remove_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_skewjoin_union_remove_2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_smb_mapjoin_15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_ppr2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udtf_explode
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union24
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_ppr
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_23
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_25
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket5
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketmapjoin6
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_disable_merge_for_bucketing
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_reduce_deduplicate
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_authorization_ctas
org.apache.hadoop.hive.ql.exec.tez.TestTezTask.testSubmit
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input20
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input4
org.apache.hadoop.hive.ql.parse.TestParse.testParse_input5
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataPrimitiveTypes
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimal
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalX
org.apache.hive.hcatalog.pig.TestOrcHCatPigStorer.testWriteDecimalXY
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/379/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/379/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-379/

Messages:
{noformat}
Executing org.apache.hive.ptest.exe

[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-05-29 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013156#comment-14013156
 ] 

Navis commented on HIVE-4867:
-

Yes, there is a problem with mapjoin on Tez. The MR compiler replaces the RS 
with a HashSink built from the value exprs of the Join, but the Tez compiler 
uses the RS as is, assuming it has the same columns as the value exprs of the 
Join, which is not true with this patch. I need some more time to fix it.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-05-29 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013059#comment-14013059
 ] 

Ashutosh Chauhan commented on HIVE-4867:


Query output for the Tez test mrr.q has changed in the .2 patch. Not sure 
whether it was wrong before or after.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-05-28 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012082#comment-14012082
 ] 

Navis commented on HIVE-4867:
-

Sure. And this is not related to HIVE-2597. I'll check that, too.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-05-28 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14012064#comment-14012064
 ] 

Ashutosh Chauhan commented on HIVE-4867:


Can you create an RB entry for this? Also, will this fix HIVE-2597?

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-05-21 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005560#comment-14005560
 ] 

Navis commented on HIVE-4867:
-

Waiting on HIVE-7087 to be committed first.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2014-05-18 Thread Navis (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001416#comment-14001416
 ] 

Navis commented on HIVE-4867:
-

I think the patch is almost ready, but the diff file cannot be attached here 
(it is bigger than 10MB). Most of the change comes from removing duplicated 
lineage information, so I'm thinking of fixing that first.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Navis
> Attachments: HIVE-4867.1.patch.txt, source_only.txt
>
>


[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

2013-07-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711619#comment-13711619
 ] 

Yin Huai commented on HIVE-4867:


Assigning this to myself for now. If anyone wants to work on it, feel free to take it.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---
>
> Key: HIVE-4867
> URL: https://issues.apache.org/jira/browse/HIVE-4867
> Project: Hive
>  Issue Type: Improvement
> Reporter: Yin Huai
> Assignee: Yin Huai
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira