[jira] [Updated] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-7633:
---
Component/s: Metastore

> Warehouse#getTablePath() doesn't handle external tables
> ---
>
> Key: HIVE-7633
> URL: https://issues.apache.org/jira/browse/HIVE-7633
> Project: Hive
>  Issue Type: Bug
>  Components: Metastore
>Affects Versions: 0.8.1, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0, 0.14.0, 
> 0.13.1
>Reporter: Joey Echeverria
>Priority: Critical
>
> Warehouse#getTablePath() takes a DB and a table name. This means it will 
> generate the wrong path for external tables. This can cause a problem if you 
> have an external table on the local file system and HDFS is not currently 
> running when trying to gather statistics.
> getTablePath() should take in the table and see if it's external and has a 
> location before just assuming it's a managed table.
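The fix the report suggests can be sketched as follows. This is an illustrative sketch, not Hive's actual patch: the `TableInfo` stand-in and the method shape are assumptions, and the real code would use the metastore `Table` and Hadoop `Path` types. The idea is simply to prefer the table's stored location (which external tables set) over the warehouse default.

```java
public class TablePathSketch {
    /** Minimal stand-in (assumption, not the metastore Table class). */
    static final class TableInfo {
        final String dbName, tableName, location; // location may be null
        TableInfo(String dbName, String tableName, String location) {
            this.dbName = dbName;
            this.tableName = tableName;
            this.location = location;
        }
    }

    /**
     * Prefer the table's own location (external tables have one) over the
     * managed-table convention <warehouse>/<db>.db/<table>.
     */
    static String getTablePath(TableInfo t, String warehouseRoot) {
        if (t.location != null && !t.location.isEmpty()) {
            return t.location; // external (or any explicitly located) table
        }
        return warehouseRoot + "/" + t.dbName + ".db/" + t.tableName;
    }

    public static void main(String[] args) {
        TableInfo ext = new TableInfo("db1", "ext_t", "file:///tmp/ext_t");
        TableInfo managed = new TableInfo("db1", "m_t", null);
        System.out.println(getTablePath(ext, "hdfs:///warehouse"));
        System.out.println(getTablePath(managed, "hdfs:///warehouse"));
    }
}
```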



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277568#comment-14277568
 ] 

Yin Huai commented on HIVE-7633:


The changes in HIVE-1537 allow users to specify the location of a table, but 
they did not update Warehouse to return that location correctly.



[jira] [Updated] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-7633:
---
Affects Version/s: 0.8.1
   0.9.0
   0.10.0
   0.11.0
   0.12.0
   0.14.0
   0.13.1



[jira] [Updated] (HIVE-7633) Warehouse#getTablePath() doesn't handle external tables

2015-01-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-7633:
---
Priority: Critical  (was: Major)



[jira] [Commented] (HIVE-6137) Hive should report that the file/path doesn’t exist when it doesn’t

2015-01-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277228#comment-14277228
 ] 

Yin Huai commented on HIVE-6137:


[~hsubramaniyan] What are the affected version(s) of this bug?

> Hive should report that the file/path doesn’t exist when it doesn’t
> ---
>
> Key: HIVE-6137
> URL: https://issues.apache.org/jira/browse/HIVE-6137
> Project: Hive
>  Issue Type: Bug
>Reporter: Hari Sankar Sivarama Subramaniyan
>Assignee: Hari Sankar Sivarama Subramaniyan
> Attachments: HIVE-6137.1.patch, HIVE-6137.2.patch, HIVE-6137.3.patch, 
> HIVE-6137.4.patch, HIVE-6137.5.patch, HIVE-6137.6.patch
>
>
> Hive should report that the file/path doesn’t exist when it doesn’t (it now 
> reports SocketTimeoutException):
> Execute a Hive DDL query with a reference to a non-existent blob (such as 
> CREATE EXTERNAL TABLE...) and check Hive logs (stderr):
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got exception: 
> java.io.IOException)
> This error message is not detailed enough. If a file doesn't exist, Hive 
> should report that it received an error while trying to locate the file.
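A minimal sketch of the kind of error surfacing the issue asks for (the class and method names here are illustrative, not Hive's code): check the location up front and name the offending path in the exception message, rather than letting a bare IOException surface.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LocationCheck {
    /** Fail early with a message that names the missing path. */
    static void requireExists(String location) throws FileNotFoundException {
        if (!Files.exists(Paths.get(location))) {
            throw new FileNotFoundException(
                "External table location does not exist: " + location);
        }
    }

    public static void main(String[] args) {
        try {
            requireExists("/no/such/blob");
        } catch (FileNotFoundException e) {
            // The message tells the user exactly which path was missing.
            System.out.println(e.getMessage());
        }
    }
}
```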





[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-10-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173804#comment-14173804
 ] 

Yin Huai commented on HIVE-7205:


[~navis] Can you update the review board? I will take a look. Thank you.

> Wrong results when union all of grouping followed by group by with 
> correlation optimization
> ---
>
> Key: HIVE-7205
> URL: https://issues.apache.org/jira/browse/HIVE-7205
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0, 0.13.1
>Reporter: dima machlin
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
> HIVE-7205.3.patch.txt, HIVE-7205.4.patch.txt
>
>
> use case :
> table TBL (a string,b string) contains single row : 'a','a'
> the following query :
> {code:sql}
> select b, sum(cc) from (
> select b,count(1) as cc from TBL group by b
> union all
> select a as b,count(1) as cc from TBL group by a
> ) z
> group by b
> {code}
> returns:
> a 1
> a 1
> when hive.optimize.correlation=true.
> If we instead set hive.optimize.correlation=false,
> it returns the correct result: a 2
> The plan with correlation optimization :
> {code:sql}
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
> (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
> TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
> (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b (TOK_QUERY 
> (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
> (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
> (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL 
> a) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
> (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
> (TOK_TABLE_OR_COL cc (TOK_GROUPBY (TOK_TABLE_OR_COL b
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Alias -> Map Operator Tree:
> null-subquery1:z-subquery1:TBL 
>   TableScan
> alias: TBL
> Select Operator
>   expressions:
> expr: b
> type: string
>   outputColumnNames: b
>   Group By Operator
> aggregations:
>   expr: count(1)
> bucketGroup: false
> keys:
>   expr: b
>   type: string
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
>   key expressions:
> expr: _col0
> type: string
>   sort order: +
>   Map-reduce partition columns:
> expr: _col0
> type: string
>   tag: 0
>   value expressions:
> expr: _col1
> type: bigint
> null-subquery2:z-subquery2:TBL 
>   TableScan
> alias: TBL
> Select Operator
>   expressions:
> expr: a
> type: string
>   outputColumnNames: a
>   Group By Operator
> aggregations:
>   expr: count(1)
> bucketGroup: false
> keys:
>   expr: a
>   type: string
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
>   key expressions:
> expr: _col0
> type: string
>   sort order: +
>   Map-reduce partition columns:
> expr: _col0
> type: string
>   tag: 1
>   value expressions:
> expr: _col1
> type: bigint
>   Reduce Operator Tree:
> Demux Operator
>   Group By Operator
> aggregations:
>   expr: count(VALUE._col0)
> bucketGroup: false
> keys:
>   expr: KEY._col0
>   type: string
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Select Operator
>   expressions:
> expr: _col0
> type: string
> expr: _col1
> type: bigint
> outputColumnNames: _col0, _col1
> {code}
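Outside Hive, the intended semantics of the query above can be checked with a small Java sketch (illustrative only; plain collections stand in for Hive's execution): counting per group in each union branch, concatenating, and re-aggregating by key should yield the single row a 2.

```java
import java.util.*;
import java.util.stream.*;

public class UnionGroupBy {
    // Semantics of: select b, sum(cc) from
    //   (select b, count(1) cc ... group by b
    //    union all
    //    select a as b, count(1) cc ... group by a) z
    // group by b
    static Map<String, Long> run(List<String[]> tbl) {
        Map<String, Long> byB = tbl.stream()                 // group by b
            .collect(Collectors.groupingBy(r -> r[1], Collectors.counting()));
        Map<String, Long> byA = tbl.stream()                 // group by a
            .collect(Collectors.groupingBy(r -> r[0], Collectors.counting()));
        // union all the two branches, then group by b and sum(cc)
        return Stream.concat(byB.entrySet().stream(), byA.entrySet().stream())
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.summingLong(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // TBL contains the single row ('a','a')
        System.out.println(run(List.of(new String[]{"a", "a"}))); // {a=2}
    }
}
```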

[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093129#comment-14093129
 ] 

Yin Huai commented on HIVE-7205:


Yeah, fixing a correctness bug is very important.

However, the current patch also introduces a significant refactoring of the 
query evaluation path, and I am not sure whether that refactoring might break 
other things. [~navis] Can you post a summary of how those operators work with 
your refactoring?


[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084215#comment-14084215
 ] 

Yin Huai commented on HIVE-7205:


Oh, I see. In the current patch, "isLastInput" takes special care of 
MuxOperator, so we will not generate wrong results. However, with this version, 
if my understanding is correct, we have to buffer rows from all tables in the 
reduce-side join operator for cases like the last query in 
correlationoptimizer15.q (the rightmost table will not be streamable, and we 
will have a higher memory footprint). I am not sure we want this behavior.

I think one thing we may want to investigate is what the minimal change that 
just fixes the bug would be. I totally agree with improving the logic of 
startGroup()/endGroup()/flush(). I guess we need to have a clear plan first.

[~ashutoshc] [~navis] I may not be able to come up with a patch soon. When will 
our next release be?


[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-03 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14084158#comment-14084158
 ] 

Yin Huai commented on HIVE-7205:


My main concern is that, because we use the rightmost table as the streamed 
table, we can generate wrong results when hive.join.emit.interval is small if 
we do not have endGroupIfNecessary.


[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-08-01 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082384#comment-14082384
 ] 

Yin Huai commented on HIVE-7205:


Not yet. I will try to find some time during the weekend.


[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-07-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060383#comment-14060383
 ] 

Yin Huai commented on HIVE-7205:


[~navis] Thank you for the patch. I have left some comments on the review 
board. In general, I feel that the logic of startGroup and endGroup is not very 
clear (my original implementation is not very clear either...). Can you explain 
the logic, so I can better understand your change? Thanks.


[jira] [Commented] (HIVE-5130) Document Correlation Optimizer in Hive wiki

2014-07-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060355#comment-14060355
 ] 

Yin Huai commented on HIVE-5130:


Thanks, [~leftylev]. Let's put it in the "Completed" section.

> Document Correlation Optimizer in Hive wiki
> ---
>
> Key: HIVE-5130
> URL: https://issues.apache.org/jira/browse/HIVE-5130
> Project: Hive
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Yin Huai
>Assignee: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-5130) Document Correlation Optimizer in Hive wiki

2014-07-13 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060292#comment-14060292
 ] 

Yin Huai commented on HIVE-5130:


Design doc in Hive wiki: 
https://cwiki.apache.org/confluence/display/Hive/Correlation+Optimizer




[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-07-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054454#comment-14054454
 ] 

Yin Huai commented on HIVE-7205:


[~navis] Simplifying the interactions between operators is good. Let me spend 
some time understanding the patch. My recent schedule is quite tight; I hope I 
can get back to you late this week. Just want to double-check: we will not have 
our next release for a while, right?


[jira] [Commented] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

2014-07-07 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054303#comment-14054303
 ] 

Yin Huai commented on HIVE-7205:


Sure. I will take a look at it.

It seems the issue is that the MuxOperator for the last GroupByOperator cannot 
correctly determine when to call flush/endGroup/processGroup on the 
GroupByOperator, because the UnionOperator creates a merging point of two 
branches in the operator tree.
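As a sanity check, the semantics the query in the quoted description should produce can be restated in plain Java, entirely outside Hive (the class and method names below are illustrative): each UNION ALL branch counts rows per key, and the outer GROUP BY sums the branch counts, so the single row ('a','a') must yield a count of 2.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java restatement of the query's intended semantics (no Hive code
// involved): each UNION ALL branch counts rows per key, then the outer
// GROUP BY sums the per-branch counts.
class UnionGroupBySemantics {

    // rows are (a, b) pairs from table TBL
    static Map<String, Long> evaluate(List<String[]> rows) {
        Map<String, Long> branch1 = new HashMap<>(); // select b, count(1) ... group by b
        Map<String, Long> branch2 = new HashMap<>(); // select a as b, count(1) ... group by a
        for (String[] r : rows) {
            branch2.merge(r[0], 1L, Long::sum);
            branch1.merge(r[1], 1L, Long::sum);
        }
        // outer query: select b, sum(cc) from (... union all ...) z group by b
        Map<String, Long> result = new HashMap<>(branch1);
        branch2.forEach((k, v) -> result.merge(k, v, Long::sum));
        return result;
    }
}
```

With hive.optimize.correlation=false Hive matches this result; the correlation-optimized plan instead emits the two partial rows.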


> Wrong results when union all of grouping followed by group by with 
> correlation optimization
> ---
>
> Key: HIVE-7205
> URL: https://issues.apache.org/jira/browse/HIVE-7205
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0, 0.13.1
>Reporter: dima machlin
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-7205.1.patch.txt, HIVE-7205.2.patch.txt, 
> HIVE-7205.3.patch.txt
>
>
> use case :
> table TBL (a string,b string) contains single row : 'a','a'
> the following query :
> {code:sql}
> select b, sum(cc) from (
> select b,count(1) as cc from TBL group by b
> union all
> select a as b,count(1) as cc from TBL group by a
> ) z
> group by b
> {code}
> returns 
> a 1
> a 1
> when hive.optimize.correlation=true;
> if we set hive.optimize.correlation=false;
> it returns the correct result: a 2
> The plan with correlation optimization :
> {code:sql}
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
> (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
> TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
> (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b (TOK_QUERY 
> (TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
> (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
> (TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL 
> a) z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
> (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
> (TOK_TABLE_OR_COL cc (TOK_GROUPBY (TOK_TABLE_OR_COL b
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
> Map Reduce
>   Alias -> Map Operator Tree:
> null-subquery1:z-subquery1:TBL 
>   TableScan
> alias: TBL
> Select Operator
>   expressions:
> expr: b
> type: string
>   outputColumnNames: b
>   Group By Operator
> aggregations:
>   expr: count(1)
> bucketGroup: false
> keys:
>   expr: b
>   type: string
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
>   key expressions:
> expr: _col0
> type: string
>   sort order: +
>   Map-reduce partition columns:
> expr: _col0
> type: string
>   tag: 0
>   value expressions:
> expr: _col1
> type: bigint
> null-subquery2:z-subquery2:TBL 
>   TableScan
> alias: TBL
> Select Operator
>   expressions:
> expr: a
> type: string
>   outputColumnNames: a
>   Group By Operator
> aggregations:
>   expr: count(1)
> bucketGroup: false
> keys:
>   expr: a
>   type: string
> mode: hash
> outputColumnNames: _col0, _col1
> Reduce Output Operator
>   key expressions:
> expr: _col0
> type: string
>   sort order: +
>   Map-reduce partition columns:
> expr: _col0
> type: string
>   tag: 1
>   value expressions:
> expr: _col1
> type: bigint
>   Reduce Operator Tree:
> Demux Operator
>   Group By Operator
> aggregations:
>   expr: count(VALUE._col0)
> bucketGroup: false
> keys:
>   expr: KEY._col0
>   type: string
> mode: mergepartial
> outputColumnNames: _col0, _col1
> Select Operator

[jira] [Created] (HIVE-7362) Enabling Correlation Optimizer by default.

2014-07-07 Thread Yin Huai (JIRA)
Yin Huai created HIVE-7362:
--

 Summary: Enabling Correlation Optimizer by default.
 Key: HIVE-7362
 URL: https://issues.apache.org/jira/browse/HIVE-7362
 Project: Hive
  Issue Type: Task
  Components: Query Processor
Reporter: Yin Huai
Assignee: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HIVE-7222) Support timestamp column statistics in ORC and extend PPD for timestamp

2014-06-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-7222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028769#comment-14028769
 ] 

Yin Huai commented on HIVE-7222:


[~prasanth_j] Unfortunately, I am not working on that. 

> Support timestamp column statistics in ORC and extend PPD for timestamp
> ---
>
> Key: HIVE-7222
> URL: https://issues.apache.org/jira/browse/HIVE-7222
> Project: Hive
>  Issue Type: Improvement
>  Components: File Formats
>Affects Versions: 0.14.0
>Reporter: Prasanth J
>  Labels: orcfile
>
> Add column statistics for timestamp columns in ORC. Also extend predicate 
> pushdown to support timestamp column evaluation.





[jira] [Resolved] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved HIVE-6631.


Resolution: Duplicate

> NPE when select a field of a struct from a table stored by ORC
> --
>
> Key: HIVE-6631
> URL: https://issues.apache.org/jira/browse/HIVE-6631
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, Serializers/Deserializers
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>
> I have a table like this ...
> {code:sql}
> create table lineitem_orc_cg
> (
> CG1 STRUCT<L_SUPPKEY:INT,
>L_COMMITDATE:STRING,
>L_RECEIPTDATE:STRING,
>L_SHIPINSTRUCT:STRING,
>L_SHIPMODE:STRING,
>L_COMMENT:STRING,
>L_TAX:float,
>L_RETURNFLAG:STRING,
>L_LINESTATUS:STRING,
>L_LINENUMBER:INT,
>L_ORDERKEY:INT>,
> CG2 STRUCT<L_EXTENDEDPRICE:float,
>L_DISCOUNT:float,
>L_SHIPDATE:STRING>
> )
> row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> stored as orc tblproperties ("orc.compress"="NONE");
> {code}
> When I want to select a field from a struct by using
> {code:sql}
> select cg1.l_comment from lineitem_orc_cg limit 1;
> {code}
> I got 
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
>   ... 22 more
> {code}





[jira] [Commented] (HIVE-6716) ORC struct throws NPE for tables with inner structs having null values

2014-03-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943843#comment-13943843
 ] 

Yin Huai commented on HIVE-6716:


OK, I have marked that one as a duplicate. Thanks.

> ORC struct throws NPE for tables with inner structs having null values 
> ---
>
> Key: HIVE-6716
> URL: https://issues.apache.org/jira/browse/HIVE-6716
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Prasanth J
>Assignee: Prasanth J
>  Labels: orcfile
> Attachments: HIVE-6716.1.patch
>
>
> ORCStruct should return null when object passed to 
> getStructFieldsDataAsList(Object obj) is null.
> {code}
> public List<Object> getStructFieldsDataAsList(Object object) {
>   OrcStruct struct = (OrcStruct) object;
>   List<Object> result = new ArrayList<Object>(struct.fields.length);
> {code}
> In the above code struct.fields will throw NPE if struct is NULL.
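The proposed guard can be sketched on a stand-in class (OrcStructSketch below is hypothetical, not the real org.apache.hadoop.hive.ql.io.orc.OrcStruct): return null when the incoming object is null instead of dereferencing struct.fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the null guard HIVE-6716 proposes, on a stand-in class.
// Without the guard, the cast-and-dereference below NPEs on a null struct.
class OrcStructSketch {
    final Object[] fields;
    OrcStructSketch(Object... fields) { this.fields = fields; }

    static List<Object> getStructFieldsDataAsList(Object object) {
        if (object == null) {
            return null; // the guard: a null struct yields null, not an NPE
        }
        OrcStructSketch struct = (OrcStructSketch) object;
        List<Object> result = new ArrayList<>(struct.fields.length);
        result.addAll(Arrays.asList(struct.fields));
        return result;
    }
}
```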





[jira] [Commented] (HIVE-6716) ORC struct throws NPE for tables with inner structs having null values

2014-03-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13943839#comment-13943839
 ] 

Yin Huai commented on HIVE-6716:


[~prasanth_j] It is the same bug as I mentioned in 
https://issues.apache.org/jira/browse/HIVE-6631, right? If so, I will mark that 
one as a duplicate.

> ORC struct throws NPE for tables with inner structs having null values 
> ---
>
> Key: HIVE-6716
> URL: https://issues.apache.org/jira/browse/HIVE-6716
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Prasanth J
>Assignee: Prasanth J
>  Labels: orcfile
> Attachments: HIVE-6716.1.patch
>
>
> ORCStruct should return null when object passed to 
> getStructFieldsDataAsList(Object obj) is null.
> {code}
> public List<Object> getStructFieldsDataAsList(Object object) {
>   OrcStruct struct = (OrcStruct) object;
>   List<Object> result = new ArrayList<Object>(struct.fields.length);
> {code}
> In the above code struct.fields will throw NPE if struct is NULL.





[jira] [Commented] (HIVE-6432) Remove deprecated methods in HCatalog

2014-03-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936175#comment-13936175
 ] 

Yin Huai commented on HIVE-6432:


I tried to generate the tarball with 
{code}
mvn clean package -DskipTests -Phadoop-1 -Pdist
{code}
and got the following error
{code}
[ERROR] Failed to execute goal on project hive-packaging: Could not resolve 
dependencies for project org.apache.hive:hive-packaging:pom:0.14.0-SNAPSHOT: 
Failure to find 
org.apache.hive.hcatalog:hive-hcatalog-hbase-storage-handler:jar:0.14.0-SNAPSHOT
 in http://repository.apache.org/snapshots was cached in the local repository, 
resolution will not be reattempted until the update interval of 
apache.snapshots has elapsed or updates are forced -> [Help 1]
{code}

I removed this entry 
(https://github.com/apache/hive/blob/trunk/packaging/pom.xml#L135) and this 
entry 
(https://github.com/apache/hive/blob/trunk/packaging/src/main/assembly/bin.xml#L57)
to make the packaging work. Is there any other update needed?

> Remove deprecated methods in HCatalog
> -
>
> Key: HIVE-6432
> URL: https://issues.apache.org/jira/browse/HIVE-6432
> Project: Hive
>  Issue Type: Task
>  Components: HCatalog
>Affects Versions: 0.14.0
>Reporter: Sushanth Sowmyan
>Assignee: Sushanth Sowmyan
> Fix For: 0.14.0
>
> Attachments: HIVE-6432.patch, HIVE-6432.wip.1.patch, 
> HIVE-6432.wip.2.patch, hcat.6432.test.out
>
>
> There are a lot of methods in HCatalog that have been deprecated in HCatalog 
> 0.5, and some that were recently deprecated in Hive 0.11 (joint release with 
> HCatalog).
> The goal for HCatalog deprecation is that in general, after something has 
> been deprecated, it is expected to stay around for 2 releases, which means 
> hive-0.13 will be the last release to ship with all the methods that were 
> deprecated in hive-0.11 (the org.apache.hcatalog.* files should all be 
> removed afterwards), and it is also good for us to clean out and nuke all 
> other older deprecated methods.
> We should take this on early in a dev/release cycle to allow us time to 
> resolve all fallout, so I propose that we remove all HCatalog deprecated 
> methods after we branch out 0.13 and 0.14 becomes trunk.





[jira] [Commented] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935455#comment-13935455
 ] 

Yin Huai commented on HIVE-6668:


TestConditionalResolverCommonJoin cannot catch this bug.

> When auto join convert is on and noconditionaltask is off, 
> ConditionalResolverCommonJoin fails to resolve map joins.
> 
>
> Key: HIVE-6668
> URL: https://issues.apache.org/jira/browse/HIVE-6668
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 0.13.0
>
>
> I tried the following query today ...
> {code:sql}
> set mapred.job.map.memory.mb=2048;
> set mapred.job.reduce.memory.mb=2048;
> set mapred.map.child.java.opts=-server -Xmx3072m 
> -Djava.net.preferIPv4Stack=true;
> set mapred.reduce.child.java.opts=-server -Xmx3072m 
> -Djava.net.preferIPv4Stack=true;
> set mapred.reduce.tasks=60;
> set hive.stats.autogather=false;
> set hive.exec.parallel=false;
> set hive.enforce.bucketing=true;
> set hive.enforce.sorting=true;
> set hive.map.aggr=true;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.mapred.reduce.tasks.speculative.execution=false;
> set hive.auto.convert.join=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.sortmerge.join.noconditionaltask=false;
> set hive.auto.convert.join.noconditionaltask=false;
> set hive.auto.convert.join.noconditionaltask.size=1;
> set hive.optimize.reducededuplication=true;
> set hive.optimize.reducededuplication.min.reducer=1;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> set hive.mapjoin.smalltable.filesize=4500;
> set hive.optimize.index.filter=false;
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.correlation=false;
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by i_item_id, s_state with rollup
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> The log shows ...
> {code}
> 14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
> driver alias (threshold : 4500, length mapping : {store=94175, 
> store_sales=48713909726, item=39798667, customer_demographics=1660831, 
> date_dim=2275902})
> Stage-27 is filtered out by condition resolver.
> 14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
> resolver.
> Stage-28 is filtered out by condition resolver.
> 14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
> resolver.
> Stage-3 is selected by condition resolver.
> {code}
> Stage-3 is a reduce join. Actually, the resolver should pick the map join





[jira] [Commented] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935397#comment-13935397
 ] 

Yin Huai commented on HIVE-6668:


It seems the set of aliases returned from this line 
(https://github.com/apache/hive/blob/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverCommonJoin.java#L178)
is empty.

> When auto join convert is on and noconditionaltask is off, 
> ConditionalResolverCommonJoin fails to resolve map joins.
> 
>
> Key: HIVE-6668
> URL: https://issues.apache.org/jira/browse/HIVE-6668
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 0.13.0
>
>
> I tried the following query today ...
> {code:sql}
> set mapred.job.map.memory.mb=2048;
> set mapred.job.reduce.memory.mb=2048;
> set mapred.map.child.java.opts=-server -Xmx3072m 
> -Djava.net.preferIPv4Stack=true;
> set mapred.reduce.child.java.opts=-server -Xmx3072m 
> -Djava.net.preferIPv4Stack=true;
> set mapred.reduce.tasks=60;
> set hive.stats.autogather=false;
> set hive.exec.parallel=false;
> set hive.enforce.bucketing=true;
> set hive.enforce.sorting=true;
> set hive.map.aggr=true;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.mapred.reduce.tasks.speculative.execution=false;
> set hive.auto.convert.join=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.sortmerge.join.noconditionaltask=false;
> set hive.auto.convert.join.noconditionaltask=false;
> set hive.auto.convert.join.noconditionaltask.size=1;
> set hive.optimize.reducededuplication=true;
> set hive.optimize.reducededuplication.min.reducer=1;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> set hive.mapjoin.smalltable.filesize=4500;
> set hive.optimize.index.filter=false;
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.correlation=false;
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by i_item_id, s_state with rollup
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> The log shows ...
> {code}
> 14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
> driver alias (threshold : 4500, length mapping : {store=94175, 
> store_sales=48713909726, item=39798667, customer_demographics=1660831, 
> date_dim=2275902})
> Stage-27 is filtered out by condition resolver.
> 14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
> resolver.
> Stage-28 is filtered out by condition resolver.
> 14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
> resolver.
> Stage-3 is selected by condition resolver.
> {code}
> Stage-3 is a reduce join. Actually, the resolver should pick the map join





[jira] [Commented] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935290#comment-13935290
 ] 

Yin Huai commented on HIVE-6668:


I guess it was broken by HIVE-6403 or HIVE-6144.

> When auto join convert is on and noconditionaltask is off, 
> ConditionalResolverCommonJoin fails to resolve map joins.
> 
>
> Key: HIVE-6668
> URL: https://issues.apache.org/jira/browse/HIVE-6668
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>Priority: Blocker
> Fix For: 0.13.0
>
>
> I tried the following query today ...
> {code:sql}
> set mapred.job.map.memory.mb=2048;
> set mapred.job.reduce.memory.mb=2048;
> set mapred.map.child.java.opts=-server -Xmx3072m 
> -Djava.net.preferIPv4Stack=true;
> set mapred.reduce.child.java.opts=-server -Xmx3072m 
> -Djava.net.preferIPv4Stack=true;
> set mapred.reduce.tasks=60;
> set hive.stats.autogather=false;
> set hive.exec.parallel=false;
> set hive.enforce.bucketing=true;
> set hive.enforce.sorting=true;
> set hive.map.aggr=true;
> set hive.optimize.bucketmapjoin=true;
> set hive.optimize.bucketmapjoin.sortedmerge=true;
> set hive.mapred.reduce.tasks.speculative.execution=false;
> set hive.auto.convert.join=true;
> set hive.auto.convert.sortmerge.join=true;
> set hive.auto.convert.sortmerge.join.noconditionaltask=false;
> set hive.auto.convert.join.noconditionaltask=false;
> set hive.auto.convert.join.noconditionaltask.size=1;
> set hive.optimize.reducededuplication=true;
> set hive.optimize.reducededuplication.min.reducer=1;
> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> set hive.mapjoin.smalltable.filesize=4500;
> set hive.optimize.index.filter=false;
> set hive.vectorized.execution.enabled=false;
> set hive.optimize.correlation=false;
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by i_item_id, s_state with rollup
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> The log shows ...
> {code}
> 14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
> driver alias (threshold : 4500, length mapping : {store=94175, 
> store_sales=48713909726, item=39798667, customer_demographics=1660831, 
> date_dim=2275902})
> Stage-27 is filtered out by condition resolver.
> 14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
> resolver.
> Stage-28 is filtered out by condition resolver.
> 14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
> resolver.
> Stage-3 is selected by condition resolver.
> {code}
> Stage-3 is a reduce join. Actually, the resolver should pick the map join





[jira] [Created] (HIVE-6668) When auto join convert is on and noconditionaltask is off, ConditionalResolverCommonJoin fails to resolve map joins.

2014-03-14 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6668:
--

 Summary: When auto join convert is on and noconditionaltask is 
off, ConditionalResolverCommonJoin fails to resolve map joins.
 Key: HIVE-6668
 URL: https://issues.apache.org/jira/browse/HIVE-6668
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.13.0, 0.14.0
Reporter: Yin Huai
Priority: Blocker
 Fix For: 0.13.0


I tried the following query today ...
{code:sql}
set mapred.job.map.memory.mb=2048;
set mapred.job.reduce.memory.mb=2048;
set mapred.map.child.java.opts=-server -Xmx3072m 
-Djava.net.preferIPv4Stack=true;
set mapred.reduce.child.java.opts=-server -Xmx3072m 
-Djava.net.preferIPv4Stack=true;

set mapred.reduce.tasks=60;

set hive.stats.autogather=false;
set hive.exec.parallel=false;
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
set hive.map.aggr=true;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.mapred.reduce.tasks.speculative.execution=false;
set hive.auto.convert.join=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.sortmerge.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask=false;
set hive.auto.convert.join.noconditionaltask.size=1;
set hive.optimize.reducededuplication=true;
set hive.optimize.reducededuplication.min.reducer=1;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.mapjoin.smalltable.filesize=4500;

set hive.optimize.index.filter=false;
set hive.vectorized.execution.enabled=false;
set hive.optimize.correlation=false;
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by i_item_id, s_state with rollup
order by
   i_item_id,
   s_state
limit 100;
{code}

The log shows ...
{code}
14/03/14 17:05:02 INFO plan.ConditionalResolverCommonJoin: Failed to resolve 
driver alias (threshold : 4500, length mapping : {store=94175, 
store_sales=48713909726, item=39798667, customer_demographics=1660831, 
date_dim=2275902})
Stage-27 is filtered out by condition resolver.
14/03/14 17:05:02 INFO exec.Task: Stage-27 is filtered out by condition 
resolver.
Stage-28 is filtered out by condition resolver.
14/03/14 17:05:02 INFO exec.Task: Stage-28 is filtered out by condition 
resolver.
Stage-3 is selected by condition resolver.
{code}
Stage-3 is a reduce join. Actually, the resolver should pick the map join
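For illustration only (this is not Hive's ConditionalResolverCommonJoin; names below are hypothetical), the resolution rule the log suggests can be sketched as: a candidate driver alias is viable for a map join only if the combined size of the remaining tables fits under the small-table threshold; when no alias qualifies, the common (reduce-side) join is kept.

```java
import java.util.Map;

// Illustrative sketch of driver-alias resolution for conditional map joins:
// for each candidate big (streamed) table, the other tables must be small
// enough, in total, to be hashed in memory under the given threshold.
class CommonJoinResolverSketch {

    // returns the chosen driver alias, or null to fall back to the reduce join
    static String resolveDriverAlias(Map<String, Long> aliasToSize, long threshold) {
        long total = aliasToSize.values().stream().mapToLong(Long::longValue).sum();
        String best = null;
        for (Map.Entry<String, Long> e : aliasToSize.entrySet()) {
            long othersSize = total - e.getValue(); // tables that would be hashed
            if (othersSize <= threshold
                    && (best == null || e.getValue() > aliasToSize.get(best))) {
                best = e.getKey(); // prefer streaming the largest qualifying table
            }
        }
        return best;
    }
}
```

With the sizes and the 4500-byte threshold from the log above, no alias qualifies under this rule.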





[jira] [Commented] (HIVE-6632) ORC should be able to only read needed fields in a complex column

2014-03-12 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13931978#comment-13931978
 ] 

Yin Huai commented on HIVE-6632:


Does Parquet have the same issue?

> ORC should be able to only read needed fields in a complex column
> -
>
> Key: HIVE-6632
> URL: https://issues.apache.org/jira/browse/HIVE-6632
> Project: Hive
>  Issue Type: Improvement
>Reporter: Yin Huai
>
> Currently, we use a string of ids to record needed columns. However, this 
> string cannot record needed fields of a complex column. Although ORC 
> decomposes a complex column to multiple sub-columns, it has to load the 
> entire complex column if only a single field of this complex column is needed.





[jira] [Created] (HIVE-6632) ORC should be able to only read needed fields in a complex column

2014-03-12 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6632:
--

 Summary: ORC should be able to only read needed fields in a 
complex column
 Key: HIVE-6632
 URL: https://issues.apache.org/jira/browse/HIVE-6632
 Project: Hive
  Issue Type: Improvement
Reporter: Yin Huai


Currently, we use a string of ids to record needed columns. However, this 
string cannot record needed fields of a complex column. Although ORC decomposes 
a complex column to multiple sub-columns, it has to load the entire complex 
column if only a single field of this complex column is needed.
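One way to see the gap (a hedged sketch; the tree model below is illustrative, not ORC's actual API): ORC numbers sub-columns by a pre-order walk of the type tree, so pruning a single struct field amounts to selecting only that field's subtree of ids, which a flat string of top-level column ids cannot express.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative type-tree node: ids are assigned by a pre-order walk, mirroring
// how a columnar format can number the sub-columns of a complex type.
class ColumnIdSketch {
    final String name;
    final List<ColumnIdSketch> children = new ArrayList<>();
    int id; // assigned by pre-order numbering

    ColumnIdSketch(String name, ColumnIdSketch... kids) {
        this.name = name;
        for (ColumnIdSketch k : kids) children.add(k);
    }

    // returns the next free id after numbering this subtree
    int assignIds(int next) {
        id = next++;
        for (ColumnIdSketch c : children) next = c.assignIds(next);
        return next;
    }

    // maps dotted field paths to their pre-order ids
    static void collect(ColumnIdSketch node, Map<String, Integer> out, String prefix) {
        String path = prefix.isEmpty() ? node.name : prefix + "." + node.name;
        out.put(path, node.id);
        for (ColumnIdSketch c : node.children) collect(c, out, path);
    }
}
```

Selecting a nested field then means including only its path's subtree of ids, rather than every id under the top-level complex column.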





[jira] [Updated] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6631:
---

Description: 
I have a table like this ...
{code:sql}
create table lineitem_orc_cg
(
CG1 STRUCT<L_SUPPKEY:INT,
           L_COMMITDATE:STRING,
           L_RECEIPTDATE:STRING,
           L_SHIPINSTRUCT:STRING,
           L_SHIPMODE:STRING,
           L_COMMENT:STRING,
           L_TAX:float,
           L_RETURNFLAG:STRING,
           L_LINESTATUS:STRING,
           L_LINENUMBER:INT,
           L_ORDERKEY:INT>,
CG2 STRUCT<L_EXTENDEDPRICE:float,
           L_DISCOUNT:float,
           L_SHIPDATE:STRING>
)
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as orc tblproperties ("orc.compress"="NONE");
{code}
When I want to select a field from a struct by using
{code:sql}
select cg1.l_comment from lineitem_orc_cg limit 1;
{code}

I got 
{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
... 22 more
{code}

  was:
I have two tables lineitem_orc_cg
{code:sql}
create table lineitem_orc_cg
(
CG1 STRUCT<L_SUPPKEY:INT,
           L_COMMITDATE:STRING,
           L_RECEIPTDATE:STRING,
           L_SHIPINSTRUCT:STRING,
           L_SHIPMODE:STRING,
           L_COMMENT:STRING,
           L_TAX:float,
           L_RETURNFLAG:STRING,
           L_LINESTATUS:STRING,
           L_LINENUMBER:INT,
           L_ORDERKEY:INT>,
CG2 STRUCT<L_EXTENDEDPRICE:float,
           L_DISCOUNT:float,
           L_SHIPDATE:STRING>
)
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as orc tblproperties ("orc.compress"="NONE");
{code}
When I want to select a field from a struct by using
{code:sql}
select cg1.l_comment from lineitem_orc_cg limit 1;
{code}

I got 
{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
... 22 more
{code}


> NPE when select a field of a struct from a table stored by ORC
> --
>
> Key: HIVE-6631
> URL: https://issues.apache.org/jira/browse/HIVE-6631
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, Serializers/Deserializers
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>
> I have a table like this ...
> {code:sql}
> create table lineitem_orc_cg
> (
> CG1 STRUCT<L_SUPPKEY:INT,
>L_COMMITDATE:STRING,
>L_RECEIPTDATE:STRING,
>L_SHIPINSTRUCT:STRING,
>L_SHIPMODE:STRING,
>L_COMMENT:STRING,
>L_TAX:float,
>L_RETURNFLAG:STRING,
>L_LINESTATUS:STRING,
>L_LINENUMBER:INT,
>L_ORDERKEY:INT>,
> CG2 STRUCT<L_EXTENDEDPRICE:float,
>L_DISCOUNT:float,
>L_SHIPDATE:STRING>
> )
> row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> stored as orc tblproperties ("orc.compress"="NONE");
> {code}
> When I select a field of the struct using
> {code:sql}
> select cg1.l_comment from lineitem_orc_cg limit 1;
> {code}
> I got 
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Op

[jira] [Updated] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6631:
---

Component/s: Serializers/Deserializers
 Query Processor

> NPE when select a field of a struct from a table stored by ORC
> --
>
> Key: HIVE-6631
> URL: https://issues.apache.org/jira/browse/HIVE-6631
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, Serializers/Deserializers
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>
> I have a table lineitem_orc_cg:
> {code:sql}
> create table lineitem_orc_cg
> (
> CG1 STRUCT<L_SUPPKEY:INT,
>L_COMMITDATE:STRING,
>L_RECEIPTDATE:STRING,
>L_SHIPINSTRUCT:STRING,
>L_SHIPMODE:STRING,
>L_COMMENT:STRING,
>L_TAX:float,
>L_RETURNFLAG:STRING,
>L_LINESTATUS:STRING,
>L_LINENUMBER:INT,
>L_ORDERKEY:INT>,
> CG2 STRUCT<L_EXTENDEDPRICE:float,
>L_DISCOUNT:float,
>L_SHIPDATE:STRING>
> )
> row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> stored as orc tblproperties ("orc.compress"="NONE");
> {code}
> When I select a field of the struct using
> {code:sql}
> select cg1.l_comment from lineitem_orc_cg limit 1;
> {code}
> I got 
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
>   ... 22 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6631:
---

Affects Version/s: 0.14.0
   0.13.0

> NPE when select a field of a struct from a table stored by ORC
> --
>
> Key: HIVE-6631
> URL: https://issues.apache.org/jira/browse/HIVE-6631
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor, Serializers/Deserializers
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Yin Huai
>
> I have a table lineitem_orc_cg:
> {code:sql}
> create table lineitem_orc_cg
> (
> CG1 STRUCT<L_SUPPKEY:INT,
>L_COMMITDATE:STRING,
>L_RECEIPTDATE:STRING,
>L_SHIPINSTRUCT:STRING,
>L_SHIPMODE:STRING,
>L_COMMENT:STRING,
>L_TAX:float,
>L_RETURNFLAG:STRING,
>L_LINESTATUS:STRING,
>L_LINENUMBER:INT,
>L_ORDERKEY:INT>,
> CG2 STRUCT<L_EXTENDEDPRICE:float,
>L_DISCOUNT:float,
>L_SHIPDATE:STRING>
> )
> row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> stored as orc tblproperties ("orc.compress"="NONE");
> {code}
> When I select a field of the struct using
> {code:sql}
> select cg1.l_comment from lineitem_orc_cg limit 1;
> {code}
> I got 
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
>   at 
> org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
>   at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
>   at 
> org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
>   ... 22 more
> {code}





[jira] [Created] (HIVE-6631) NPE when select a field of a struct from a table stored by ORC

2014-03-12 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6631:
--

 Summary: NPE when select a field of a struct from a table stored 
by ORC
 Key: HIVE-6631
 URL: https://issues.apache.org/jira/browse/HIVE-6631
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai


I have a table lineitem_orc_cg:
{code:sql}
create table lineitem_orc_cg
(
CG1 STRUCT<L_SUPPKEY:INT,
           L_COMMITDATE:STRING,
           L_RECEIPTDATE:STRING,
           L_SHIPINSTRUCT:STRING,
           L_SHIPMODE:STRING,
           L_COMMENT:STRING,
           L_TAX:float,
           L_RETURNFLAG:STRING,
           L_LINESTATUS:STRING,
           L_LINENUMBER:INT,
           L_ORDERKEY:INT>,
CG2 STRUCT<L_EXTENDEDPRICE:float,
           L_DISCOUNT:float,
           L_SHIPDATE:STRING>
)
row format serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as orc tblproperties ("orc.compress"="NONE");
{code}
When I select a field of the struct using
{code:sql}
select cg1.l_comment from lineitem_orc_cg limit 1;
{code}

I got 
{code}
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.hive.ql.exec.ExprNodeFieldEvaluator.initialize(ExprNodeFieldEvaluator.java:61)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:928)
at 
org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:954)
at 
org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:459)
at 
org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:415)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:409)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:375)
at 
org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:133)
... 22 more
{code}
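For readers skimming the trace: the NullPointerException surfaces in ExprNodeFieldEvaluator.initialize, presumably when the struct-field lookup against the ORC object inspector comes back null and is then dereferenced. Hive's evaluator is Java; the following is only a minimal Python model of that failure mode (all names are hypothetical, not Hive's API), contrasted with a guarded lookup:

```python
def resolve_field_buggy(struct_fields, field_name):
    # Hypothetical model of the failure: a miss in the field lookup
    # returns None, which is then dereferenced -- the Python analogue of
    # the NullPointerException in ExprNodeFieldEvaluator.initialize.
    field = struct_fields.get(field_name)
    return field["inspector"]  # raises TypeError when field is None

def resolve_field_guarded(struct_fields, field_name):
    # Normalize the case and fail with a descriptive error instead.
    field = struct_fields.get(field_name.lower())
    if field is None:
        raise LookupError("no field %r in struct" % field_name)
    return field["inspector"]

# The query wrote cg1.l_comment; a case mismatch between the query and
# the stored schema is one hypothetical way the lookup could miss.
cg1 = {"l_comment": {"inspector": "string"}}
print(resolve_field_guarded(cg1, "L_COMMENT"))  # string
```

The guarded variant fails with a descriptive error instead of an NPE, which is roughly the behavior a fix would aim for.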





[jira] [Commented] (HIVE-6163) OrcOutputFormat#getRecordWriter creates OrcRecordWriter with relative path

2014-02-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13892067#comment-13892067
 ] 

Yin Huai commented on HIVE-6163:


I think we should also check all other OutputFormats and make sure they all 
have consistent behavior when creating file paths. 

> OrcOutputFormat#getRecordWriter creates OrcRecordWriter with relative path
> --
>
> Key: HIVE-6163
> URL: https://issues.apache.org/jira/browse/HIVE-6163
> Project: Hive
>  Issue Type: Bug
>  Components: File Formats
>Affects Versions: 0.12.0
>Reporter: Branky Shao
>
> Hi,
> OrcOutputFormat#getRecordWriter actually creates the OrcRecordWriter 
> instance using a file with a relative path:
> return new OrcRecordWriter(new Path(name), OrcFile.writerOptions(conf));
> https://github.com/apache/hive/blob/7263b3bb1632b1a7c6ef5d2363e58020e1fdd756/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcOutputFormat.java#L114
> The fix should be very simple: as in RCFileOutputFormat#getRecordWriter, 
> prepend the work output path as the parent:
> Path outputPath = getWorkOutputPath(job);
> Path file = new Path(outputPath, name);
> https://github.com/apache/hive/blob/d85eea2dc5decbf23e8f4010b32f1817cf057ea0/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java#L78
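The proposed fix can be sketched in plain Python path handling (not Hive's Java API; the file name and directory below are hypothetical illustration values):

```python
import posixpath

def record_writer_path_buggy(name):
    # Mirrors the report: the bare name is used as-is, so the ORC file
    # resolves relative to whatever the process's working directory is.
    return name

def record_writer_path_fixed(work_output_dir, name):
    # Mirrors the RCFileOutputFormat approach: anchor the file name
    # under the task's work output directory.
    return posixpath.join(work_output_dir, name)

print(record_writer_path_buggy("part-00000"))                    # relative
print(record_writer_path_fixed("/tmp/hive-work", "part-00000"))  # anchored
```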





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873656#comment-13873656
 ] 

Yin Huai commented on HIVE-5945:


Committed to trunk. Thanks, Navis!

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
> HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected 4 map-only jobs for this 
> query. However, I got 1 map-only job (joining store_sales and date_dim) and 
> 3 MR jobs (for the reduce-side joins).
> So, I checked the conditional task that determines the plan of the join 
> involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query plus the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (which is around 
> 45GB in my test) will also be counted.  
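To make the size-summing issue concrete, here is a minimal Python sketch (not Hive's code; the sizes and the participant-filtering helper are illustrative) of summing small-table sizes with and without restricting to the aliases that actually feed the conditional task's join:

```python
def small_table_total_buggy(alias_to_size, big_table_alias):
    # Reported behavior: every alias known to the resolver is summed,
    # including tables consumed by other stages of the query.
    return sum(size for alias, size in alias_to_size.items()
               if alias != big_table_alias)

def small_table_total_fixed(alias_to_size, participants, big_table_alias):
    # Intended behavior: only aliases that participate in this
    # conditional task's join are considered.
    return sum(alias_to_size[a] for a in participants
               if a != big_table_alias)

GB, MB = 2 ** 30, 2 ** 20
sizes = {"store_sales": 45 * GB, "date_dim": 10 * MB,
         "item": 5 * MB, "intermediate": 1 * MB}  # illustrative sizes

# Deciding the join of the intermediate result with item: store_sales
# should not count toward the small-table total.
print(small_table_total_buggy(sizes, "intermediate"))  # includes store_sales
print(small_table_total_fixed(sizes, {"intermediate", "item"}, "intermediate"))
```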





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Release Note:   (was: Committed to trunk. Thanks, Navis!)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
> HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected 4 map-only jobs for this 
> query. However, I got 1 map-only job (joining store_sales and date_dim) and 
> 3 MR jobs (for the reduce-side joins).
> So, I checked the conditional task that determines the plan of the join 
> involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query plus the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (which is around 
> 45GB in my test) will also be counted.  





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

   Resolution: Fixed
Fix Version/s: 0.13.0
 Release Note: Committed to trunk. Thanks, Navis!
   Status: Resolved  (was: Patch Available)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Fix For: 0.13.0
>
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
> HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected 4 map-only jobs for this 
> query. However, I got 1 map-only job (joining store_sales and date_dim) and 
> 3 MR jobs (for the reduce-side joins).
> So, I checked the conditional task that determines the plan of the join 
> involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query plus the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (which is around 
> 45GB in my test) will also be counted.  





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872582#comment-13872582
 ] 

Yin Huai commented on HIVE-5945:


+1

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt, 
> HIVE-5945.6.patch.txt, HIVE-5945.7.patch.txt, HIVE-5945.8.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected 4 map-only jobs for this 
> query. However, I got 1 map-only job (joining store_sales and date_dim) and 
> 3 MR jobs (for the reduce-side joins).
> So, I checked the conditional task that determines the plan of the join 
> involving item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query plus the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (which is around 
> 45GB in my test) will also be counted.  





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2014-01-06 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863047#comment-13863047
 ] 

Yin Huai commented on HIVE-5945:


Thanks Navis for the change. date_dim is a native table. Actually, I think the 
problem is 
org.apache.hadoop.hive.ql.plan.ConditionalResolverCommonJoin.getParticipants. 
It uses ctx.getAliasToTask() to get all aliases. However, these aliases do not 
include the aliases appearing in the MapLocalWork (the small tables). So, for a 
query like 
{code}
set hive.auto.convert.join.noconditionaltask=false;
select
   i_item_id
FROM store_sales
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
limit 10;
{code}

The plan is 
{code}
STAGE DEPENDENCIES:
  Stage-5 is a root stage , consists of Stage-6, Stage-1
  Stage-6 has a backup stage: Stage-1
  Stage-3 depends on stages: Stage-6
  Stage-1
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-5
Conditional Operator

  Stage: Stage-6
Map Reduce Local Work
  Alias -> Map Local Tables:
item 
  Fetch Operator
limit: -1
  Alias -> Map Local Operator Tree:
item 
  TableScan
alias: item
HashTable Sink Operator
  condition expressions:
0 
1 {i_item_id}
  handleSkewJoin: false
  keys:
0 [Column[ss_item_sk]]
1 [Column[i_item_sk]]
  Position of Big Table: 0

  Stage: Stage-3
Map Reduce
  Alias -> Map Operator Tree:
store_sales 
  TableScan
alias: store_sales
Map Join Operator
  condition map:
   Inner Join 0 to 1
  condition expressions:
0 
1 {i_item_id}
  handleSkewJoin: false
  keys:
0 [Column[ss_item_sk]]
1 [Column[i_item_sk]]
  outputColumnNames: _col26
  Position of Big Table: 0
  Select Operator
expressions:
  expr: _col26
  type: string
outputColumnNames: _col0
Limit
  File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  Local Work:
Map Reduce Local Work

  Stage: Stage-1
Map Reduce
  Alias -> Map Operator Tree:
item 
  TableScan
alias: item
Reduce Output Operator
  key expressions:
expr: i_item_sk
type: int
  sort order: +
  Map-reduce partition columns:
expr: i_item_sk
type: int
  tag: 1
  value expressions:
expr: i_item_id
type: string
store_sales 
  TableScan
alias: store_sales
Reduce Output Operator
  key expressions:
expr: ss_item_sk
type: int
  sort order: +
  Map-reduce partition columns:
expr: ss_item_sk
type: int
  tag: 0
  Reduce Operator Tree:
Join Operator
  condition map:
   Inner Join 0 to 1
  condition expressions:
0 
1 {VALUE._col1}
  handleSkewJoin: false
  outputColumnNames: _col26
  Select Operator
expressions:
  expr: _col26
  type: string
outputColumnNames: _col0
Limit
  File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
Fetch Operator
  limit: 10
{code}
The alias of "item" will not be in the set returned by getParticipants. Thus, 
the input of sumOfExcept will be 
{code}
aliasToSize: {store_sales=388445409, item=5051899}
aliases: [store_sales]
except: store_sales
{code}
and then we get "0" for the size of small tables.
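The sumOfExcept behavior described above can be modeled in a few lines of Python (a sketch, not Hive's implementation), reproducing the zero result with the sizes from the example:

```python
def sum_of_except(alias_to_size, aliases, except_alias):
    # Minimal model of the described sumOfExcept: total size of
    # `aliases` minus the excepted (big-table) one.
    return sum(alias_to_size[a] for a in aliases if a != except_alias)

alias_to_size = {"store_sales": 388445409, "item": 5051899}

# Participants as currently computed: the MapLocalWork alias `item` is
# missing, so every small table is invisible and the sum is 0.
print(sum_of_except(alias_to_size, ["store_sales"], "store_sales"))  # 0

# Participants once local-work aliases are included, as proposed:
print(sum_of_except(alias_to_size, ["store_sales", "item"], "store_sales"))  # 5051899
```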

I think in getParticipants, we can check the type of each task and, if it is a 
MapRedTask, use getWork().getMapWork().getMapLocalWork() to get the local 
work. Then, we can get the aliases of those small tables through aliasToWork.

Another minor comment. Can you add a

[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Patch Available  (was: Open)

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt
>
>
> I was trying to use a CTAS query to create a table stored as ORC with 
> orc.compress set to SNAPPY. However, the table was still compressed with 
> ZLIB (although the output of DESCRIBE still shows that this table is 
> compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
> uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
> calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
> see user-provided table properties being assigned to the returned TableDesc 
> (CreateTableDesc.getTblProps is not called in this method).
> btw, I only checked the code of 0.12 and trunk.
> Two examples:
> * Snappy compression
> {code}
> create table web_sales_wrong_orc_snappy
> stored as orc tblproperties ("orc.compress"="SNAPPY")
> as select * from web_sales;
> {code}
> {code}
> describe formatted web_sales_wrong_orc_snappy;
> 
> Location: 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   true
>   numFiles1   
>   numRows 719384  
>   orc.compressSNAPPY  
>   rawDataSize 97815412
>   totalSize   40625243
>   transient_lastDdlTime   1387566015   
>    
> {code}
> {code}
> bin/hive --orcfiledump 
> /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
> Rows: 719384
> Compression: ZLIB
> Compression size: 262144
> ...
> {code}
> * No compression
> {code}
> create table web_sales_wrong_orc_none
> stored as orc tblproperties ("orc.compress"="NONE")
> as select * from web_sales;
> {code}
> {code}
> describe formatted web_sales_wrong_orc_none;
> 
> Location: 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   true
>   numFiles1   
>   numRows 719384  
>   orc.compressNONE
>   rawDataSize 97815412
>   totalSize   40625243
>   transient_lastDdlTime   1387566064   
>    
> {code}
> {code}
> bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
> Rows: 719384
> Compression: ZLIB
> Compression size: 262144
> ...
> {code}
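The gap described above (properties visible to DESCRIBE but not to the file sink) can be sketched in Python; the dict shapes and function names are hypothetical stand-ins, not Hive's classes:

```python
def get_table_desc_buggy(create_table_desc):
    # Models the reported gap: the serde is propagated, but the
    # user-supplied tblproperties never reach the descriptor that the
    # file sink consults, so the writer falls back to defaults (ZLIB).
    return {"serde": create_table_desc["serde"], "properties": {}}

def get_table_desc_fixed(create_table_desc):
    desc = get_table_desc_buggy(create_table_desc)
    # Sketch of the fix: fold the user's tblproperties into the
    # descriptor before it is handed to the writer.
    desc["properties"].update(create_table_desc["tblprops"])
    return desc

ctd = {"serde": "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
       "tblprops": {"orc.compress": "SNAPPY"}}
print(get_table_desc_buggy(ctd)["properties"].get("orc.compress"))  # None
print(get_table_desc_fixed(ctd)["properties"]["orc.compress"])      # SNAPPY
```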





[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Open  (was: Patch Available)

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt
>
>
> I was trying to use a CTAS query to create a table stored as ORC with 
> orc.compress set to SNAPPY. However, the table was still compressed with 
> ZLIB (although the output of DESCRIBE still shows that this table is 
> compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
> uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
> calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
> see user-provided table properties being assigned to the returned TableDesc 
> (CreateTableDesc.getTblProps is not called in this method).
> btw, I only checked the code of 0.12 and trunk.
> Two examples:
> * Snappy compression
> {code}
> create table web_sales_wrong_orc_snappy
> stored as orc tblproperties ("orc.compress"="SNAPPY")
> as select * from web_sales;
> {code}
> {code}
> describe formatted web_sales_wrong_orc_snappy;
> 
> Location: 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   true
>   numFiles1   
>   numRows 719384  
>   orc.compressSNAPPY  
>   rawDataSize 97815412
>   totalSize   40625243
>   transient_lastDdlTime   1387566015   
>    
> {code}
> {code}
> bin/hive --orcfiledump 
> /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
> Rows: 719384
> Compression: ZLIB
> Compression size: 262144
> ...
> {code}
> * No compression
> {code}
> create table web_sales_wrong_orc_none
> stored as orc tblproperties ("orc.compress"="NONE")
> as select * from web_sales;
> {code}
> {code}
> describe formatted web_sales_wrong_orc_none;
> 
> Location: 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   true
>   numFiles1   
>   numRows 719384  
>   orc.compressNONE
>   rawDataSize 97815412
>   totalSize   40625243
>   transient_lastDdlTime   1387566064   
>    
> {code}
> {code}
> bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
> Rows: 719384
> Compression: ZLIB
> Compression size: 262144
> ...
> {code}





[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Patch Available  (was: Open)

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt
>
>
> I was trying to use a CTAS query to create a table stored as ORC with 
> orc.compress set to SNAPPY. However, the table was still compressed with 
> ZLIB (although the output of DESCRIBE still shows that this table is 
> compressed with SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan 
> uses CreateTableDesc to generate the TableDesc for the FileSinkDesc by 
> calling PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not 
> see user-provided table properties being assigned to the returned TableDesc 
> (CreateTableDesc.getTblProps is not called in this method).
> btw, I only checked the code of 0.12 and trunk.
> Two examples:
> * Snappy compression
> {code}
> create table web_sales_wrong_orc_snappy
> stored as orc tblproperties ("orc.compress"="SNAPPY")
> as select * from web_sales;
> {code}
> {code}
> describe formatted web_sales_wrong_orc_snappy;
> 
> Location: 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   true
>   numFiles1   
>   numRows 719384  
>   orc.compressSNAPPY  
>   rawDataSize 97815412
>   totalSize   40625243
>   transient_lastDdlTime   1387566015   
>    
> {code}
> {code}
> bin/hive --orcfiledump 
> /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
> Rows: 719384
> Compression: ZLIB
> Compression size: 262144
> ...
> {code}
> * No compression
> {code}
> create table web_sales_wrong_orc_none
> stored as orc tblproperties ("orc.compress"="NONE")
> as select * from web_sales;
> {code}
> {code}
> describe formatted web_sales_wrong_orc_none;
> 
> Location: 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
> Table Type:   MANAGED_TABLE
> Table Parameters:  
>   COLUMN_STATS_ACCURATE   true
>   numFiles1   
>   numRows 719384  
>   orc.compressNONE
>   rawDataSize 97815412
>   totalSize   40625243
>   transient_lastDdlTime   1387566064   
>    
> {code}
> {code}
> bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
> Rows: 719384
> Compression: ZLIB
> Compression size: 262144
> ...
> {code}
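The mismatch above (DESCRIBE reports SNAPPY, orcfiledump reports ZLIB) comes down to a descriptor built from serde defaults without folding in the user's tblproperties. A minimal sketch in Python, not Hive's actual Java code; the names `get_table_desc`, `serde_defaults`, and `tbl_props` are hypothetical stand-ins for the PlanUtils.getTableDesc flow:

```python
def get_table_desc(serde_defaults, tbl_props=None):
    """Build the property map for a table descriptor.

    Applying only the defaults mirrors the bug: the metastore remembers
    the user's tblproperties, but the file-sink writer never sees them.
    The update() below is the missing merge step.
    """
    props = dict(serde_defaults)   # e.g. {"orc.compress": "ZLIB"}
    if tbl_props:                  # fold in user-provided tblproperties
        props.update(tbl_props)
    return props

# Without the merge, the writer falls back to the default codec:
assert get_table_desc({"orc.compress": "ZLIB"}) == {"orc.compress": "ZLIB"}
# With the merge, the user's choice wins:
assert get_table_desc({"orc.compress": "ZLIB"},
                      {"orc.compress": "SNAPPY"}) == {"orc.compress": "SNAPPY"}
```

This also explains why DESCRIBE looked correct: it reads the metastore's copy of the properties, while the ORC writer only consults the TableDesc handed to the FileSinkDesc.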



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Attachment: HIVE-6083.2.patch.txt

Let me trigger HiveQA again.

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt, HIVE-6083.2.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Open  (was: Patch Available)

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13859114#comment-13859114
 ] 

Yin Huai commented on HIVE-5945:


Thanks Navis :) I played with your patch and found an issue, which I commented 
on at the review board. I am also attaching more info here. For the query in 
the description, we can have 4 map-joins. There will be 3 different 
intermediate tables, all called $INTNAME. The current patch does not update the 
size of $INTNAME.

Here are logs.
{code}
13/12/30 16:48:25 INFO ql.Driver: MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 12.76 sec   HDFS Read: 388445624 HDFS Write: 
20815654 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 0: Map: 1   Cumulative CPU: 12.76 sec   
HDFS Read: 388445624 HDFS Write: 20815654 SUCCESS
Job 1: Map: 1   Cumulative CPU: 9.18 sec   HDFS Read: 20816111 HDFS Write: 
28593993 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 1: Map: 1   Cumulative CPU: 9.18 sec   
HDFS Read: 20816111 HDFS Write: 28593993 SUCCESS
Job 2: Map: 1   Cumulative CPU: 17.38 sec   HDFS Read: 80660331 HDFS Write: 
378063 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 2: Map: 1   Cumulative CPU: 17.38 sec   
HDFS Read: 80660331 HDFS Write: 378063 SUCCESS
Job 3: Map: 1   Cumulative CPU: 2.06 sec   HDFS Read: 378520 HDFS Write: 96 
SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 3: Map: 1   Cumulative CPU: 2.06 sec   
HDFS Read: 378520 HDFS Write: 96 SUCCESS
Job 4: Map: 1  Reduce: 1   Cumulative CPU: 2.45 sec   HDFS Read: 553 HDFS 
Write: 96 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 4: Map: 1  Reduce: 1   Cumulative CPU: 
2.45 sec   HDFS Read: 553 HDFS Write: 96 SUCCESS
Job 5: Map: 1  Reduce: 1   Cumulative CPU: 2.33 sec   HDFS Read: 553 HDFS 
Write: 0 SUCCESS
13/12/30 16:48:25 INFO ql.Driver: Job 5: Map: 1  Reduce: 1   Cumulative CPU: 
2.33 sec   HDFS Read: 553 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 46 seconds 160 msec
{code}

{code}
Map-join1:
plan.ConditionalResolverCommonJoin: Driver alias is store_sales with size 
388445409 (total size of others : 0, threshold : 2500)
Stage-28 is selected by condition resolver.

Map-join2:
plan.ConditionalResolverCommonJoin: Driver alias is $INTNAME with size 20815654 
(total size of others : 5051899, threshold : 2500)
Stage-26 is selected by condition resolver.

Map-join3:
 plan.ConditionalResolverCommonJoin: Driver alias is customer_demographics with 
size 80660096 (total size of others : 20815654, threshold : 2500)
Stage-24 is filtered out by condition resolver.

Map-join4:
plan.ConditionalResolverCommonJoin: Driver alias is $INTNAME with size 20815654 
(total size of others : 3155, threshold : 2500)
Stage-22 is selected by condition resolver.
{code}


btw, a minor question: why does the log of map-join 1 show the size of others 
as 0?
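The decision the resolver makes in each log line above can be sketched as a simple size check: with one alias chosen as the streamed (big) table, the remaining small tables must together fit under a threshold. This is a hypothetical Python stand-in for resolveMapJoinTask, not the actual Hive code; the point of the bug is that aliasToFileSizeMap feeds this check a stale size for $INTNAME:

```python
def resolve_map_join(alias_sizes, driver, threshold):
    """Return (safe, others): safe is True when every table other than the
    driver alias fits under `threshold` when summed, i.e. a map-join with
    `driver` as the big table can be selected."""
    others = sum(size for alias, size in alias_sizes.items()
                 if alias != driver)
    return others <= threshold, others

# Purely illustrative sizes: 30 + 20 = 50 <= 60, map-join selected.
assert resolve_map_join({"a": 100, "b": 30, "c": 20}, "a", 60) == (True, 50)
# 30 + 40 = 70 > 60, map-join filtered out in favor of a reduce join.
assert resolve_map_join({"a": 100, "b": 30, "c": 40}, "a", 60) == (False, 70)
```

If the entry for $INTNAME in `alias_sizes` is never refreshed between joins, later stages compare against the wrong total, which matches the identical 20815654 figure reported for $INTNAME in map-joins 2 and 4.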

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt, HIVE-5945.4.patch.txt, HIVE-5945.5.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for reduce joins).
> So, I checked the conditional task determining the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the size of all small tables, the siz

[jira] [Commented] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854449#comment-13854449
 ] 

Yin Huai commented on HIVE-6083:


With the .1 patch ...
* Snappy compression
{code}
create table web_sales_correct_orc_snappy
stored as orc tblproperties ("orc.compress"="SNAPPY")
as select * from web_sales;
{code}
{code}
describe formatted web_sales_correct_orc_snappy;

Location:   
hdfs://localhost:54310/user/hive/warehouse/web_sales_correct_orc_snappy  
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   true
numFiles                1
numRows                 719384
orc.compress            SNAPPY
rawDataSize             97815412
totalSize               51042245
transient_lastDdlTime   1387566737
   
{code}
{code}
bin/hive --orcfiledump 
/user/hive/warehouse/web_sales_correct_orc_snappy/00_0
Rows: 719384
Compression: SNAPPY
Compression size: 262144
...
{code}
* No compression
{code}
create table web_sales_correct_orc_none
stored as orc tblproperties ("orc.compress"="NONE")
as select * from web_sales;
{code}
{code}
describe formatted web_sales_correct_orc_none;

Location:   
hdfs://localhost:54310/user/hive/warehouse/web_sales_correct_orc_none
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   true
numFiles                1
numRows                 719384
orc.compress            NONE
rawDataSize             97815412
totalSize               53968823
transient_lastDdlTime   1387566788
   
{code}
{code}
bin/hive --orcfiledump /user/hive/warehouse/web_sales_correct_orc_none/00_0
Rows: 719384
Compression: NONE
...
{code}

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>

[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Description: 
I was trying to use a CTAS query to create a table stored as ORC with 
orc.compress set to SNAPPY. However, the table was still compressed with ZLIB 
(although DESCRIBE still shows the table as compressed with SNAPPY). For a 
CTAS query, SemanticAnalyzer.genFileSinkPlan uses CreateTableDesc to generate 
the TableDesc for the FileSinkDesc by calling PlanUtils.getTableDesc. However, 
PlanUtils.getTableDesc does not assign the user-provided table properties to 
the returned TableDesc (CreateTableDesc.getTblProps is never called in this 
method).

btw, I only checked the code of 0.12 and trunk.

Two examples:
* Snappy compression
{code}
create table web_sales_wrong_orc_snappy
stored as orc tblproperties ("orc.compress"="SNAPPY")
as select * from web_sales;
{code}
{code}
describe formatted web_sales_wrong_orc_snappy;

Location:   
hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_snappy
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   true
numFiles                1
numRows                 719384
orc.compress            SNAPPY
rawDataSize             97815412
totalSize               40625243
transient_lastDdlTime   1387566015
   
{code}
{code}
bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_snappy/00_0
Rows: 719384
Compression: ZLIB
Compression size: 262144
...
{code}
* No compression
{code}
create table web_sales_wrong_orc_none
stored as orc tblproperties ("orc.compress"="NONE")
as select * from web_sales;
{code}
{code}
describe formatted web_sales_wrong_orc_none;

Location:   
hdfs://localhost:54310/user/hive/warehouse/web_sales_wrong_orc_none  
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE   true
numFiles                1
numRows                 719384
orc.compress            NONE
rawDataSize             97815412
totalSize               40625243
transient_lastDdlTime   1387566064
   
{code}
{code}
bin/hive --orcfiledump /user/hive/warehouse/web_sales_wrong_orc_none/00_0
Rows: 719384
Compression: ZLIB
Compression size: 262144
...
{code}

  was:
I was trying to use a CTAS query to create a table stored with ORC and 
orc.compress was set to SNAPPY. However, the table was still compressed as ZLIB 
(although the result of DESCRIBE still shows that this table is compressed by 
SNAPPY). For a CTAS query, SemanticAnalyzer.genFileSinkPlan uses 
CreateTableDesc to generate the TableDesc for the FileSinkDesc by calling 
PlanUtils.getTableDesc. However, in PlanUtils.getTableDesc, I do not see user 
provided table properties are assigned to the returned TableDesc 
(CreateTableDesc.getTblProps was not called in this method ).  

btw, I only checked the code of 0.12 and trunk.


> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>

[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS query

2013-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Summary: User provided table properties are not assigned to the TableDesc 
of the FileSinkDesc in a CTAS query  (was: User provided table properties are 
not assigned to the TableDesc of the FileSinkDesc in a CTAS)

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS query
> 
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Assigned] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS

2013-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned HIVE-6083:
--

Assignee: Yin Huai

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS
> --
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS

2013-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Status: Patch Available  (was: Open)

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS
> --
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS

2013-12-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6083:
---

Attachment: HIVE-6083.1.patch.txt

An initial patch. Let me trigger Hive QA to see if any test cases are 
affected. I am also thinking about how to test it...

> User provided table properties are not assigned to the TableDesc of the 
> FileSinkDesc in a CTAS
> --
>
> Key: HIVE-6083
> URL: https://issues.apache.org/jira/browse/HIVE-6083
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
> Attachments: HIVE-6083.1.patch.txt
>
>



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Created] (HIVE-6083) User provided table properties are not assigned to the TableDesc of the FileSinkDesc in a CTAS

2013-12-20 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6083:
--

 Summary: User provided table properties are not assigned to the 
TableDesc of the FileSinkDesc in a CTAS
 Key: HIVE-6083
 URL: https://issues.apache.org/jira/browse/HIVE-6083
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai


I was trying to use a CTAS query to create a table stored as ORC with 
orc.compress set to SNAPPY. However, the table was still compressed with ZLIB 
(although DESCRIBE still shows the table as compressed with SNAPPY). For a 
CTAS query, SemanticAnalyzer.genFileSinkPlan uses CreateTableDesc to generate 
the TableDesc for the FileSinkDesc by calling PlanUtils.getTableDesc. However, 
PlanUtils.getTableDesc does not assign the user-provided table properties to 
the returned TableDesc (CreateTableDesc.getTblProps is never called in this 
method).

btw, I only checked the code of 0.12 and trunk.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-12-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854022#comment-13854022
 ] 

Yin Huai commented on HIVE-5891:


Thanks [~sunrui]. LGTM. I left two minor comments on the review board.

> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --
>
> Key: HIVE-5891
> URL: https://issues.apache.org/jira/browse/HIVE-5891
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Attachments: HIVE-5891.1.patch, HIVE-5891.2.patch
>
>
> Use the following test case with HIVE 0.12:
> {code:sql}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
>   union all
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
> ) x;
> {code}
> We will get a NullPointerException from Union Operator:
> {noformat}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>   ... 4 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>   at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>   ... 5 more
> {noformat}
>   
> The root cause is in 
> CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
> {noformat}
>   +--+  +--+
>   | MapJoin task |  | MapJoin task |
>   +--+  +--+
>  \ /
>   \   /
>  +--+
>  |  Union task  |
>  +--+
> {noformat} 
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
> child: Union task. The two MapJoin tasks have the same alias name for their 
> big tables: $INTNAME, which is the name of the temporary table of a join 
> stream. The aliasToWork map uses the alias as its key, so eventually only 
> the MapJoin operator tree of one MapJoin task is saved into the aliasToWork 
> map of the Union task, while the MapJoin operator tree of the other MapJoin 
> task is lost. As a result, the Union operator is never initialized, because 
> not all of its parents get initialized (the Union operator believes it has 
> two parents, but it actually has only one, since the other is lost).
> This issue does not exist in Hive 0.11 and is thus a regression in Hive 
> 0.12.
> The proposed solution is to use the query ID as pref
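The collision described above is easy to reproduce with any map keyed by alias. A schematic sketch, assuming Python; `alias_to_work` and the operator-tree strings are hypothetical stand-ins for Hive's aliasToWork map and operator trees:

```python
alias_to_work = {}

# Two MapJoin tasks are merged into the Union task; each registers its
# big-table operator tree under the same join-stream alias, "$INTNAME",
# so the second registration silently overwrites the first.
alias_to_work["$INTNAME"] = "operator tree of MapJoin task 1"
alias_to_work["$INTNAME"] = "operator tree of MapJoin task 2"

# Only one parent tree survives, so the Union operator never sees its
# second parent and is never initialized (hence the NullPointerException).
assert len(alias_to_work) == 1

# Prefixing the alias with a unique id (the kind of fix proposed here)
# keeps both operator trees distinct in the map.
fixed = {f"task{i}:$INTNAME": tree
         for i, tree in enumerate(["tree 1", "tree 2"], start=1)}
assert len(fixed) == 2
```

The fix does not change how the operator trees are built, only the key under which each tree is registered, so merged tasks with the same intermediate-table alias no longer clobber each other.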

[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-12-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853583#comment-13853583
 ] 

Yin Huai commented on HIVE-5891:


I see. Yes, it seems getMapJoinContext and getSmbMapJoinContext can also have 
QBJoinTrees. I think it would be good to show meaningful aliases for those 
intermediate results, so users can know where an intermediate result comes 
from. Since it is not easy to get the correct QB.id, I prefer to use 
QBJoinTree.id for now. Once this bug has been fixed, we can work on a 
follow-up JIRA to get rid of $INTNAME. Also, I guess we do not have a unit 
test that covers this bug. Can you add a test query to multiMapJoin2.q and 
comment on why we need this test? Thanks.

> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --
>
> Key: HIVE-5891
> URL: https://issues.apache.org/jira/browse/HIVE-5891
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Attachments: HIVE-5891.1.patch
>
>
> Use the following test case with HIVE 0.12:
> {code:sql}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
>   union all
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
> ) x;
> {code}
> We will get a NullPointerException from Union Operator:
> {noformat}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>   ... 4 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>   at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>   ... 5 more
> {noformat}
>   
> The root cause is in 
> CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
> {noformat}
>   +--+  +--+
>   | MapJoin task |  | MapJoin task |
>   +--+  +--+
>  \ /
>   \   /
>  +--+
>  |  Union task  |
>  +--+
> {noformat} 
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
> child: Union task. The two MapJoin tasks have the same alias name for their 
> big tables: $INTNAME, which is the name of the temporary table of a join 
> stream. The aliasToWork map uses alias as key, so eventually only the MapJoin 
> operator tree of one MapJoin task 

[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Status: Open  (was: Patch Available)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0, 0.11.0, 0.10.0, 0.9.0, 0.8.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851756#comment-13851756
 ] 

Yin Huai commented on HIVE-5945:


Two minor comments in the review board.

Two additional comments.
When we find 
{code}
bigTableFileAlias != null
{code}
can we also log sumOfOthers and the small-table size threshold? That way, the 
log entry will show the size of the big table, the total size of the other 
small tables, and the small-table size threshold.
Also, can you add a unit test?

Thanks :)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt, 
> HIVE-5945.3.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.





[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851260#comment-13851260
 ] 

Yin Huai commented on HIVE-5945:


Thanks [~navis] :) I left a few comments on the review board. I think the 
conditional task in the original trunk is not well tested. With a .q test file, 
we cannot test whether a conditional task picks the right execution plan, 
because the output of a .q file only shows the plan and the query result. I 
think it is necessary to add a JUnit test to unit test the decision made by 
resolveMapJoinTask. Also, let's add some logging in resolveMapJoinTask. Right 
now, we only have "xx is filtered out by condition resolver." and "xx is 
selected by condition resolver." in ConditionalTask. From these two logs, we 
cannot know why an execution plan was selected. In resolveMapJoinTask, we can 
first log the sizes of the tables that will be used in the next task and then 
log why a path is selected.

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Status: Open  (was: Patch Available)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0, 0.11.0, 0.10.0, 0.9.0, 0.8.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt, HIVE-5945.2.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.





[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-12-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13851007#comment-13851007
 ] 

Yin Huai commented on HIVE-5891:


[~sunrui] Sorry for getting back to you late.

I just took a look at QB. It seems it uses aliasToSubq to store the mapping 
from aliases to subquery expressions (QBExpr). A QBExpr in turn stores a QB 
which represents the subquery. In this recursive way, the QBs for all levels 
of the query are stored, so parseCtx.getQB() only gets the main query block, 
and its id is null. I am not sure if we can get the right QB (the QB for a 
subquery) from GenMapRedUtils.splitTasks... Can you take a quick look to see 
if it is easy to get the correct QB? If so, we can use the id of a QB to 
replace INTNAME. If not, let's use joinTree.getId for those JoinOperators. It 
seems we do not need to take special care of DemuxOperator. Can you create a 
review request for your patch? I can leave comments on the review board.

Also, since QBJoinTree.getJoinStreamDesc is not used, let's delete it.
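The recursive nesting described here can be sketched with simplified placeholder classes. The QB and QBExpr names match Hive's, but the fields shown are a stripped-down assumption for illustration, not the real API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified mirrors of Hive's QB/QBExpr classes, only to
// illustrate how subquery QBs nest recursively through aliasToSubq.
class QBExpr {
    QB qb;
    QBExpr(QB qb) { this.qb = qb; }
}

class QB {
    String id;
    Map<String, QBExpr> aliasToSubq = new HashMap<>();
    QB(String id) { this.id = id; }
}

public class CollectQBIds {
    // Walk the recursive structure and collect every QB id, starting from
    // the main query block (whose id is null).
    static void collect(QB qb, List<String> ids) {
        ids.add(qb.id);
        for (QBExpr expr : qb.aliasToSubq.values()) {
            collect(expr.qb, ids);
        }
    }

    public static void main(String[] args) {
        QB root = new QB(null);      // main query block: id is null
        QB sub = new QB("tmp");      // a subquery block
        root.aliasToSubq.put("tmp", new QBExpr(sub));
        List<String> ids = new ArrayList<>();
        collect(root, ids);
        System.out.println(ids.size()); // prints 2
    }
}
```

This is why getting "the" QB from the parse context is not enough: the subquery ids live deeper in the recursion.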

> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --
>
> Key: HIVE-5891
> URL: https://issues.apache.org/jira/browse/HIVE-5891
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Attachments: HIVE-5891.1.patch
>
>
> Use the following test case with HIVE 0.12:
> {quote}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
>   union all
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
> ) x;
> {quote}
> We will get a NullPointerException from Union Operator:
> {quote}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>   ... 4 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>   at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>   ... 5 more
> {quote}
>   
> The root cause is in 
> CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
>   +--+  +--+
>   | MapJoin task |  | MapJoin task |
>   +--+  +--+
>  \ /
>   \   /
>  +--+
>  |  Union task  |
>  +--+
>  
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their c

[jira] [Commented] (HIVE-6043) Document incompatible changes in Hive 0.12 and trunk

2013-12-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13850759#comment-13850759
 ] 

Yin Huai commented on HIVE-6043:


I added HIVE-4827, which removed the flag of "hive.optimize.mapjoin.mapreduce".

> Document incompatible changes in Hive 0.12 and trunk
> 
>
> Key: HIVE-6043
> URL: https://issues.apache.org/jira/browse/HIVE-6043
> Project: Hive
>  Issue Type: Task
>Reporter: Brock Noland
>Priority: Blocker
>
> We need to document incompatible changes. For example
> * HIVE-5372 changed object inspector hierarchy breaking most if not all 
> custom serdes
> * HIVE-1511/HIVE-5263 serializes ObjectInspectors with Kryo, breaking all 
> custom serdes (fixed by HIVE-5380)
> * Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used 
> by serdes
> * HIVE-5411 serializes expressions with Kryo which are used by custom serdes
> * HIVE-4827 removed the flag of "hive.optimize.mapjoin.mapreduce" (This flag 
> was introduced in Hive 0.11 by HIVE-3952).





[jira] [Updated] (HIVE-6043) Document incompatible changes in Hive 0.12 and trunk

2013-12-17 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-6043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-6043:
---

Description: 
We need to document incompatible changes. For example

* HIVE-5372 changed object inspector hierarchy breaking most if not all custom 
serdes
* HIVE-1511/HIVE-5263 serializes ObjectInspectors with Kryo, breaking all 
custom serdes (fixed by HIVE-5380)
* Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used by 
serdes
* HIVE-5411 serializes expressions with Kryo which are used by custom serdes
* HIVE-4827 removed the flag of "hive.optimize.mapjoin.mapreduce" (This flag 
was introduced in Hive 0.11 by HIVE-3952).


  was:
We need to document incompatible changes. For example

* HIVE-5372 changed object inspector hierarchy breaking most if not all custom 
serdes
* HIVE-1511/HIVE-5263 serializes ObjectInspectors with Kryo, breaking all 
custom serdes (fixed by HIVE-5380)
* Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used by 
serdes
* HIVE-5411 serializes expressions with Kryo which are used by custom serdes



> Document incompatible changes in Hive 0.12 and trunk
> 
>
> Key: HIVE-6043
> URL: https://issues.apache.org/jira/browse/HIVE-6043
> Project: Hive
>  Issue Type: Task
>Reporter: Brock Noland
>Priority: Blocker
>
> We need to document incompatible changes. For example
> * HIVE-5372 changed object inspector hierarchy breaking most if not all 
> custom serdes
> * HIVE-1511/HIVE-5263 serializes ObjectInspectors with Kryo, breaking all 
> custom serdes (fixed by HIVE-5380)
> * Hive 0.12 separates MapredWork into MapWork and ReduceWork, which are used 
> by serdes
> * HIVE-5411 serializes expressions with Kryo which are used by custom serdes
> * HIVE-4827 removed the flag of "hive.optimize.mapjoin.mapreduce" (This flag 
> was introduced in Hive 0.11 by HIVE-3952).





[jira] [Created] (HIVE-6007) Make the output of the reduce side plan optimized by the correlation optimizer more reader-friendly.

2013-12-11 Thread Yin Huai (JIRA)
Yin Huai created HIVE-6007:
--

 Summary: Make the output of the reduce side plan optimized by the 
correlation optimizer more reader-friendly.
 Key: HIVE-6007
 URL: https://issues.apache.org/jira/browse/HIVE-6007
 Project: Hive
  Issue Type: Sub-task
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor


Because a MuxOperator can have multiple parents, the output of the plan can 
show the sub-plan starting from this MuxOperator multiple times, which makes 
the reduce side plan confusing. An example is shown in 
https://mail-archives.apache.org/mod_mbox/hive-user/201312.mbox/%3CCAO0ZKSjniR0z%2BOt4KWouq236fKXo%3D5nE_Oih7A87e3HiuBsG9w%40mail.gmail.com%3E.






[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-10 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844968#comment-13844968
 ] 

Yin Huai commented on HIVE-5945:


Thanks [~navis] for taking this issue. Can you attach the link to the review 
board? Also, I saw 
{code}
+// todo: should nullify summary for non-native tables,
+// not to be selected as a mapjoin target
{code}
in your patch. Does a "non-native" table mean an intermediate table? If so, I 
think for a conditional task, it's better to keep the option to use the 
intermediate table as the small table.

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Navis
>Priority: Critical
> Attachments: HIVE-5945.1.patch.txt
>
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Summary: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums 
those tables which are not used in the child of this conditional task.  (was: 
ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes 
including those tables which are not used in the child of this conditional 
task.)

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask also sums those 
> tables which are not used in the child of this conditional task.
> -
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839008#comment-13839008
 ] 

Yin Huai commented on HIVE-5945:


It seems this bug was introduced by HIVE-2095. I am marking all affected versions.

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask, so I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for the reduce joins).
> So, I checked the conditional task that determines the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (around 45GB in my 
> test) is also counted.





[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Affects Version/s: 0.8.0
   0.9.0
   0.10.0
   0.11.0
   0.12.0

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.12.0, 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {\code}
> I turned off noconditionaltask. So, I expected that there will be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining strore_sales and 
> date_dim) and 3 MR job (for reduce joins.)
> So, I checked the conditional task determining the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (which is around 
> 45GB in my test) will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838991#comment-13838991
 ] 

Yin Huai commented on HIVE-5945:


aliasToFileSizeMap should have aliases used in the next stage instead of all 
tables. 
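As a rough illustration of that idea (a sketch with hypothetical names and plain collections, not the actual Hive patch), the resolver could restrict the sum of small-table sizes to the aliases that actually feed the child task:

```java
import java.util.Map;
import java.util.Set;

// Sketch only: in Hive, aliasToFileSizeMap is built inside
// ConditionalResolverCommonJoin.resolveMapJoinTask; here we just model the
// proposed filtering with plain collections.
public class ResolveMapJoinSketch {

    // Sum the sizes of the candidate small tables, skipping the candidate big
    // table and any alias that is not an input of the child task.
    static long sumSmallTableSizes(Map<String, Long> aliasToFileSize,
                                   Set<String> childTaskAliases,
                                   String bigTableAlias) {
        long total = 0L;
        for (Map.Entry<String, Long> e : aliasToFileSize.entrySet()) {
            if (e.getKey().equals(bigTableAlias)) continue;
            if (!childTaskAliases.contains(e.getKey())) continue;
            total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // store_sales feeds an earlier stage only; it must not be counted when
        // deciding whether the join with item can become a map-join.
        Map<String, Long> sizes = Map.of(
            "store_sales", 45_000_000_000L,
            "intermediate", 1_000_000L,  // result of store_sales JOIN date_dim
            "item", 5_000_000L);
        long sum = sumSmallTableSizes(sizes,
                                      Set.of("intermediate", "item"),
                                      "intermediate");
        System.out.println(sum); // prints 5000000
    }
}
```

With the filter in place, the 45GB store_sales table no longer inflates the small-table total for the second join.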

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for reduce joins).
> So, I checked the conditional task determining the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap 
> contains all input tables used in this query and the intermediate table 
> generated by joining store_sales and date_dim. So, when we sum the sizes of 
> all small tables, the size of store_sales (which is around 45GB in my test) 
> will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Description: 
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
aliasToFileSizeMap contains all input tables used in this query and the 
intermediate table generated by joining store_sales and date_dim. So, when we 
sum the sizes of all small tables, the size of store_sales (which is around 45GB 
in my test) will also be counted.

  was:
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
aliasToFileSizeMap 
contains all input tables used in this query and the intermediate table 
generated by joining store_sales and date_dim. So, when we sum the sizes of all 
small tables, the size of store_sales (which is around 45GB in my test) will 
also be counted.


> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for reduce joins).
> So, I checked the conditional task determining the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap contains all input tables used in this query and the 
> intermediate table generated by joining store_sales and date_dim. So, when we 
> sum the sizes of all small tables, the size of store_sales (which is around 
> 45GB in my test) will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Description: 
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
aliasToFileSizeMap 
contains all input tables used in this query and the intermediate table 
generated by joining store_sales and date_dim. So, when we sum the sizes of all 
small tables, the size of store_sales (which is around 45GB in my test) will 
also be counted.

  was:
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In 


> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for reduce joins).
> So, I checked the conditional task determining the plan of the join involving 
> item. In ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask, 
> aliasToFileSizeMap 
> contains all input tables used in this query and the intermediate table 
> generated by joining store_sales and date_dim. So, when we sum the sizes of 
> all small tables, the size of store_sales (which is around 45GB in my test) 
> will also be counted.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Description: 
Here is an example
{code}
select
   i_item_id,
   s_state,
   avg(ss_quantity) agg1,
   avg(ss_list_price) agg2,
   avg(ss_coupon_amt) agg3,
   avg(ss_sales_price) agg4
FROM store_sales
JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
customer_demographics.cd_demo_sk)
JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
where
   cd_gender = 'F' and
   cd_marital_status = 'U' and
   cd_education_status = 'Primary' and
   d_year = 2002 and
   s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
group by
   i_item_id,
   s_state
order by
   i_item_id,
   s_state
limit 100;
{code}
I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
jobs for this query. However, I got 1 Map-only job (joining store_sales and 
date_dim) and 3 MR jobs (for reduce joins).

So, I checked the conditional task determining the plan of the join involving 
item. In 

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>
> Here is an example
> {code}
> select
>i_item_id,
>s_state,
>avg(ss_quantity) agg1,
>avg(ss_list_price) agg2,
>avg(ss_coupon_amt) agg3,
>avg(ss_sales_price) agg4
> FROM store_sales
> JOIN date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
> JOIN item on (store_sales.ss_item_sk = item.i_item_sk)
> JOIN customer_demographics on (store_sales.ss_cdemo_sk = 
> customer_demographics.cd_demo_sk)
> JOIN store on (store_sales.ss_store_sk = store.s_store_sk)
> where
>cd_gender = 'F' and
>cd_marital_status = 'U' and
>cd_education_status = 'Primary' and
>d_year = 2002 and
>s_state in ('GA','PA', 'LA', 'SC', 'MI', 'AL')
> group by
>i_item_id,
>s_state
> order by
>i_item_id,
>s_state
> limit 100;
> {code}
> I turned off noconditionaltask. So, I expected that there would be 4 Map-only 
> jobs for this query. However, I got 1 Map-only job (joining store_sales and 
> date_dim) and 3 MR jobs (for reduce joins).
> So, I checked the conditional task determining the plan of the join involving 
> item. In 



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5945:
--

 Summary: ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask 
sums all tables' sizes including those tables which are not used in the child 
of this conditional task.
 Key: HIVE-5945
 URL: https://issues.apache.org/jira/browse/HIVE-5945
 Project: Hive
  Issue Type: Bug
Reporter: Yin Huai






--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Component/s: Query Processor

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (HIVE-5945) ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' sizes including those tables which are not used in the child of this conditional task.

2013-12-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5945:
---

Affects Version/s: 0.13.0

> ql.plan.ConditionalResolverCommonJoin.resolveMapJoinTask sums all tables' 
> sizes including those tables which are not used in the child of this 
> conditional task.
> 
>
> Key: HIVE-5945
> URL: https://issues.apache.org/jira/browse/HIVE-5945
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2013-12-02 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837275#comment-13837275
 ] 

Yin Huai commented on HIVE-5922:


For the first trace, the desired position is 21496054 and the matching range is 
"range 2 = 20447466 to 1048588"; note that 20447466 + 1048588 = 21496054, so the 
desired position equals the end of that range (offsets[i] + 
bytes[i].remaining()). For the second trace, the desired position is 20447466 
and the matching range is "range 6 = 18612437 to 1835029" (again, 18612437 + 
1835029 = 20447466). 

When I turned off predicate pushdown, or used predicate pushdown with 
uncompressed data, I did not see this problem.
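A minimal sketch of the boundary condition (with assumed names and plain arrays; the real logic lives in orc.InStream.CompressedStream.seek): a desired position equal to offsets[i] + bytes[i].remaining() sits exactly at the end of a range, so the range check has to admit the inclusive upper bound:

```java
// Sketch of the range check, using plain arrays instead of the reader's
// ByteBuffer list. offsets[i] + lengths[i] plays the role of
// offsets[i] + bytes[i].remaining().
public class SeekBoundarySketch {

    // Return the index of the range containing `desired`, treating the end of
    // a range as a valid (zero-bytes-remaining) position; -1 models the
    // "Seek outside of data in compressed stream" exception.
    static int findRange(long[] offsets, long[] lengths, long desired) {
        for (int i = 0; i < offsets.length; i++) {
            // Using <= (not <) on the upper bound admits the end-of-range
            // position described in this issue.
            if (desired >= offsets[i] && desired <= offsets[i] + lengths[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] offsets = {13893791L, 17039555L, 20447466L};
        long[] lengths = {1048588L, 1310735L, 1048588L};
        // 20447466 + 1048588 == 21496054: exactly the end of the third range.
        System.out.println(findRange(offsets, lengths, 21496054L)); // prints 2
    }
}
```

With a strict `<` on the upper bound, the same call would return -1 and the reader would throw, matching the stack traces above.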

> In orc.InStream.CompressedStream, the desired position passed to seek can 
> equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled
> 
>
> Key: HIVE-5922
> URL: https://issues.apache.org/jira/browse/HIVE-5922
> Project: Hive
>  Issue Type: Bug
>  Components: File Formats
>Reporter: Yin Huai
>
> Two stack traces ...
> {code}
> java.io.IOException: IO error in map input file 
> hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/04_0
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.io.IOException: java.io.IOException: Seek outside of data in 
> compressed stream Stream for column 9 kind DATA position: 21496054 length: 
> 33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 
> 1048588;  range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  
> range 3 = 23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 
> 30409052 to 1310735 uncompressed: 262144 to 262144 to 21496054
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>   at 
> org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>   at 
> org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>   at 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
>   at 
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
>   ... 9 more
> Caused by: java.io.IOException: Seek outside of data in compressed stream 
> Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 
> offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 
> 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 
> 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 
> uncompressed: 262144 to 262144 to 21496054
>   at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
>   at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
>   at 
> org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
>   at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
>   at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
>   at 
> org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

[jira] [Created] (HIVE-5922) In orc.InStream.CompressedStream, the desired position passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate pushdown is enabled

2013-12-02 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5922:
--

 Summary: In orc.InStream.CompressedStream, the desired position 
passed to seek can equal offsets[i] + bytes[i].remaining() when ORC predicate 
pushdown is enabled
 Key: HIVE-5922
 URL: https://issues.apache.org/jira/browse/HIVE-5922
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Yin Huai


Two stack traces ...
{code}
java.io.IOException: IO error in map input file 
hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/04_0
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.io.IOException: Seek outside of data in 
compressed stream Stream for column 9 kind DATA position: 21496054 length: 
33790900 range: 2 offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588; 
 range 1 = 17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 
23855377 to 1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 
1310735 uncompressed: 262144 to 262144 to 21496054
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at 
org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:276)
at 
org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at 
org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:108)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:230)
... 9 more
Caused by: java.io.IOException: Seek outside of data in compressed stream 
Stream for column 9 kind DATA position: 21496054 length: 33790900 range: 2 
offset: 1048588 limit: 1048588 range 0 = 13893791 to 1048588;  range 1 = 
17039555 to 1310735;  range 2 = 20447466 to 1048588;  range 3 = 23855377 to 
1048588;  range 4 = 27263288 to 1048588;  range 5 = 30409052 to 1310735 
uncompressed: 262144 to 262144 to 21496054
at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.seek(InStream.java:328)
at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.readHeader(InStream.java:161)
at 
org.apache.hadoop.hive.ql.io.orc.InStream$CompressedStream.read(InStream.java:205)
at 
org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readInts(SerializationUtils.java:450)
at 
org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readDirectValues(RunLengthIntegerReaderV2.java:240)
at 
org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:53)
at 
org.apache.hadoop.hive.ql.io.orc.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:288)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$IntTreeReader.next(RecordReaderImpl.java:510)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl$StructTreeReader.next(RecordReaderImpl.java:1581)
at 
org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:2707)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:110)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$OrcRecordReader.next(OrcInputFormat.java:86)
at 
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
... 13 more
{code}

{code}
java.io.IOException: IO error in map input file 
hdfs://10.38.55.204:8020/user/hive/warehouse/ssdb_bin_compress_orc_large_0_13.db/cycle/95_0
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:210)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at o

[jira] [Commented] (HIVE-5910) In HiveConf, the name of mapred.min.split.size.per.rack is MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is MAPREDMINSPLITSIZEPERRACK

2013-11-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835921#comment-13835921
 ] 

Yin Huai commented on HIVE-5910:


[~leftylev] Actually, these two are MapReduce configurations. It seems they are 
used internally in Hive. I am not sure if we need to add them to our conf 
template.
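For reference, the fix itself is just to swap the two names so each constant carries the matching MapReduce key. A sketch modeled on the quoted HiveConf.java snippet below (the enum scaffolding here is an assumption; only the name swap matters):

```java
// Sketch of the corrected constants in the HiveConf.ConfVars style.
public class ConfVarsSketch {
    enum ConfVars {
        MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.node", 1L),
        MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.rack", 1L);

        final String varname;     // the MapReduce configuration key
        final long defaultLongVal;

        ConfVars(String varname, long defaultLongVal) {
            this.varname = varname;
            this.defaultLongVal = defaultLongVal;
        }
    }

    public static void main(String[] args) {
        // Each constant now maps to the matching MapReduce key.
        System.out.println(ConfVars.MAPREDMINSPLITSIZEPERNODE.varname);
        System.out.println(ConfVars.MAPREDMINSPLITSIZEPERRACK.varname);
    }
}
```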

> In HiveConf, the name of mapred.min.split.size.per.rack is 
> MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is 
> MAPREDMINSPLITSIZEPERRACK
> 
>
> Key: HIVE-5910
> URL: https://issues.apache.org/jira/browse/HIVE-5910
> Project: Hive
>  Issue Type: Bug
>Reporter: Yin Huai
>
> In HiveConf.java ...
> {code}
> MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.rack", 1L),
> MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.node", 1L),
> {code}
> Then, in ExecDriver.java ...
> {code}
> if (mWork.getMinSplitSizePerNode() != null) {
>   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERNODE, 
> mWork.getMinSplitSizePerNode().longValue());
> }
>  if (mWork.getMinSplitSizePerRack() != null) {
>   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERRACK, 
> mWork.getMinSplitSizePerRack().longValue());
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Moved] (HIVE-5910) In HiveConf, the name of mapred.min.split.size.per.rack is MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is MAPREDMINSPLITSIZEPERRACK

2013-11-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai moved MAPREDUCE-5659 to HIVE-5910:
---

Key: HIVE-5910  (was: MAPREDUCE-5659)
Project: Hive  (was: Hadoop Map/Reduce)

> In HiveConf, the name of mapred.min.split.size.per.rack is 
> MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is 
> MAPREDMINSPLITSIZEPERRACK
> 
>
> Key: HIVE-5910
> URL: https://issues.apache.org/jira/browse/HIVE-5910
> Project: Hive
>  Issue Type: Bug
>Reporter: Yin Huai
>
> In HiveConf.java ...
> {code}
> MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.rack", 1L),
> MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.node", 1L),
> {code}
> Then, in ExecDriver.java ...
> {code}
> if (mWork.getMinSplitSizePerNode() != null) {
>   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERNODE, 
> mWork.getMinSplitSizePerNode().longValue());
> }
>  if (mWork.getMinSplitSizePerRack() != null) {
>   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERRACK, 
> mWork.getMinSplitSizePerRack().longValue());
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5910) In HiveConf, the name of mapred.min.split.size.per.rack is MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is MAPREDMINSPLITSIZEPERRACK

2013-11-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835768#comment-13835768
 ] 

Yin Huai commented on HIVE-5910:


my bad... did not notice the project when I created it... It has been moved to 
Hive. Thanks for letting me know, Ted :)

> In HiveConf, the name of mapred.min.split.size.per.rack is 
> MAPREDMINSPLITSIZEPERNODE and the name of mapred.min.split.size.per.node is 
> MAPREDMINSPLITSIZEPERRACK
> 
>
> Key: HIVE-5910
> URL: https://issues.apache.org/jira/browse/HIVE-5910
> Project: Hive
>  Issue Type: Bug
>Reporter: Yin Huai
>
> In HiveConf.java ...
> {code}
> MAPREDMINSPLITSIZEPERNODE("mapred.min.split.size.per.rack", 1L),
> MAPREDMINSPLITSIZEPERRACK("mapred.min.split.size.per.node", 1L),
> {code}
> Then, in ExecDriver.java ...
> {code}
> if (mWork.getMinSplitSizePerNode() != null) {
>   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERNODE, 
> mWork.getMinSplitSizePerNode().longValue());
> }
>  if (mWork.getMinSplitSizePerRack() != null) {
>   HiveConf.setLongVar(job, HiveConf.ConfVars.MAPREDMINSPLITSIZEPERRACK, 
> mWork.getMinSplitSizePerRack().longValue());
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-11-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833857#comment-13833857
 ] 

Yin Huai commented on HIVE-5891:


I think the main problem is that mergeMapJoinTaskIntoItsChildMapRedTask happens 
in the physical optimization phase, which is after we break the plan using 
GenMapRedUtils. In this case,
{code}
while (cplan.getMapWork().getAliasToWork().get(streamDesc) != null) {
  streamDesc = origStreamDesc.concat(String.valueOf(++pos));
}
{code}
will not help, because those MapJoins were ReduceJoins and they were in 
different MR jobs. Also, it seems the pattern that triggers the bug looks like 
this:
{code}
          Union or Join
          /          \
    MapJoin1        MapJoin2
    /      \        /      \
  MR1    small1   MR2    small2
{code}
Here, MR1 and MR2 are two MapReduce jobs that generate intermediate 
datasets, and small1 and small2 are two small tables. When 
mergeMapJoinTaskIntoItsChildMapRedTask attaches MapJoin1 and MapJoin2 to the 
map phase of the job for the Union or Join, MR1 and MR2 have the same alias. 
Actually, I am thinking the id of a QB may be a good alias for an 
intermediate dataset. Thoughts?
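A minimal sketch (hypothetical names, strings standing in for operator trees) of why a map keyed by alias drops one of the merged MapJoin trees, and how renaming on collision in the style of the streamDesc loop quoted above keeps both:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: models the aliasToWork clobbering described in this issue.
public class AliasConflictDemo {

    // Registers each operator tree under its alias, appending a counter
    // until the alias is unused (e.g. "$INTNAME1").
    static Map<String, String> merge(String alias, String... trees) {
        Map<String, String> aliasToWork = new LinkedHashMap<>();
        for (String tree : trees) {
            String candidate = alias;
            int pos = 0;
            while (aliasToWork.containsKey(candidate)) {
                candidate = alias + (++pos);
            }
            aliasToWork.put(candidate, tree);
        }
        return aliasToWork;
    }

    public static void main(String[] args) {
        // Without renaming, the second put overwrites the first:
        Map<String, String> clobbered = new HashMap<>();
        clobbered.put("$INTNAME", "MapJoin1 tree");
        clobbered.put("$INTNAME", "MapJoin2 tree");
        System.out.println(clobbered.size()); // 1 -- MapJoin1's tree is lost

        // With renaming, both trees survive under distinct aliases:
        System.out.println(merge("$INTNAME", "MapJoin1 tree", "MapJoin2 tree"));
    }
}
```

This only illustrates the map semantics; the real fix also has to produce aliases that are stable across plan stages, which is what the QB-id suggestion above is about.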

I think your change will not affect DemuxOperator because before GenMapRedUtils 
starts to work, Correlation Optimizer (HIVE-2206) has already generated the 
optimized plan. But let's give it a try. Can you try this query and see if 
there is anything wrong?
{code:sql}
set hive.optimize.correlation=true;
SELECT tmp1.key
FROM (SELECT key, value
      FROM src
      GROUP BY key, value) tmp1
JOIN (SELECT key, value
      FROM src
      GROUP BY key, value) tmp2
  ON (tmp1.key = tmp2.key)
JOIN (SELECT key
      FROM src
      GROUP BY key) tmp3
  ON (tmp2.key = tmp3.key)
GROUP BY tmp1.key
{code}
The plan should have three MR jobs: the first evaluates tmp1, the second 
evaluates tmp2, and the third evaluates the join of tmp1, tmp2, and tmp3 
together with the group by.


> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --
>
> Key: HIVE-5891
> URL: https://issues.apache.org/jira/browse/HIVE-5891
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Attachments: HIVE-5891.1.patch
>
>
> Use the following test case with HIVE 0.12:
> {quote}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
>   union all
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
> ) x;
> {quote}
> We will get a NullPointerException from Union Operator:
> {quote}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>   ... 4 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>   at 
> org.apache.hadoop.hive.ql.exec.MapJoi

[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-11-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833409#comment-13833409
 ] 

Yin Huai commented on HIVE-5891:


Thanks [~sunrui] for confirming the plan. Will "JOIN_INTERMEDIATE" give the 
impression that the dataset is an intermediate dataset produced during join 
processing instead of an input dataset?

Also, I am sorry, but I did not get your question about DemuxOperator. Why is 
DemuxOperator related to this issue? I think Demux is not affected by your 
change since it is an operator on the reducer side. 

> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --
>
> Key: HIVE-5891
> URL: https://issues.apache.org/jira/browse/HIVE-5891
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Attachments: HIVE-5891.1.patch
>
>
> Use the following test case with HIVE 0.12:
> {quote}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
>   union all
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
> ) x;
> {quote}
> We will get a NullPointerException from Union Operator:
> {quote}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>   ... 4 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>   at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>   ... 5 more
> {quote}
>   
> The root cause is in 
> CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
>   +--------------+  +--------------+
>   | MapJoin task |  | MapJoin task |
>   +--------------+  +--------------+
>          \               /
>           \             /
>          +--------------+
>          |  Union task  |
>          +--------------+
>  
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
> child: Union task. The two MapJoin tasks have the same alias name for their 
> big tables: $INTNAME, which is the name of the temporary table of a join 
> stream. The aliasToWork map uses alias as key, so eventually only the MapJoin 
> operator tree of one MapJoin task is saved into the aliasToWork map of the 
> Union task, while the MapJoin operator tree of another MapJoin task is lost. 
> As a result, Union operator won't be initialized because not all of its 
> parents gets

[jira] [Commented] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

2013-11-26 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832779#comment-13832779
 ] 

Yin Huai commented on HIVE-5891:


[~sunrui] What will the plan of the query in the description look like with 
your patch? Will the MapJoins and the Union be executed in the same job? It 
seems those two tmps appearing in the same position in those MapJoins triggered 
the bug. I was wondering whether the ids in those two QBJoinTrees are the same; 
if so, the aliases of those two tables are probably still the same. 0.11 does 
not have this bug because it does not use a single job to evaluate those 
MapJoins and the Union.

I do not think it will affect Demux since Demux is at the reducer side.

btw, I also think "$INTNAME" is confusing... It seems to be used to represent 
those intermediate results. I'd like a name with a meaningful part that shows 
how these intermediate results are generated and a unique part to address the 
issue shown in this jira.

> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --
>
> Key: HIVE-5891
> URL: https://issues.apache.org/jira/browse/HIVE-5891
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 0.12.0
>Reporter: Sun Rui
>Assignee: Sun Rui
> Attachments: HIVE-5891.1.patch
>
>
> Use the following test case with HIVE 0.12:
> {quote}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
>   union all
>   select c.key from
> (select a.key from src a join src b on a.key=b.key group by a.key) tmp
> join src c on tmp.key=c.key
> ) x;
> {quote}
> We will get a NullPointerException from Union Operator:
> {quote}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>   ... 4 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>   at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>   at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>   at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>   at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>   ... 5 more
> {quote}
>   
> The root cause is in 
> CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
>   +--------------+  +--------------+
>   | MapJoin task |  | MapJoin task |
>   +--------------+  +--------------+
>          \               /
>           \             /
>          +--------------+
>          |  Union task  |
>          +--------------+
>  
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
> child: Union task. The two MapJoin tasks have the same alias n

[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: (was: HIVE-5697.2.patch)

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.2.patch

reuploading patch .2

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-11-09 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Patch Available  (was: Open)

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Patch Available  (was: Open)

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.2.patch

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: (was: HIVE-5697.2.patch)

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Open  (was: Patch Available)

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.2.patch

added a test query

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Status: Patch Available  (was: Open)

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch, HIVE-5697.2.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Attachment: HIVE-5697.1.patch

Will add test later.

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5697.1.patch
>
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Created] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)
Yin Huai created HIVE-5697:
--

 Summary: Correlation Optimizer may generate wrong plans for cases 
involving outer join
 Key: HIVE-5697
 URL: https://issues.apache.org/jira/browse/HIVE-5697
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0, 0.13.0
Reporter: Yin Huai
Assignee: Yin Huai


For example,
{code:sql}
select x.key, y.value, count(*) from src x right outer join src1 y on 
(x.key=y.key and x.value=y.value) group by x.key, y.value; 
{code}
Correlation optimizer will determine that a single MR job is enough for this 
query. However, the group-by keys come from both the left and right tables of 
the right outer join. 

We will have a wrong result like
{code}
NULL  4
NULL  val_165 1
NULL  val_193 1
NULL  val_265 1
NULL  val_27  1
NULL  val_409 1
NULL  val_484 1
NULL  1
146   val_146 2
150   val_150 1
213   val_213 2
NULL  1
238   val_238 2
255   val_255 2
273   val_273 3
278   val_278 2
311   val_311 3
NULL  1
401   val_401 5
406   val_406 4
66    val_66  1
98    val_98  2
{code}
Rows in which both x.key and y.value are NULL may not be grouped.
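The failure mode above can be sketched in a few lines (hypothetical names; strings stand in for column values). If the single MR job partitions on the join key, two NULL-padded rows that share the group-by key (NULL, y.value) can land on different reducers and are never combined:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: models why reusing the join's shuffle for the group-by is
// unsafe when the group-by keys come from both sides of a right outer join.
public class OuterJoinGroupByDemo {

    // rows are {x.key, y.value, y.key}. Partitions by y.key (the join key),
    // groups locally by (x.key, y.value), and returns per-reducer counts.
    static Map<Integer, Map<String, Integer>> run(String[][] rows, int numReducers) {
        Map<Integer, Map<String, Integer>> reducers = new HashMap<>();
        for (String[] r : rows) {
            int partition = Math.floorMod(r[2].hashCode(), numReducers); // shuffle on y.key
            String groupKey = r[0] + "," + r[1];                         // (x.key, y.value)
            reducers.computeIfAbsent(partition, p -> new HashMap<>())
                    .merge(groupKey, 1, Integer::sum);
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Two unmatched right-side rows: x.key is null, same y.value,
        // but different original join keys "a" and "b".
        String[][] rows = { {null, "v", "a"}, {null, "v", "b"} };
        // Both rows have group-by key (null, v), yet they are shuffled to
        // different reducers, yielding two partial groups of count 1
        // instead of one group of count 2.
        System.out.println(run(rows, 2));
    }
}
```

A correct plan has to reshuffle on the group-by key (a second MR job) whenever the group-by key can be NULL-padded by the outer join.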





[jira] [Updated] (HIVE-5697) Correlation Optimizer may generate wrong plans for cases involving outer join

2013-10-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5697:
---

Issue Type: Sub-task  (was: Bug)
Parent: HIVE-3667

> Correlation Optimizer may generate wrong plans for cases involving outer join
> -
>
> Key: HIVE-5697
> URL: https://issues.apache.org/jira/browse/HIVE-5697
> Project: Hive
>  Issue Type: Sub-task
>Affects Versions: 0.12.0, 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> For example,
> {code:sql}
> select x.key, y.value, count(*) from src x right outer join src1 y on 
> (x.key=y.key and x.value=y.value) group by x.key, y.value; 
> {code}
> Correlation optimizer will determine that a single MR job is enough for this 
> query. However, the group-by keys come from both the left and right tables 
> of the right outer join. 
> We will have a wrong result like
> {code}
> NULL  4
> NULL  val_165 1
> NULL  val_193 1
> NULL  val_265 1
> NULL  val_27  1
> NULL  val_409 1
> NULL  val_484 1
> NULL  1
> 146   val_146 2
> 150   val_150 1
> 213   val_213 2
> NULL  1
> 238   val_238 2
> 255   val_255 2
> 273   val_273 3
> 278   val_278 2
> 311   val_311 3
> NULL  1
> 401   val_401 5
> 406   val_406 4
> 66    val_66  1
> 98    val_98  2
> {code}
> Rows in which both x.key and y.value are NULL may not be grouped.





[jira] [Commented] (HIVE-5610) Merge maven branch into trunk

2013-10-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804379#comment-13804379
 ] 

Yin Huai commented on HIVE-5610:


my bad... I did not notice that... 

Tried again. The build worked great. Thanks Brock :)

> Merge maven branch into trunk
> -
>
> Key: HIVE-5610
> URL: https://issues.apache.org/jira/browse/HIVE-5610
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Brock Noland
>Assignee: Brock Noland
>
> With HIVE-5566 nearing completion we will be nearly ready to merge the maven 
> branch to trunk. The following tasks will be done post-merge:
> * HIVE-5611 - Add assembly (i.e.) tar creation to pom
> * HIVE-5612 - Add ability to re-generate generated code stored in source 
> control
> The merge process will be as follows:
> 1) svn merge ^/hive/branches/maven
> 2) Commit result
> 3) Modify the following line in maven-rollforward.sh:
> {noformat}
>   mv $source $target
> {noformat}
> to
> {noformat}
>   svn mv $source $target
> {noformat}
> 4) Execute maven-rollforward.sh
> 5) Commit result 
> 6) Update trunk-mr1.properties and trunk-mr2.properties on the ptesting host, 
> adding the following:
> {noformat}
> mavenEnvOpts = -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 
> testCasePropertyName = test
> buildTool = maven
> unitTests.directories = ./
> {noformat}
> Notes:
> * To build everything you must:
> {noformat}
> $ mvn clean install -DskipTests
> $ cd itests
> $ mvn clean install -DskipTests
> {noformat}
> because itests (any tests that have cyclical dependencies or require that the 
> packages be built) is not part of the root reactor build.





[jira] [Commented] (HIVE-5610) Merge maven branch into trunk

2013-10-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804341#comment-13804341
 ] 

Yin Huai commented on HIVE-5610:


Not an expert on Maven. Here is what I tried.
I first tried 
{code}
mvn clean package -DskipTests
{code}
Then I got the following error when Maven was compiling Hive common
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on 
project hive-common: Compilation failure: Compilation failure:
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:[43,36]
 package org.apache.hadoop.hive.shims does not exist
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:[1027,5]
 cannot find symbol
[ERROR] symbol  : variable ShimLoader
[ERROR] location: class org.apache.hadoop.hive.conf.HiveConf
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:[1271,34]
 cannot find symbol
[ERROR] symbol  : variable ShimLoader
[ERROR] location: class org.apache.hadoop.hive.conf.HiveConf
[ERROR] -> [Help 1]
{code} 
After I checked the shims jars, I found that classes were not packed into 
those jars because of the directory structure. So I set the source dirs in 
those shims pom files, e.g. 
{code}

<sourceDirectory>${basedir}/../src/common-secure/java</sourceDirectory>
<testSourceDirectory>${basedir}/../src/common-secure/test</testSourceDirectory>

{code}
Then I got errors when Maven was compiling the tests of common-secure. For example,
{code}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.1:testCompile 
(default-testCompile) on project common-secure: Compilation failure: 
Compilation failure:
[ERROR] 
/home/yhuai/Projects/Hive/hive-trunk/shims/common-secure/../src/common-secure/test/org/apache/hadoop/hive/thrift/TestDBTokenStore.java:[26,54]
 package org.apache.hadoop.hive.metastore.HiveMetaStore does not exist
{code}
So I asked Maven not to compile the tests 
{code}
mvn clean install -Dmaven.test.skip=true
{code}
Then I got 
{code}
[ERROR] Failed to execute goal on project hive-service: Could not resolve 
dependencies for project org.apache.hive:hive-service:jar:0.13.0-SNAPSHOT: 
Could not find artifact org.apache.hive:hive-exec:jar:tests:0.13.0-SNAPSHOT -> 
[Help 1]
{code}
It seems the scope of hive-exec:jar:tests:0.13.0-SNAPSHOT in hive-service is 
test. Why did Maven still try to resolve this dependency?

> Merge maven branch into trunk
> -
>
> Key: HIVE-5610
> URL: https://issues.apache.org/jira/browse/HIVE-5610
> Project: Hive
>  Issue Type: Sub-task
>Reporter: Brock Noland
>Assignee: Brock Noland
>
> With HIVE-5566 nearing completion we will be nearly ready to merge the maven 
> branch to trunk. The following tasks will be done post-merge:
> * HIVE-5611 - Add assembly (i.e.) tar creation to pom
> * HIVE-5612 - Add ability to re-generate generated code stored in source 
> control
> The merge process will be as follows:
> 1) svn merge ^/hive/branches/maven
> 2) Commit result
> 3) Modify the following line in maven-rollforward.sh:
> {noformat}
>   mv $source $target
> {noformat}
> to
> {noformat}
>   svn mv $source $target
> {noformat}
> 4) Execute maven-rollforward.sh
> 5) Commit result 
> 6) Update trunk-mr1.properties and trunk-mr2.properties on the ptesting host, 
> adding the following:
> {noformat}
> mavenEnvOpts = -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 
> testCasePropertyName = test
> buildTool = maven
> unitTests.directories = ./
> {noformat}
> Notes:
> * To build everything you must:
> {noformat}
> $ mvn clean install -DskipTests
> $ cd itests
> $ mvn clean install -DskipTests
> {noformat}
> because itests (any tests that have cyclical dependencies or require that the 
> packages be built) is not part of the root reactor build.





[jira] [Updated] (HIVE-5592) Add an option to convert enum as struct as of Hive 0.8

2013-10-21 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5592:
---

Description: 
HIVE-3323 introduced the incompatible change: Hive handling of enum types has 
been changed to always return the string value rather than struct. 
But it didn't add the option "hive.data.convert.enum.to.string"  as planned and 
thus broke all Enum usage prior to 0.10.


  was:
HIVE-3222 introduced the incompatible change: Hive handling of enum types has 
been changed to always return the string value rather than struct. 
But it didn't add the option "hive.data.convert.enum.to.string"  as planned and 
thus broke all Enum usage prior to 0.10.



> Add an option to convert enum as struct as of Hive 0.8
> -
>
> Key: HIVE-5592
> URL: https://issues.apache.org/jira/browse/HIVE-5592
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0, 0.12.0
>Reporter: Jie Li
>
> HIVE-3323 introduced the incompatible change: Hive handling of enum types has 
> been changed to always return the string value rather than struct. 
> But it didn't add the option "hive.data.convert.enum.to.string"  as planned 
> and thus broke all Enum usage prior to 0.10.





[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Attachment: HIVE-5546.2.patch

Sure. I have removed includedStr (I kept the log of "included column ids ="). 
Thanks Sergey :)

> A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
> --
>
> Key: HIVE-5546
> URL: https://issues.apache.org/jira/browse/HIVE-5546
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5546.1.patch, HIVE-5546.2.patch
>
>
> {code}
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included column ids = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included columns names = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> No ORC pushdown predicate
> 2013-10-15 10:49:49,834 INFO 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
> 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 1
> 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName yhuai for UID 1000 from the native implementation
> 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> If includedColumnIds is an empty list, we do not need to read any column. 
> But, right now, in OrcInputFormat.findIncludedColumns, we have ...
> {code}
> if (ColumnProjectionUtils.isReadAllColumns(conf) ||
>   includedStr == null || includedStr.trim().length() == 0) {
>   return null;
> } 
> {code}
> If includedStr is an empty string, the code assumes that we need all columns, 
> which is not correct.
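The corrected check can be sketched in Python (a hypothetical rendering, not Hive's actual Java; `None` stands in for a null includedStr, which per the discussion should only occur when all columns are read):

```python
def find_included_columns(read_all_columns, included_str):
    # Corrected logic (sketch): only the read-all-columns flag, or a null
    # string that accompanies it, means "read every column".
    if read_all_columns or included_str is None:
        return None  # None -> the reader materializes all columns
    if included_str.strip() == "":
        return []    # empty string -> read no columns (e.g. count(*))
    return [int(i) for i in included_str.split(",")]
```

The key distinction the buggy code missed is between a null string (read everything) and an empty string (read nothing).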





[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Affects Version/s: 0.13.0

> A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
> --
>
> Key: HIVE-5546
> URL: https://issues.apache.org/jira/browse/HIVE-5546
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5546.1.patch
>
>
> {code}
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included column ids = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included columns names = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> No ORC pushdown predicate
> 2013-10-15 10:49:49,834 INFO 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
> 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 1
> 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName yhuai for UID 1000 from the native implementation
> 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> If includedColumnIds is an empty list, we do not need to read any column. 
> But, right now, in OrcInputFormat.findIncludedColumns, we have ...
> {code}
> if (ColumnProjectionUtils.isReadAllColumns(conf) ||
>   includedStr == null || includedStr.trim().length() == 0) {
>   return null;
> } 
> {code}
> If includedStr is an empty string, the code assumes that we need all columns, 
> which is not correct.





[jira] [Commented] (HIVE-2419) CREATE TABLE AS SELECT should create warehouse directory

2013-10-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795317#comment-13795317
 ] 

Yin Huai commented on HIVE-2419:


It seems MoveTask.moveFile(Path, Path, boolean) throws this exception when it is 
trying to rename the path.

> CREATE TABLE AS SELECT should create warehouse directory
> 
>
> Key: HIVE-2419
> URL: https://issues.apache.org/jira/browse/HIVE-2419
> Project: Hive
>  Issue Type: Bug
>Reporter: David Phillips
> Attachments: HIVE-2419.1.patch
>
>
> If you run a CTAS statement on a fresh Hive install without a warehouse 
> directory (as is the case with Amazon EMR), it runs the query but errors out 
> at the end:
> {quote}
> hive> create table foo as select * from t_message limit 1;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> ...
> Ended Job = job_201108301753_0001
> Moving data to: 
> hdfs://ip-10-202-22-194.ec2.internal:9000/mnt/hive_07_1/warehouse/foo
> Failed with exception Unable to rename: 
> hdfs://ip-10-202-22-194.ec2.internal:9000/mnt/var/lib/hive_07_1/tmp/scratch/hive_2011-08-30_18-04-36_809_6130923980133666976/-ext-10001
>  to: hdfs://ip-10-202-22-194.ec2.internal:9000/mnt/hive_07_1/warehouse/foo
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.MoveTask
> {quote}
> This is different behavior from a simple CREATE TABLE, which creates the 
> warehouse directory.
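The failure can be reproduced on a local filesystem. Below is a minimal sketch under the assumption that the fix is to create the missing warehouse (destination parent) directory before the rename; the paths and function name are illustrative, not MoveTask's actual code:

```python
import os
import shutil
import tempfile

def move_results(src, dest):
    # Sketch of the expected fix: create the destination's parent (the
    # warehouse directory) before renaming, since a bare rename fails
    # when the parent directory does not exist.
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    os.rename(src, dest)

# Simulate a fresh install: the scratch dir exists, the warehouse does not.
root = tempfile.mkdtemp()
src = os.path.join(root, "scratch", "ext-10001")
os.makedirs(src)
dest = os.path.join(root, "warehouse", "foo")

move_results(src, dest)
moved_ok = os.path.isdir(dest)
shutil.rmtree(root)
```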





[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Status: Patch Available  (was: Open)

[~sershe] [~ashutoshc] Can you take a look?

> A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
> --
>
> Key: HIVE-5546
> URL: https://issues.apache.org/jira/browse/HIVE-5546
> Project: Hive
>  Issue Type: Bug
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5546.1.patch
>
>
> {code}
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included column ids = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included columns names = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> No ORC pushdown predicate
> 2013-10-15 10:49:49,834 INFO 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
> 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 1
> 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName yhuai for UID 1000 from the native implementation
> 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> If includedColumnIds is an empty list, we do not need to read any column. 
> But, right now, in OrcInputFormat.findIncludedColumns, we have ...
> {code}
> if (ColumnProjectionUtils.isReadAllColumns(conf) ||
>   includedStr == null || includedStr.trim().length() == 0) {
>   return null;
> } 
> {code}
> If includedStr is an empty string, the code assumes that we need all columns, 
> which is not correct.





[jira] [Updated] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-5546:
---

Attachment: HIVE-5546.1.patch

Tried both
{code}
select count(1) from web_sales_orc;
{code}
and 
{code}
select count(*) from web_sales_orc;
{code}

Here are the results on an sf=1 TPC-DS dataset.
{code}
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.96 sec   HDFS Read: 17112 HDFS 
Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 960 msec
{code}

> A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
> --
>
> Key: HIVE-5546
> URL: https://issues.apache.org/jira/browse/HIVE-5546
> Project: Hive
>  Issue Type: Bug
>Reporter: Yin Huai
>Assignee: Yin Huai
> Attachments: HIVE-5546.1.patch
>
>
> {code}
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included column ids = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included columns names = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> No ORC pushdown predicate
> 2013-10-15 10:49:49,834 INFO 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
> 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 1
> 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName yhuai for UID 1000 from the native implementation
> 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> If includedColumnIds is an empty list, we do not need to read any column. 
> But, right now, in OrcInputFormat.findIncludedColumns, we have ...
> {code}
> if (ColumnProjectionUtils.isReadAllColumns(conf) ||
>   includedStr == null || includedStr.trim().length() == 0) {
>   return null;
> } 
> {code}
> If includedStr is an empty string, the code assumes that we need all columns, 
> which is not correct.





[jira] [Commented] (HIVE-5546) A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391

2013-10-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13795280#comment-13795280
 ] 

Yin Huai commented on HIVE-5546:


Based on my understanding, includedStr in 
OrcInputFormat.findIncludedColumns(List, Configuration) is null if and 
only if ColumnProjectionUtils.isReadAllColumns(conf) returns true. 

> A change in ORCInputFormat made by HIVE-4113 was reverted by HIVE-5391
> --
>
> Key: HIVE-5546
> URL: https://issues.apache.org/jira/browse/HIVE-5546
> Project: Hive
>  Issue Type: Bug
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> {code}
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included column ids = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> included columns names = 
> 2013-10-15 10:49:49,386 INFO org.apache.hadoop.hive.ql.io.orc.OrcInputFormat: 
> No ORC pushdown predicate
> 2013-10-15 10:49:49,834 INFO 
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file 
> hdfs://localhost:54310/user/hive/warehouse/web_sales_orc/00_0
> 2013-10-15 10:49:49,834 INFO org.apache.hadoop.mapred.MapTask: 
> numReduceTasks: 1
> 2013-10-15 10:49:49,840 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 
> 100
> 2013-10-15 10:49:49,968 INFO org.apache.hadoop.mapred.TaskLogsTruncater: 
> Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: 
> Initialized cache for UID to User mapping with a cache timeout of 14400 
> seconds.
> 2013-10-15 10:49:49,994 INFO org.apache.hadoop.io.nativeio.NativeIO: Got 
> UserName yhuai for UID 1000 from the native implementation
> 2013-10-15 10:49:49,996 FATAL org.apache.hadoop.mapred.Child: Error running 
> child : java.lang.OutOfMemoryError: Java heap space
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> {code}
> If includedColumnIds is an empty list, we do not need to read any column




