[jira] [Updated] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

Sun Rui (JIRA) Fri, 20 Dec 2013 05:07:19 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sun Rui updated HIVE-5891:
--------------------------

    Attachment: HIVE-5891.2.patch

> Alias conflict when merging multiple mapjoin tasks into their common child 
> mapred task
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-5891
>                 URL: https://issues.apache.org/jira/browse/HIVE-5891
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: Sun Rui
>            Assignee: Sun Rui
>         Attachments: HIVE-5891.1.patch, HIVE-5891.2.patch
>
>
> Use the following test case with HIVE 0.12:
> {code:sql}
> create table src(key int, value string);
> load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
> select * from (
>   select c.key from
>     (select a.key from src a join src b on a.key=b.key group by a.key) tmp
>     join src c on tmp.key=c.key
>   union all
>   select c.key from
>     (select a.key from src a join src b on a.key=b.key group by a.key) tmp
>     join src c on tmp.key=c.key
> ) x;
> {code}
> We will get a NullPointerException from Union Operator:
> {noformat}
> java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Hive Runtime Error while processing row {"_col0":0}
>       at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:175)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>       at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>       at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime 
> Error while processing row {"_col0":0}
>       at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:544)
>       at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:157)
>       ... 4 more
> Caused by: java.lang.NullPointerException
>       at 
> org.apache.hadoop.hive.ql.exec.UnionOperator.processOp(UnionOperator.java:120)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>       at 
> org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:88)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>       at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:652)
>       at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genUniqueJoinObject(CommonJoinOperator.java:655)
>       at 
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:758)
>       at 
> org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:220)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>       at 
> org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:91)
>       at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:504)
>       at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:842)
>       at 
> org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:534)
>       ... 5 more
> {noformat}
>   
> The root cause is in 
> CommonJoinTaskDispatcher.mergeMapJoinTaskIntoItsChildMapRedTask().
> {noformat}
>   +--------------+      +--------------+
>   | MapJoin task |      | MapJoin task |
>   +--------------+      +--------------+
>              \             /
>               \           /
>              +--------------+
>              |  Union task  |
>              +--------------+
> {noformat}             
> CommonJoinTaskDispatcher merges the two MapJoin tasks into their common 
> child: Union task. The two MapJoin tasks have the same alias name for their 
> big tables: $INTNAME, which is the name of the temporary table of a join 
> stream. The aliasToWork map uses alias as key, so eventually only the MapJoin 
> operator tree of one MapJoin task is saved into the aliasToWork map of the 
> Union task, while the MapJoin operator tree of another MapJoin task is lost. 
> As a result, Union operator won't be initialized because not all of its 
> parents gets intialized (The Union operator itself indicates it has two 
> parents, but actually it has only 1 parent because another parent is lost).
> This issue does not exist in HIVE 0.11 and thus is a regression bug in HIVE 
> 0.12.
> The propsed solution is to use the query ID as prefix for the join stream 
> name to avoid conflict and add sanity check code in CommonJoinTaskDispatcher 
> that merge of a MapJoin task into its child MapRed task is skipped if there 
> is any alias conflict. Please review the patch. I am not sure if the patch 
> properly handles the case of DemuxOperator.
> BTW, anyone knows the origin of "$INTNAME"? it is so confusing, maybe we can 
> replace it with a meaningful name.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Updated] (HIVE-5891) Alias conflict when merging multiple mapjoin tasks into their common child mapred task

Reply via email to