[jira] [Created] (HIVE-6021) Problem in GroupByOperator for handling distinct aggregations

Sun Rui (JIRA) Thu, 12 Dec 2013 03:32:49 -0800

Sun Rui created HIVE-6021:
-----------------------------

             Summary: Problem in GroupByOperator for handling distinct aggregations
                 Key: HIVE-6021
                 URL: https://issues.apache.org/jira/browse/HIVE-6021
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.12.0
            Reporter: Sun Rui
            Assignee: Sun Rui



Use the following test case with Hive 0.12:

{code:sql}
create table src(key int, value string);
load data local inpath 'src/data/files/kv1.txt' overwrite into table src;
set hive.map.aggr=false; 
select count(key),count(distinct value) from src group by key;
{code}

We will get an ArrayIndexOutOfBoundsException from GroupByOperator:
{code}
java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:485)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 5 more
Caused by: java.lang.RuntimeException: Reduce operator initialization failed
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:159)
        ... 10 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.hadoop.hive.ql.exec.GroupByOperator.initializeOp(GroupByOperator.java:281)
        at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:377)
        at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.configure(ExecReducer.java:152)
        ... 10 more
{code}

Plan produced by: explain select count(key),count(distinct value) from src group by key;
{code}
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src 
          TableScan
            alias: src
            Select Operator
              expressions:
                    expr: key
                    type: int
                    expr: value
                    type: string
              outputColumnNames: key, value
              Reduce Output Operator
                key expressions:
                      expr: key
                      type: int
                      expr: value
                      type: string
                sort order: ++
                Map-reduce partition columns:
                      expr: key
                      type: int
                tag: -1
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(KEY._col0)   // This parameter causes the problem
                            ^^^^^^^^^
                expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: complete
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

The root cause is in GroupByOperator.initializeOp(). The method does not handle
the following case: the query has distinct aggregations, and one aggregation
function takes a parameter that is a group-by key column but not a distinct key
column. Such a parameter has the form KEY._col0, with no :tag segment, so
name.split("\\:")[1] in the code below fails.

{code}
        if (unionExprEval != null) {
          String[] names = parameters.get(j).getExprString().split("\\.");
          // parameters of the form : KEY.colx:t.coly
          if (Utilities.ReduceField.KEY.name().equals(names[0])) {
            String name = names[names.length - 2];
            int tag = Integer.parseInt(name.split("\\:")[1]);
            
            ...
            
          } else {
            // will be VALUE._COLx
            if (!nonDistinctAggrs.contains(i)) {
              nonDistinctAggrs.add(i);
            }
          }
{code}
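
The failing parse can be reproduced outside Hive with plain string handling. The sketch below is only an illustration: the class DistinctParamParsing and both method names are hypothetical, tagAsInIssue copies the three parsing steps from the snippet above, and tagIfPresent shows one possible guard, which is an assumption and not necessarily how the operator should be fixed.

```java
import java.util.OptionalInt;

// Hypothetical stand-alone class; it does not use or reproduce any Hive API.
class DistinctParamParsing {

  // Same steps as the quoted snippet: assumes the form KEY.colx:t.coly.
  static int tagAsInIssue(String exprString) {
    String[] names = exprString.split("\\.");
    String name = names[names.length - 2];
    // For "KEY._col0", name is "KEY" and the split yields one element,
    // so index 1 is out of bounds.
    return Integer.parseInt(name.split("\\:")[1]);
  }

  // Assumed guard: only read a tag when the segment really contains ':'.
  static OptionalInt tagIfPresent(String exprString) {
    String[] names = exprString.split("\\.");
    if (names.length < 2) {
      return OptionalInt.empty();
    }
    String[] parts = names[names.length - 2].split("\\:");
    if (parts.length < 2) {
      return OptionalInt.empty(); // group-by key column, not a distinct key column
    }
    return OptionalInt.of(Integer.parseInt(parts[1]));
  }

  public static void main(String[] args) {
    // Distinct key column: the tag parses normally.
    System.out.println(tagAsInIssue("KEY._col1:0._col0"));
    // Group-by key column: the guarded version reports that no tag exists.
    System.out.println(tagIfPresent("KEY._col0").isPresent());
    try {
      tagAsInIssue("KEY._col0");
    } catch (ArrayIndexOutOfBoundsException e) {
      // The message text depends on the JDK version.
      System.out.println("ArrayIndexOutOfBoundsException: " + e.getMessage());
    }
  }
}
```

The two inputs are the parameter strings from the explain output above: KEY._col1:0._col0 belongs to count(DISTINCT ...) and carries a tag, while KEY._col0 belongs to count(key) and does not.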




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
