Rajat Jain created HIVE-11069:
---------------------------------

             Summary: ColumnStatsTask doesn't work with hive.exec.parallel
                 Key: HIVE-11069
                 URL: https://issues.apache.org/jira/browse/HIVE-11069
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.2.0
            Reporter: Rajat Jain


Try a simple query:

{code}
hive> set hive.exec.parallel=true;
hive> analyze table src compute statistics for columns;
{code}

It fails with errors similar to:

{code}
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.ColumnStatsTask
hive> java.lang.RuntimeException: Error caching map.xml: java.io.IOException: 
java.lang.InterruptedException
        at 
org.apache.hadoop.hive.ql.exec.Utilities.setBaseWork(Utilities.java:747)
        at 
org.apache.hadoop.hive.ql.exec.Utilities.setMapWork(Utilities.java:682)
        at 
org.apache.hadoop.hive.ql.exec.Utilities.setMapRedWork(Utilities.java:674)
        at 
org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:375)
        at 
org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:137)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
        at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:75)
Caused by: java.io.IOException: java.lang.InterruptedException
        at org.apache.hadoop.ipc.Client.call(Client.java:1450)
        at org.apache.hadoop.ipc.Client.call(Client.java:1402)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
        at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2758)
        at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2729)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
        at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
        at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1954)
        at 
org.apache.hadoop.hive.ql.exec.Utilities.setPlanPath(Utilities.java:765)
        at 
org.apache.hadoop.hive.ql.exec.Utilities.setBaseWork(Utilities.java:691)
        ... 7 more
Caused by: java.lang.InterruptedException
        at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
        at java.util.concurrent.FutureTask.get(FutureTask.java:187)
        at 
org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1049)
        at org.apache.hadoop.ipc.Client.call(Client.java:1444)
        ... 28 more
Job Submission failed with exception 'java.lang.RuntimeException(Error caching 
map.xml: java.io.IOException: java.lang.InterruptedException)'
{code}

The problem is the Column Stats Task doesn't depend on the root task which 
causes errors. Here's the explain output:

{code}
hive> explain analyze table src compute statistics for columns;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
  Stage-1 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: src
            Select Operator
              expressions: key (type: string), value (type: string)
              outputColumnNames: key, value
              Group By Operator
                aggregations: compute_stats(key, 16), compute_stats(value, 16)
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  sort order:
                  value expressions: _col0 (type: 
struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>),
 _col1 (type: 
struct<columntype:string,maxlength:bigint,sumlength:bigint,count:bigint,countnulls:bigint,bitvector:string,numbitvectors:int>)
      Reduce Operator Tree:
        Group By Operator
          aggregations: compute_stats(VALUE._col0), compute_stats(VALUE._col1)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-1
    Column Stats Work
      Column Stats Desc:
          Columns: key, value
          Column Types: string, string
          Table: default.src

Time taken: 0.761 seconds, Fetched: 39 row(s)
{code}

For reference, here's the corresponding output in Hive 0.13:

{code}
hive> explain analyze table orders compute statistics for columns;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage
  Stage-1 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-0
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: orders
            Statistics: Num rows: 0 Data size: 13310103552 Basic stats: PARTIAL 
Column stats: COMPLETE

  Stage: Stage-1
    Stats-Aggr Operator

Time taken: 2.142 seconds, Fetched: 15 row(s)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to