Hive On Spark - ORC Table - Hive Streaming Mutation API

Benjamin Schaff Wed, 14 Sep 2016 10:29:52 -0700

Hi,

After several days of trying to figure out the problem, I'm stuck with a
ClassCastException when running a query with Hive on Spark against ORC tables
that I updated with the streaming mutation API of Hive 2.0.


The context is the following:

For hive:

The version is 2.1, the latest available from the website.
I wrote some Scala code to insert data into an ORC table with the
streaming mutation API, following the example provided somewhere in the Hive
repository.

The table looks like this:

+--------------------------------------------------------------------+--+
|                           createtab_stmt                           |
+--------------------------------------------------------------------+--+
| CREATE TABLE `hc__member`(                                         |
|   `rdv_core__key` bigint,                                          |
|   `rdv_core__domainkey` string,                                    |
|   `rdftypes` array<string>,                                        |
|   `rdv_org__firstname` string,                                     |
|   `rdv_org__middlename` string,                                    |
|   `rdv_org__lastname` string,                                      |
|   `rdv_org__gender` string,                                        |
|   `rdv_org__city` string,                                          |
|   `rdv_org__state` string,                                         |
|   `rdv_org__countrycode` string,                                   |
|   `rdv_org__addresslabel` string,                                  |
|   `rdv_org__zip` string)                                           |
| CLUSTERED BY (                                                     |
|   rdv_core__key)                                                   |
| INTO 24 BUCKETS                                                    |
| ROW FORMAT SERDE                                                   |
|   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
| STORED AS INPUTFORMAT                                              |
|   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
| OUTPUTFORMAT                                                       |
|   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
| LOCATION                                                           |
|   'hdfs://hmaster:8020/user/hive/warehouse/hc__member'             |
| TBLPROPERTIES (                                                    |
|   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',            |
|   'compactor.mapreduce.map.memory.mb'='2048',                      |
|   'compactorthreshold.hive.compactor.delta.num.threshold'='4',     |
|   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',   |
|   'numFiles'='0',                                                  |
|   'numRows'='0',                                                   |
|   'rawDataSize'='0',                                               |
|   'totalSize'='0',                                                 |
|   'transactional'='true',                                          |
|   'transient_lastDdlTime'='1473792939')                            |
+--------------------------------------------------------------------+--+
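
One thing that may be worth checking against this schema is whether any column has a non-primitive type, since the exception further down mentions an `OrcListObjectInspector` being cast to a `PrimitiveObjectInspector`. This is only a hint from the trace, not a confirmed diagnosis. A minimal Python sketch, assuming the column list is pasted in by hand (`COLUMNS`, `COMPLEX_PREFIXES` and `complex_columns` are illustrative names, not part of Hive):

```python
# Column names and types copied by hand from the SHOW CREATE TABLE output.
COLUMNS = """
`rdv_core__key` bigint,
`rdv_core__domainkey` string,
`rdftypes` array<string>,
`rdv_org__firstname` string,
`rdv_org__middlename` string,
`rdv_org__lastname` string,
`rdv_org__gender` string,
`rdv_org__city` string,
`rdv_org__state` string,
`rdv_org__countrycode` string,
`rdv_org__addresslabel` string,
`rdv_org__zip` string
"""

# Hive type names that are not primitive.
COMPLEX_PREFIXES = ("array<", "map<", "struct<", "uniontype<")


def complex_columns(columns_text):
    """Return (name, type) pairs for columns whose type is not primitive."""
    found = []
    for line in columns_text.splitlines():
        # Drop surrounding whitespace, the trailing comma and the backticks,
        # then split into column name and type.
        parts = line.strip().strip(",").replace("`", "").split(None, 1)
        if len(parts) != 2:
            continue  # blank or unparseable line
        name, col_type = parts
        if col_type.lower().startswith(COMPLEX_PREFIXES):
            found.append((name, col_type))
    return found


print(complex_columns(COLUMNS))  # prints [('rdftypes', 'array<string>')]
```

Here the only non-primitive column is `rdftypes`, so a query that avoids or includes that column could be a useful comparison.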

The hive-site.xml looks like this:

<configuration>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
    <name>spark.master</name>
    <value>spark://hmaster:7077</value>
  </property>
  <property>
    <name>spark.eventLog.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>spark.executor.memory</name>
    <value>12g</value>
  </property>
  <property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
  </property>
  <property>
    <name>mapreduce.input.fileinputformat.split.maxsize</name>
    <value>750000000</value>
  </property>
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.cbo.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication.min.reducer</name>
    <value>4</value>
  </property>
  <property>
    <name>hive.optimize.reducededuplication</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.orc.splits.include.file.footer</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.merge.mapfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.sparkfiles</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.merge.smallfiles.avgsize</name>
    <value>16000000</value>
  </property>
  <property>
    <name>hive.merge.size.per.task</name>
    <value>256000000</value>
  </property>
  <property>
    <name>hive.merge.orcfile.stripe.level</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.auto.convert.join.noconditionaltask.size</name>
    <value>894435328</value>
  </property>
  <property>
    <name>hive.optimize.bucketmapjoin.sortedmerge</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.map.aggr.hash.percentmemory</name>
    <value>0.5</value>
  </property>
  <property>
    <name>hive.map.aggr</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.optimize.sort.dynamic.partition</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.stats.autogather</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.stats.fetch.column.stats</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.vectorized.execution.reduce.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.checkinterval</name>
    <value>4096</value>
  </property>
  <property>
    <name>hive.vectorized.groupby.flush.percent</name>
    <value>0.1</value>
  </property>
  <property>
    <name>hive.compute.query.using.stats</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.limit.pushdown.memory.usage</name>
    <value>0.4</value>
  </property>
  <property>
    <name>hive.optimize.index.filter</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.reducers.bytes.per.reducer</name>
    <value>67108864</value>
  </property>
  <property>
    <name>hive.smbjoin.cache.rows</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.exec.orc.default.stripe.size</name>
    <value>67108864</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
  </property>
  <property>
    <name>hive.fetch.task.conversion.threshold</name>
    <value>1073741824</value>
  </property>
  <property>
    <name>hive.fetch.task.aggr</name>
    <value>false</value>
  </property>
  <property>
    <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
    <value>5</value>
  </property>
  <property>
    <name>spark.kryo.referenceTracking</name>
    <value>false</value>
  </property>
  <property>
    <name>spark.kryo.classesToRegister</name>
    <value>org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
  </property>
  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
  <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.compactor.worker.threads</name>
    <value>4</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
    <description>metadata is stored in a MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hadoop</value>
    <description>user name for connecting to mysql server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value></value>
    <description>password for connecting to mysql server</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <property>
    <name>hive.root.logger</name>
    <value>WARN,RFA</value>
  </property>
</configuration>
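
Because the configuration is long, it can help to pull out only the settings that touch the code path seen in the trace (execution engine, vectorization, transaction manager). A small Python sketch, assuming the file content is available as a string; `read_properties`, `SAMPLE` and `KEYS` are illustrative names, and `SAMPLE` is an abbreviated copy of the file above:

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Parse Hadoop-style <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            # An empty <value></value> yields None, so fall back to "".
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Abbreviated copy of the hive-site.xml shown above.
SAMPLE = """<configuration>
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
    <name>hive.vectorized.execution.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
</configuration>"""

# Settings that look relevant to the failing read path (an assumption).
KEYS = [
    "hive.execution.engine",
    "hive.vectorized.execution.enabled",
    "hive.vectorized.execution.reduce.enabled",
    "hive.txn.manager",
]

props = read_properties(SAMPLE)
for key in KEYS:
    print(key, "=", props.get(key, "<not set>"))
```

To run it against the real file, the string could be read with `open(path).read()`, where `path` is wherever hive-site.xml lives on the machine.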

Whenever I run a query involving Spark I get the following error:

java.io.IOException: java.io.IOException: error iterating
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
        at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
        at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
        at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:93)
        at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error iterating
        at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:92)
        at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:42)
        at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
        ... 18 more
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
        at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:311)
        at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.acidAddRowToBatch(VectorizedBatchUtil.java:291)
        at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:82)
        ... 20 more
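
When reading a nested trace like this one, the last "Caused by:" entry and the first frame under it are usually the most informative part. A short Python sketch that extracts both from pasted trace text; `cause_chain`, `first_frame_of_root_cause` and `SAMPLE` are illustrative names, and `SAMPLE` is an abbreviated copy of the trace above:

```python
def cause_chain(trace):
    """Return exception class names, outermost first and root cause last."""
    lines = trace.strip().splitlines()
    if not lines:
        return []
    # The first line is "<class>: <message>".
    chain = [lines[0].strip().split(":", 1)[0]]
    for line in lines[1:]:
        line = line.strip()
        if line.startswith("Caused by:"):
            rest = line[len("Caused by:"):].strip()
            chain.append(rest.split(":", 1)[0].strip())
    return chain


def first_frame_of_root_cause(trace):
    """Return the first stack frame after the last 'Caused by:' entry."""
    start = max(trace.rfind("Caused by:"), 0)
    # Splitting on whitespace makes this work even when the mail client
    # wrapped "at" and the frame onto separate lines.
    tokens = trace[start:].split()
    for i, token in enumerate(tokens[:-1]):
        if token == "at":
            return tokens[i + 1]
    return None


# Abbreviated copy of the trace from this message.
SAMPLE = """java.io.IOException: java.io.IOException: error iterating
        at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
Caused by: java.io.IOException: error iterating
        at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:92)
Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
        at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:311)
"""

print(cause_chain(SAMPLE))
print(first_frame_of_root_cause(SAMPLE))
```

For this trace the root cause is the `ClassCastException`, raised in `VectorizedBatchUtil.setVector`, which is called from the vectorized ORC ACID reader.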


By "involving Spark" I mean the following: a select * that does not run
against the Spark backend returns the data in the table, but a query with a
group by, for instance, or one that just extracts a column value, runs
through Spark and I get this exception.

I also tried the Streaming API, with both a custom writer and a JSON writer,
and hit the same problem.
I built the Spark distribution myself, removing the Hive-related dependencies,
so I don't think it comes from there.

Do you have any recommendations on how I can proceed to find the root cause of
this problem?

Thanks in advance.

PS: I made the mistake of posting on the dev mailing list earlier; please
ignore that message, and sorry for the double post.

Regards,
Benjamin Schaff
