Re: Hive On Spark - ORC Table - Hive Streaming Mutation API

Mich Talebzadeh Wed, 14 Sep 2016 10:55:58 -0700

Hi,

You are using Hive 2. Which Spark version runs as the Hive execution
engine?


I cannot see spark.home in your hive-site.xml so I cannot figure it out.

BTW you are using Spark standalone as the mode. I tend to use yarn-client.

Now back to the above issue. Do other queries work OK with Hive on Spark?
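
One way to narrow that down, sketched here as an assumption rather than a
confirmed diagnosis: the trace below fails inside VectorizedOrcAcidRowReader
while casting a list object inspector to a primitive one, the table has an
array<string> column (rdftypes), and your hive-site.xml sets
hive.vectorized.execution.enabled to true. Turning vectorization off for one
session and rerunning a failing query would show whether the vectorized ACID
read path is involved. The query is only an illustration built from the
posted column names.

```sql
-- Diagnostic only: disable vectorized execution for this session.
set hive.vectorized.execution.enabled=false;

-- Illustrative query (hypothetical); any group-by that runs on Spark will do.
select rdv_org__state, count(*)
from hc__member
group by rdv_org__state;
```

If the query succeeds with vectorization off, that points at the vectorized
reader rather than at the Spark setup.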

Some of those performance parameters can be set in the Hive session itself
or through an init file:

set spark.home=/usr/lib/spark-1.6.2-bin-hadoop2.6;
set spark.master=yarn;
set spark.deploy.mode=client;
set spark.executor.memory=8g;
set spark.driver.memory=8g;
set spark.executor.instances=6;
set spark.ui.port=7777;
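
As a sketch of the init-file route: the Hive CLI accepts an initialization
script through -i, and it also reads $HOME/.hiverc at startup if that file
exists. The file name below is hypothetical, and the values are the same
examples as above, not recommendations for your cluster.

```sql
-- hive_on_spark_init.hql (hypothetical file name)
-- Start the CLI with:  hive -i hive_on_spark_init.hql
set hive.execution.engine=spark;
set spark.master=yarn;
set spark.executor.memory=8g;
set spark.executor.instances=6;
```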


HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 18:28, Benjamin Schaff <[email protected]>
wrote:

> Hi,
>
> After several days of trying to figure out the problem, I'm stuck with a
> ClassCastException when running a query with Hive on Spark on ORC tables
> that I updated with the streaming mutation API of Hive 2.0.
>
> The context is the following:
>
> For hive:
>
> The version is the latest available from the website, 2.1.
> I wrote some Scala code to insert data into an ORC table with the
> streaming mutation API, following the example provided somewhere in the
> Hive repository.
>
> The table looks like this:
>
> +--------------------------------------------------------------------+--+
> |                           createtab_stmt                           |
> +--------------------------------------------------------------------+--+
> | CREATE TABLE `hc__member`(                                         |
> |   `rdv_core__key` bigint,                                          |
> |   `rdv_core__domainkey` string,                                    |
> |   `rdftypes` array<string>,                                        |
> |   `rdv_org__firstname` string,                                     |
> |   `rdv_org__middlename` string,                                    |
> |   `rdv_org__lastname` string,                                      |
> |   `rdv_org__gender` string,                                        |
> |   `rdv_org__city` string,                                          |
> |   `rdv_org__state` string,                                         |
> |   `rdv_org__countrycode` string,                                   |
> |   `rdv_org__addresslabel` string,                                  |
> |   `rdv_org__zip` string)                                           |
> | CLUSTERED BY (                                                     |
> |   rdv_core__key)                                                   |
> | INTO 24 BUCKETS                                                    |
> | ROW FORMAT SERDE                                                   |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                      |
> | STORED AS INPUTFORMAT                                              |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                |
> | OUTPUTFORMAT                                                       |
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'               |
> | LOCATION                                                           |
> |   'hdfs://hmaster:8020/user/hive/warehouse/hc__member'             |
> | TBLPROPERTIES (                                                    |
> |   'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}',            |
> |   'compactor.mapreduce.map.memory.mb'='2048',                      |
> |   'compactorthreshold.hive.compactor.delta.num.threshold'='4',     |
> |   'compactorthreshold.hive.compactor.delta.pct.threshold'='0.5',   |
> |   'numFiles'='0',                                                  |
> |   'numRows'='0',                                                   |
> |   'rawDataSize'='0',                                               |
> |   'totalSize'='0',                                                 |
> |   'transactional'='true',                                          |
> |   'transient_lastDdlTime'='1473792939')                            |
> +--------------------------------------------------------------------+--+
>
> The hive-site.xml looks like this:
>
> <configuration>
>  <property>
>     <name>hive.execution.engine</name>
>     <value>spark</value>
>   </property>
>   <property>
>     <name>spark.master</name>
>     <value>spark://hmaster:7077</value>
>   </property>
>   <property>
>     <name>spark.eventLog.enabled</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>spark.executor.memory</name>
>     <value>12g</value>
>   </property>
>   <property>
>     <name>spark.serializer</name>
>     <value>org.apache.spark.serializer.KryoSerializer</value>
>   </property>
>   <property>
>     <name>mapreduce.input.fileinputformat.split.maxsize</name>
>     <value>750000000</value>
>   </property>
>   <property>
>     <name>hive.vectorized.execution.enabled</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.cbo.enable</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.optimize.reducededuplication.min.reducer</name>
>     <value>4</value>
>   </property>
>   <property>
>     <name>hive.optimize.reducededuplication</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.orc.splits.include.file.footer</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.merge.mapfiles</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.merge.sparkfiles</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.merge.smallfiles.avgsize</name>
>     <value>16000000</value>
>   </property>
>   <property>
>     <name>hive.merge.size.per.task</name>
>     <value>256000000</value>
>   </property>
>   <property>
>     <name>hive.merge.orcfile.stripe.level</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.auto.convert.join</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.auto.convert.join.noconditionaltask</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.auto.convert.join.noconditionaltask.size</name>
>     <value>894435328</value>
>   </property>
>   <property>
>     <name>hive.optimize.bucketmapjoin.sortedmerge</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.map.aggr.hash.percentmemory</name>
>     <value>0.5</value>
>   </property>
>   <property>
>     <name>hive.map.aggr</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.optimize.sort.dynamic.partition</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.stats.autogather</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.stats.fetch.column.stats</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.vectorized.execution.reduce.enabled</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.vectorized.groupby.checkinterval</name>
>     <value>4096</value>
>   </property>
>   <property>
>     <name>hive.vectorized.groupby.flush.percent</name>
>     <value>0.1</value>
>   </property>
>   <property>
>     <name>hive.compute.query.using.stats</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.limit.pushdown.memory.usage</name>
>     <value>0.4</value>
>   </property>
>   <property>
>     <name>hive.optimize.index.filter</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.exec.reducers.bytes.per.reducer</name>
>     <value>67108864</value>
>   </property>
>   <property>
>     <name>hive.smbjoin.cache.rows</name>
>     <value>10000</value>
>   </property>
>   <property>
>     <name>hive.exec.orc.default.stripe.size</name>
>     <value>67108864</value>
>   </property>
>   <property>
>     <name>hive.fetch.task.conversion</name>
>     <value>more</value>
>   </property>
>   <property>
>     <name>hive.fetch.task.conversion.threshold</name>
>     <value>1073741824</value>
>   </property>
>   <property>
>     <name>hive.fetch.task.aggr</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapreduce.input.fileinputformat.list-status.num-threads</name>
>     <value>5</value>
>   </property>
>   <property>
>     <name>spark.kryo.referenceTracking</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>spark.kryo.classesToRegister</name>
>     <value>org.apache.hadoop.hive.ql.io.HiveKey,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch</value>
>   </property>
>   <property>
>     <name>hadoop.proxyuser.hive.groups</name>
>     <value>*</value>
>   </property>
>   <property>
>     <name>hadoop.proxyuser.hive.hosts</name>
>     <value>*</value>
>   </property>
>   <property>
>     <name>hive.server2.enable.doAs</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>hive.server2.authentication</name>
>     <value>NONE</value>
>   </property>
>   <property>
>     <name>hive.support.concurrency</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.exec.dynamic.partition.mode</name>
>     <value>nonstrict</value>
>   </property>
>   <property>
>     <name>hive.txn.manager</name>
>     <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
>   </property>
>   <property>
>     <name>hive.compactor.initiator.on</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>hive.compactor.worker.threads</name>
>     <value>4</value>
>   </property>
>   <property>
>       <name>javax.jdo.option.ConnectionURL</name>
>       <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
>       <description>metadata is stored in a MySQL server</description>
>    </property>
>    <property>
>       <name>javax.jdo.option.ConnectionDriverName</name>
>       <value>com.mysql.jdbc.Driver</value>
>       <description>MySQL JDBC driver class</description>
>    </property>
>    <property>
>       <name>javax.jdo.option.ConnectionUserName</name>
>       <value>hadoop</value>
>       <description>user name for connecting to mysql server</description>
>    </property>
>    <property>
>       <name>javax.jdo.option.ConnectionPassword</name>
>       <value></value>
>       <description>password for connecting to mysql server</description>
>    </property>
>    <property>
>     <name>hive.metastore.uris</name>
>     <value>thrift://localhost:9083</value>
>   </property>
>   <property>
>     <name>hive.root.logger</name>
>     <value>WARN,RFA</value>
>   </property>
> </configuration>
>
> Whenever I run a query involving Spark I get the following error:
>
> java.io.IOException: java.io.IOException: error iterating
>       at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
>       at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:77)
>       at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:355)
>       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
>       at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
>       at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:116)
>       at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:246)
>       at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:208)
>       at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>       at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>       at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:29)
>       at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:93)
>       at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
>       at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>       at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>       at org.apache.spark.scheduler.Task.run(Task.scala:89)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error iterating
>       at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:92)
>       at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:42)
>       at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:350)
>       ... 18 more
> Caused by: java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct$OrcListObjectInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
>       at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.setVector(VectorizedBatchUtil.java:311)
>       at org.apache.hadoop.hive.ql.exec.vector.VectorizedBatchUtil.acidAddRowToBatch(VectorizedBatchUtil.java:291)
>       at org.apache.hadoop.hive.ql.io.orc.VectorizedOrcAcidRowReader.next(VectorizedOrcAcidRowReader.java:82)
>       ... 20 more
>
>
> By "involving Spark" I mean this: a select * that does not run against the
> Spark backend shows the data in the table, but a query with a group by, or
> one that just extracts a column value, runs through Spark and I get this
> exception.
>
> I also tried the Streaming API, with both a custom writer and a JSON
> writer, and hit the same problem.
> I built the Spark distribution myself, removing the Hive-related
> dependencies, so I don't think the problem comes from there.
>
> Do you have any recommendations on how I can proceed to find the root
> cause of this problem?
>
> Thanks in advance.
>
> PS: I made the mistake of posting on the dev mailing list earlier; please
> ignore that post, and sorry for the double post.
>
> Regards,
> Benjamin Schaff
>
>
