Thanks,
I didn't create the tables myself, as I have no control over that process.
However, these tables read just fine over the JDBC connection to
HiveServer2, so it should be possible.
On Sep 24, 2014 12:48 AM, Michael Armbrust <mich...@databricks.com> wrote:
Can you show me the DDL you are using? Here is an example of a way I got the
avro serde to work:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L246
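For reference, the DDL at that link takes roughly the following shape (table and field names here are illustrative, not from your schema):

```sql
-- Illustrative sketch: the AvroSerDe needs the Avro schema declared
-- explicitly (via avro.schema.literal or avro.schema.url). If it cannot
-- determine a schema, it falls back to the CannotDetermineSchemaSentinel
-- record seen in the exception below.
CREATE TABLE episodes
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "namespace": "testing.hive.avro.serde",
  "name": "episodes",
  "type": "record",
  "fields": [
    { "name": "title",    "type": "string", "doc": "episode title" },
    { "name": "air_date", "type": "string", "doc": "initial date" },
    { "name": "doctor",   "type": "int",    "doc": "main actor playing the Doctor" }
  ]
}')
```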
Also, this isn't ready for primetime yet, but a quick plug for some ongoing
work: https://github.com/apache/spark/pull/2475
On Mon, Sep 22, 2014 at 10:07 PM, Zalzberg, Idan (Agoda)
<idan.zalzb...@agoda.com> wrote:
Hello,
I am trying to read a Hive table that is stored as Avro DEFLATE files.
When I run something simple like “SELECT * FROM X LIMIT 10”,
I get 2 exceptions in the logs:
2014-09-23 09:27:50,157 WARN org.apache.spark.scheduler.TaskSetManager: Lost
task 10.0 in stage 1.0 (TID 10, cl.local): org.apache.avro.AvroTypeException:
Found com.a.bi.core.model.xxx.yyy, expecting
org.apache.hadoop.hive.CannotDetermineSchemaSentinel, missing required field
ERROR_ERROR_ERROR_ERROR_ERROR_ERROR_ERROR
org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:176)
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryColumnarTableScan.scala:74)
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
2014-09-23 09:27:49,152 WARN org.apache.spark.scheduler.TaskSetManager: Lost
task 2.0 in stage 1.0 (TID 2, cl.local):
org.apache.hadoop.hive.serde2.avro.BadSchemaException:
org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:91)
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:279)
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:62)
org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)