Re: Exception with SparkSql and Avro

2014-09-23 Thread Michael Armbrust
Can you show me the DDL you are using? Here is an example of a way I got
the Avro SerDe to work:
https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TestHive.scala#L246

Also, this isn't ready for primetime yet, but a quick plug for some ongoing
work: https://github.com/apache/spark/pull/2475
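
For the archive, the linked example boils down to DDL of roughly this shape
(a sketch; the episodes table, its columns, and the schema literal are just
illustrative):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("avro-ddl-sketch"))
  val hc = new HiveContext(sc)

  // An Avro-backed Hive table needs the Avro SerDe, the Avro container
  // input/output formats, and a reader schema the SerDe can find; the
  // CannotDetermineSchemaSentinel in the trace below is the placeholder
  // schema the SerDe substitutes when it cannot determine one.
  hc.sql("""CREATE TABLE episodes (title STRING, air_date STRING, doctor INT)
           |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
           |STORED AS
           |INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
           |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
           |TBLPROPERTIES ('avro.schema.literal'='{
           |  "type": "record", "name": "episodes", "namespace": "testing.hive.avro",
           |  "fields": [
           |    {"name": "title", "type": "string"},
           |    {"name": "air_date", "type": "string"},
           |    {"name": "doctor", "type": "int"}
           |  ]}')""".stripMargin)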

On Mon, Sep 22, 2014 at 10:07 PM, Zalzberg, Idan (Agoda) 
idan.zalzb...@agoda.com wrote:

  Hello,

 I am trying to read a Hive table that is stored in Avro DEFLATE files,
 with something simple like “SELECT * FROM X LIMIT 10”.
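
 In spark-shell terms that is roughly (a sketch; X stands in for the real
 table name, and sc is the SparkContext the shell provides):

   import org.apache.spark.sql.hive.HiveContext

   val hc = new HiveContext(sc)
   hc.sql("SELECT * FROM X LIMIT 10").collect().foreach(println)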

 I get 2 exceptions in the logs:



 2014-09-23 09:27:50,157 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 1.0 (TID 10, cl.local): org.apache.avro.AvroTypeException: Found com.a.bi.core.model.xxx.yyy, expecting org.apache.hadoop.hive.CannotDetermineSchemaSentinel, missing required field ERROR_ERROR_ERROR_ERROR_ERROR_ERROR_ERROR
 org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:231)
 org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
 org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:127)
 org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:176)
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
 org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
 org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
 org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
 org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:149)
 org.apache.hadoop.hive.ql.io.avro.AvroGenericRecordReader.next(AvroGenericRecordReader.java:52)
 org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:219)
 org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:188)
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.hasNext(InMemoryColumnarTableScan.scala:74)
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)

 2014-09-23 09:27:49,152 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 2.0 in stage 1.0 (TID 2, cl.local): org.apache.hadoop.hive.serde2.avro.BadSchemaException:
 org.apache.hadoop.hive.serde2.avro.AvroSerDe.deserialize(AvroSerDe.java:91)
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:279)
 org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:62)
 org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryColumnarTableScan.scala:50)
 org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)

 

Re: Exception with SparkSql and Avro

2014-09-23 Thread Zalzberg, Idan (Agoda)
Thanks,
I didn't create the tables myself as I have no control over that process.
However, these tables are read just fine using the JDBC connection to
HiveServer2, so it should be possible.
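
For what it's worth, the table DDL can be pulled over that same JDBC
connection, roughly like this (a sketch; the host, port, credentials, and
the table name X are placeholders):

  import java.sql.DriverManager

  // Assumes the HiveServer2 JDBC driver (org.apache.hive:hive-jdbc) is on
  // the classpath.
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(
    "jdbc:hive2://hiveserver2-host:10000/default", "user", "")
  val rs = conn.createStatement().executeQuery("SHOW CREATE TABLE X")
  while (rs.next()) println(rs.getString(1))  // one row per line of the CREATE TABLE statement
  conn.close()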
