You are probably using an encoding that we don't support. I think this PR may be adding that support: https://github.com/apache/spark/pull/5422
On Sat, Apr 18, 2015 at 5:40 PM, Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote:

> I have created a bunch of protobuf-based parquet files that I want to
> read/inspect using Spark SQL. However, I am running into exceptions and
> am not able to proceed much further.
>
> This load succeeds (probably because there is no action yet). I can also
> printSchema() and count() without any issues:
>
> scala> val df = sqlContext.load("my_root_dir/201504101000", "parquet")
>
> scala> df.select(df("summary")).first
>
> 15/04/18 17:03:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 5.0 (TID 27, xxx.yyy.com): parquet.io.ParquetDecodingException: Can not
> read value at 0 in block -1 in file
> hdfs://xxx.yyy.com:8020/my_root_dir/201504101000/00000.parquet
>     at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>     at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>     at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>     at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>     at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>     at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>     at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>     at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>     at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>     at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>     at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.io.ParquetDecodingException: The requested schema is
> not compatible with the file schema. incompatible types: optional group …
>
> I could convert my protos into JSON and then back to parquet, but that
> seems wasteful!
>
> Also, I will be happy to contribute and make protobuf work with Spark SQL
> if I can get some guidance/help/pointers. Help appreciated.
>
> -Abhishek-
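For reference, the JSON round-trip workaround Abhishek mentions could look roughly like this in the Spark 1.3-era SQLContext API. This is only a sketch: it assumes the protobuf records have already been dumped as newline-delimited JSON files, and the paths are placeholders, not the actual layout.

```
// Sketch only: assumes the protos were first serialized to
// newline-delimited JSON under my_root_dir/json_dump/ (hypothetical path).

// Spark infers a schema from the JSON records.
scala> val jsonDF = sqlContext.jsonFile("my_root_dir/json_dump/")

// Rewriting through Spark's own Parquet writer produces files whose
// schema Spark SQL can read back, sidestepping the incompatible
// protobuf-written schema.
scala> jsonDF.saveAsParquetFile("my_root_dir/parquet_rewritten/")

// The rewritten files should then load and query without the
// ParquetDecodingException above.
scala> val df = sqlContext.load("my_root_dir/parquet_rewritten/", "parquet")
scala> df.select(df("summary")).first
```

As the original mail notes, this is wasteful (a full extra serialization pass over the data), which is why native protobuf-encoding support in Spark's Parquet reader is the preferable fix.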