Oh, I missed that. It is fixed in 1.3.0. Also, Jianshi, the dataset was not generated by Spark SQL, right?
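If it helps to retest, here is a minimal sketch for a 1.3.0 spark-shell (where sc is predefined); the HDFS path stands in for the elided one in the report below:

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  println(sc.version)  // the SPARK-4520 fix needs 1.3.0 or later

  val df = sqlContext.parquetFile("hdfs://xxx/part-m-00000.gz.parquet")
  df.printSchema()     // should show the array<struct<...>> columns
  df.take(1)           // the read that previously threw ParquetDecodingException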
On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Yin:
> Fix Version of SPARK-4520 is not set.
> I assume it was fixed in 1.3.0
>
> Cheers
>
> On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai <yh...@databricks.com> wrote:
>
>> The exception looks like the one mentioned in
>> https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
>> Spark?
>>
>> On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang <jianshi.hu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> My data looks like this:
>>>
>>> +-----------+----------------------------+----------+
>>> | col_name  | data_type                  | comment  |
>>> +-----------+----------------------------+----------+
>>> | cust_id   | string                     |          |
>>> | part_num  | int                        |          |
>>> | ip_list   | array<struct<ip:string>>   |          |
>>> | vid_list  | array<struct<vid:string>>  |          |
>>> | fso_list  | array<struct<fso:string>>  |          |
>>> | src       | string                     |          |
>>> | date      | int                        |          |
>>> +-----------+----------------------------+----------+
>>>
>>> And when I ran a "select *" on it, it reported a ParquetDecodingException.
>>>
>>> Is this type not supported in Spark SQL?
>>>
>>> Detailed error message here:
>>>
>>> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3
>>> in stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com):
>>> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1
>>> in file hdfs://xxx/part-m-00000.gz.parquet
>>>   at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>>>   at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>>>   at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
>>>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>>>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>>>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>   at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>>   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>>   at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>>   at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
>>>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
>>>   at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>>>   at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
>>>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>>   at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>   at java.lang.Thread.run(Thread.java:724)
>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
>>>   at java.util.ArrayList.elementData(ArrayList.java:400)
>>>   at java.util.ArrayList.get(ArrayList.java:413)
>>>   at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
>>>   at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
>>>   at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
>>>   at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
>>>   at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:290)
>>>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>>>   at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>>>   at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>>>   at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>>>   at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
>>>   at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>
>
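P.S. To be clear on the quoted question "Is this type not supported in Spark SQL?": the nested type itself is supported; the failure appears specific to files written outside Spark SQL (hence my question above), as tracked in SPARK-4520. A minimal sketch, assuming a Spark 1.3.x spark-shell and illustrative names/paths, showing that array<struct<...>> round-trips when Spark SQL writes the file itself:

  import org.apache.spark.sql.SQLContext

  // Illustrative schema mirroring the ip_list column from the report.
  case class IpEntry(ip: String)
  case class Record(cust_id: String, ip_list: Seq[IpEntry])

  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  // Write a Parquet file containing an array<struct<ip:string>> column.
  val df = sc.parallelize(Seq(Record("c1", Seq(IpEntry("10.0.0.1"))))).toDF()
  df.saveAsParquetFile("/tmp/nested-test.parquet")

  // Reading back a Spark-written file with the same nested type works.
  sqlContext.parquetFile("/tmp/nested-test.parquet").collect()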