Had an offline discussion with Jianshi; the dataset was generated by Pig.
Jianshi - Could you please attach the output of "parquet-schema
<path-to-parquet-file>"?
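(In case parquet-schema isn't on your path: it should be equivalent to
parquet-tools' schema command, e.g. "hadoop jar
parquet-tools-<version>.jar schema hdfs://xxx/part-m-00000.gz.parquet";
the jar name/version here is illustrative.)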
I guess this is a Parquet format backwards-compatibility issue. Parquet
didn't standardize the representation of LIST and MAP until recently, so
many systems made their own choices, and those layouts don't easily
interoperate. In its early days, Spark SQL used LIST and MAP layouts
similar to Avro's, which were unfortunately not chosen as the current
standard format. Details can be found here:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
This document also defines backwards-compatibility rules to handle
legacy Parquet data written by old Parquet implementations in various
systems.
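To make this concrete, here is a hand-written illustration (not taken
from the actual file) of how a column like ip_list:
array<struct<ip:string>> can be laid out. The standard representation is
the 3-level structure:

    optional group ip_list (LIST) {
      repeated group list {
        optional group element {
          optional binary ip (UTF8);
        }
      }
    }

while a legacy Avro-style writer produces a 2-level structure in which
the repeated group (named "array") is itself the element:

    optional group ip_list (LIST) {
      repeated group array {
        optional binary ip (UTF8);
      }
    }

The backwards-compatibility rules in LogicalTypes.md tell readers how to
recognize such variants. (The field repetitions shown are illustrative;
they depend on the writer and on the nullability of the data.)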
So ideally, Spark SQL should now always write data following the
standard, and implement all the backwards-compatibility rules needed to
read legacy data. I'm working on a PR for this:
https://github.com/apache/spark/pull/5422. To fix this issue, we need
backwards-compatibility rules in both the record converter and the
schema converter. The PR has fixed the former, but I still need some
time to finish the latter and add tests.
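For reference, the schema-side detection that those rules require boils
down to something like the following sketch (illustrative only, not the
actual code in the PR; parquet.schema is the pre-org.apache package name
that Spark 1.x builds against):

    import parquet.schema.Type

    // Given the repeated field nested directly inside a LIST-annotated group,
    // decide whether that field is itself the element type (legacy 2-level
    // layouts) or the standard 3-level wrapper whose single child is the element.
    def repeatedFieldIsElement(repeated: Type, listFieldName: String): Boolean =
      repeated.isPrimitive ||                        // primitive => it is the element
      repeated.asGroupType.getFieldCount > 1 ||      // multi-field group => it is the element
      repeated.getName == "array" ||                 // legacy name (e.g. parquet-avro)
      repeated.getName == s"${listFieldName}_tuple"  // legacy name (e.g. parquet-thrift)
      // otherwise: standard layout; the group's single child is the element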
Cheng
On 4/25/15 2:22 AM, Yin Huai wrote:
oh, I missed that. It is fixed in 1.3.0.
Also, Jianshi, the dataset was not generated by Spark SQL, right?
On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
Yin:
Fix Version of SPARK-4520 is not set.
I assume it was fixed in 1.3.0
Cheers
On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai <yh...@databricks.com> wrote:
The exception looks like the one mentioned in
https://issues.apache.org/jira/browse/SPARK-4520. What is the
version of Spark?
On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang
<jianshi.hu...@gmail.com> wrote:
Hi,
My data looks like this:
+-----------+----------------------------+----------+
| col_name | data_type | comment |
+-----------+----------------------------+----------+
| cust_id | string | |
| part_num | int | |
| ip_list | array<struct<ip:string>> | |
| vid_list | array<struct<vid:string>> | |
| fso_list | array<struct<fso:string>> | |
| src | string | |
| date | int | |
+-----------+----------------------------+----------+
And when I did a select *, it reported a ParquetDecodingException. Is
this type not supported in Spark SQL?
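(For context, roughly what the query looked like from spark-shell; the
table name below is a placeholder, Spark 1.x API:)

    // Hypothetical reproduction; "my_table" and the hdfs path are placeholders.
    val df = sqlContext.table("my_table")
    // or: val df = sqlContext.parquetFile("hdfs://xxx/part-m-00000.gz.parquet")
    df.select("*").collect()  // throws ParquetDecodingException while assembling rows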
Detailed error message here:
Error: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 27.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com):
parquet.io.ParquetDecodingException: Can not read value at 0 in block
-1 in file hdfs://xxx/part-m-00000.gz.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:400)
    at java.util.ArrayList.get(ArrayList.java:413)
    at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
    at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
    at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
    at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
    at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:290)
    at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
    at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
    at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
    at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/