Had an offline discussion with Jianshi; the dataset was generated by Pig.

Jianshi - Could you please attach the output of "parquet-schema <path-to-parquet-file>"? I guess this is a Parquet format backwards-compatibility issue. Parquet didn't standardize the representation of LIST and MAP until recently, so many systems made their own choices and are not easily interoperable. In its early days, Spark SQL used LIST and MAP formats similar to Avro's, which unfortunately were not chosen as the current standard. Details can be found here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md. That document also defines the backwards-compatibility rules for handling legacy Parquet data written by old Parquet implementations in various systems.
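
For reference, the standard LIST layout defined there would make a column like your ip_list (array<struct<ip:string>>) look roughly like this (a sketch; the exact field repetitions depend on the writer and on nullability):

    optional group ip_list (LIST) {
      repeated group list {
        optional group element {
          optional binary ip (UTF8);
        }
      }
    }

A legacy writer may instead use a two-level layout, e.g. a repeated group placed directly under ip_list (its name varies by system), which is exactly what the compatibility rules have to recognize on read.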

So ideally, Spark SQL should now always write data following the standard and implement all the backwards-compatibility rules to read legacy data. The JIRA issue for this is https://issues.apache.org/jira/browse/SPARK-6774.

I'm working on a PR for this: https://github.com/apache/spark/pull/5422. To fix SPARK-6774, we need to implement the backwards-compatibility rules in both the record converter and the schema converter. The PR has fixed the former, but I still need some time to finish the latter and add tests.
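
To give a rough idea of the schema-converter side, here is a simplified sketch (not the actual PR code; it uses the parquet-mr schema API and checks only the standard three-level layout, with everything else falling back to the legacy rules):

    import parquet.schema.GroupType
    import parquet.schema.Type.Repetition.REPEATED

    // Simplified compatibility check: does a group annotated as LIST use the
    // standard 3-level layout (a repeated group "list" wrapping exactly one
    // "element" field), as opposed to one of the legacy 2-level layouts
    // written by older implementations?
    def isStandardThreeLevelList(listType: GroupType): Boolean = {
      listType.getFieldCount == 1 && {
        val repeated = listType.getType(0)
        repeated.getRepetition == REPEATED &&
          !repeated.isPrimitive &&
          repeated.getName == "list" &&
          repeated.asGroupType().getFieldCount == 1 &&
          repeated.asGroupType().getType(0).getName == "element"
      }
    }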

Cheng

On 4/25/15 2:22 AM, Yin Huai wrote:
oh, I missed that. It is fixed in 1.3.0.

Also, Jianshi, the dataset was not generated by Spark SQL, right?

On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:

    Yin:
    Fix Version of SPARK-4520 is not set.
    I assume it was fixed in 1.3.0

    Cheers

    On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai <yh...@databricks.com> wrote:

        The exception looks like the one mentioned in
        https://issues.apache.org/jira/browse/SPARK-4520. What is the
        version of Spark?

        On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:

            Hi,

            My data looks like this:

            +-----------+----------------------------+----------+
            | col_name  | data_type                  | comment  |
            +-----------+----------------------------+----------+
            | cust_id   | string                     |          |
            | part_num  | int                        |          |
            | ip_list   | array<struct<ip:string>>   |          |
            | vid_list  | array<struct<vid:string>>  |          |
            | fso_list  | array<struct<fso:string>>  |          |
            | src       | string                     |          |
            | date      | int                        |          |
            +-----------+----------------------------+----------+
            And when I did select *, it reported a ParquetDecodingException.
            Is this type not supported in Spark SQL?
            The detailed error message is here:

            Error: org.apache.spark.SparkException: Job aborted due to stage
            failure: Task 0 in stage 27.0 failed 4 times, most recent failure:
            Lost task 0.3 in stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com):
            parquet.io.ParquetDecodingException: Can not read value at 0 in
            block -1 in file hdfs://xxx/part-m-00000.gz.parquet
              at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
              at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
              at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
              at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
              at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
              at scala.collection.Iterator$class.foreach(Iterator.scala:727)
              at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
              at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
              at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
              at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
              at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
              at scala.collection.AbstractIterator.to(Iterator.scala:1157)
              at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
              at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
              at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
              at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
              at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
              at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
              at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
              at org.apache.spark.scheduler.Task.run(Task.scala:64)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:724)
            Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
              at java.util.ArrayList.elementData(ArrayList.java:400)
              at java.util.ArrayList.get(ArrayList.java:413)
              at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
              at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
              at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
              at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
              at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:290)
              at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
              at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
              at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
              at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
              at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
              at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)


            --
            Jianshi Huang

            LinkedIn: jianshi
            Twitter: @jshuang
            Github & Blog: http://huangjs.github.com/




