(Forgot to cc the user mailing list)

On 11/16/14 4:59 PM, Cheng Lian wrote:
Hey Sadhan,

Thanks for the additional information, this is helpful. It seems that some internal Parquet contract was broken, but I'm not sure whether it was caused by Spark SQL or by Parquet, or whether the Parquet file itself was somehow damaged. I'm investigating this. In the meantime, would you mind helping to narrow down the problem by trying to scan exactly the same Parquet file with some other system (e.g. Hive or Impala)? If other systems can read it, then something must be wrong with Spark SQL.
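
If it helps, here is a rough, untested sketch of how the same file could be read directly with parquet-mr from a Scala shell, bypassing Spark SQL entirely (the path below is just a placeholder, and the parquet-hadoop jars need to be on the classpath):

    // Untested sketch: read the suspect file through plain parquet-mr.
    // If this also throws, the problem is likely in Parquet itself or
    // in the file rather than in Spark SQL.
    import org.apache.hadoop.fs.Path
    import parquet.hadoop.ParquetReader
    import parquet.hadoop.example.GroupReadSupport
    import parquet.example.data.Group

    val path = new Path("hdfs://<namenode>:9000/path/to/part-xxxxx.lzo.parquet")  // placeholder
    val reader = new ParquetReader[Group](path, new GroupReadSupport())
    var count = 0L
    var record = reader.read()   // returns null at end of file
    while (record != null) {
      count += 1
      record = reader.read()
    }
    reader.close()
    println(s"Read $count records")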

Cheng

On Sun, Nov 16, 2014 at 1:19 PM, Sadhan Sood <sadhan.s...@gmail.com> wrote:

    Hi Cheng,

    Thanks for your response. Here is the stack trace from yarn logs:

    Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
             at java.util.ArrayList.elementData(ArrayList.java:418)
             at java.util.ArrayList.get(ArrayList.java:431)
             at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
             at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
             at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
             at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
             at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:282)
             at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
             at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
             at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
             at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
             at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
             at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
             ... 26 more


    On Sat, Nov 15, 2014 at 9:28 AM, Cheng Lian <lian.cs....@gmail.com> wrote:

        Hi Sadhan,

        Could you please provide the stack trace of the
        |ArrayIndexOutOfBoundsException| (if any)? The reason the first
        query succeeds is that Spark SQL doesn't bother reading all of
        the data from the table just to compute |COUNT(*)|. In the
        second case, however, the |cacheTable| call asks for the whole
        table to be cached lazily, so the table is scanned to build the
        in-memory columnar cache, and something went wrong while
        scanning this LZO-compressed Parquet file. Unfortunately, the
        stack trace at hand doesn't indicate the root cause.
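
        One more way to narrow this down: forcing a full scan of every
        column without going through |cacheTable| would tell us whether
        the Parquet scan itself is what fails. A rough, untested sketch,
        reusing the table name from your snippet below:

        // Hypothetical check: materialize every row so the Parquet scan runs,
        // but skip the in-memory columnar cache entirely.
        sqlContext.sql("SELECT * FROM xyz_20141109").foreach(_ => ())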

        Cheng

        On 11/15/14 5:28 AM, Sadhan Sood wrote:

        While testing Spark SQL on a bunch of Parquet files (basically
        what used to be a partition of one of our Hive tables), I ran
        into this error:

        import org.apache.spark.sql.SchemaRDD
        import org.apache.hadoop.fs.FileSystem
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.Path

        val sqlContext = new org.apache.spark.sql.SQLContext(sc)

        val parquetFileRDD = sqlContext.parquetFile(parquetFile)
        parquetFileRDD.registerTempTable("xyz_20141109")
        sqlContext.sql("SELECT count(*) FROM xyz_20141109").collect() // works fine
        sqlContext.cacheTable("xyz_20141109")
        sqlContext.sql("SELECT count(*) FROM xyz_20141109").collect() // fails with an exception

        parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://xxxxxxxx::9000/event_logs/xyz/20141109/part-00009359b87ae-a949-3ded-ac3e-3a6bda3a4f3a-r-00009.lzo.parquet
                at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
                at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
                at org.apache.spark.rdd.NewHadoopRDD$anon$1.hasNext(NewHadoopRDD.scala:145)
                at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
                at scala.collection.Iterator$anon$11.hasNext(Iterator.scala:327)
                at scala.collection.Iterator$anon$14.hasNext(Iterator.scala:388)
                at org.apache.spark.sql.columnar.InMemoryRelation$anonfun$3$anon$1.hasNext(InMemoryColumnarTableScan.scala:136)
                at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:248)
                at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
                at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
                at org.apache.spark.scheduler.Task.run(Task.scala:56)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)

        Caused by: java.lang.ArrayIndexOutOfBoundsException


