Yes! I was being dumb and should have caught that earlier. Thank you, Cheng Lian.
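In case it helps anyone else: as Cheng points out below, the stack trace shows classes under the old `parquet.*` package, so the pre-Apache jars were still winning on the classpath. Here is a minimal build sketch of one way to force the real Apache 1.7.0 (assuming an sbt build; the exclusions and coordinates below are the usual Maven ones, adjust to your setup):

```scala
// build.sbt, a sketch rather than the exact fix from this thread.
// The pre-Apache Parquet releases (1.6.0rcX) ship under groupId
// "com.twitter" with Java package "parquet.*"; the first Apache release,
// 1.7.0, moved to "org.apache.parquet". Because the groupIds differ, both
// sets of jars can silently coexist on the classpath, and the old
// ParquetMetadataConverter (with the PARQUET-136 NPE) wins if loaded first.
libraryDependencies ++= Seq(
  // Exclude the old com.twitter parquet artifacts pulled in transitively.
  ("org.apache.spark" %% "spark-sql" % "1.4.1" % "provided")
    .exclude("com.twitter", "parquet-hadoop")
    .exclude("com.twitter", "parquet-column"),
  // Depend on the real Apache Parquet 1.7.0 explicitly.
  "org.apache.parquet" % "parquet-hadoop" % "1.7.0",
  "org.apache.parquet" % "parquet-column" % "1.7.0"
)
```

The same idea applies to a Maven build via `<exclusions>` on the spark-sql dependency.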
On Fri, Aug 7, 2015 at 4:25 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> It doesn't seem to be Parquet 1.7.0, since the package name isn't under
> "org.apache.parquet" (1.7.0 is the first official Apache release of
> Parquet). The version you were using is probably Parquet 1.6.0rc3,
> according to the line number information:
> https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L249
>
> And you're hitting PARQUET-136, which has been fixed in (the real)
> Parquet 1.7.0: https://issues.apache.org/jira/browse/PARQUET-136
>
> Cheng
>
> On 8/8/15 6:20 AM, Jerrick Hoang wrote:
>
>> Hi all,
>>
>> I have a partitioned Parquet table (a very small table with only 2
>> partitions). The Spark version is 1.4.1 and the Parquet version is
>> 1.7.0. I applied the [SPARK-7743] patch to Spark, so I assumed Spark
>> could read Parquet files normally. However, I'm getting the following
>> when trying a simple `select count(*) from table`:
>>
>> ```
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 29 in stage 44.0 failed 15 times, most recent failure: Lost task 29.14 in stage 44.0: java.lang.NullPointerException
>>     at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
>>     at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
>>     at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
>>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
>>     at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>>     at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
>>     at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
>>     at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
>>     at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
>>     at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>     at org.apache.spark.scheduler.Task.run(Task.scala:70)
>>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> ```
>>
>> Has anybody seen this before?
>>
>> Thanks
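P.S. For anyone who hits this later: a quick way to check which Parquet implementation a JVM actually loaded, along the lines of Cheng's package-name diagnosis above. A sketch to paste into spark-shell (`jarOf` is just an illustrative helper; the first class name is taken from the stack trace above, and the printed jar paths will vary per deployment):

```scala
// Sketch: locate the jar that provides the old, pre-Apache converter class
// seen in the stack trace. If the first lookup resolves to a 1.6.0rcX jar,
// that jar (not the intended 1.7.0) is handling the footer read that NPEs
// in fromParquetStatistics; the Apache 1.7.0 class lives under
// org.apache.parquet instead.
def jarOf(className: String): String =
  try {
    Option(Class.forName(className).getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse(s"$className: loaded from the bootstrap classpath")
  } catch {
    case _: ClassNotFoundException => s"$className: not on the classpath"
  }

println(jarOf("parquet.format.converter.ParquetMetadataConverter"))
println(jarOf("org.apache.parquet.format.converter.ParquetMetadataConverter"))
```

Note this only checks the driver's classpath; executors can have a different one, so the same check may be worth running inside a task as well.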