It doesn't seem to be Parquet 1.7.0, since the package names in your
stack trace aren't under "org.apache.parquet" (1.7.0 is the first
official Apache release of Parquet). Judging from the line numbers, the
version you're actually using is probably Parquet 1.6.0rc3:
https://github.com/apache/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/main/java/parquet/format/converter/ParquetMetadataConverter.java#L249
You're hitting PARQUET-136, which has been fixed in the real
Parquet 1.7.0: https://issues.apache.org/jira/browse/PARQUET-136
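A quick way to confirm which lineage of Parquet is actually on the
classpath is to probe for a class under each package. The sketch below
is illustrative only (the two reader class names are the pre- and
post-relocation entry points seen in your stack trace); it is not an
official diagnostic tool:

```java
// Illustrative classpath probe: Apache Parquet 1.7.0 relocated all classes
// from the bare "parquet" package to "org.apache.parquet", so checking which
// package is loadable tells you which lineage of the library you have.
public class ParquetVersionCheck {

    /** Returns true if the named class is loadable from the current classpath. */
    static boolean onClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("org.apache.parquet (1.7.0+): "
                + onClasspath("org.apache.parquet.hadoop.ParquetFileReader"));
        System.out.println("parquet (pre-1.7.0, e.g. 1.6.0rcX): "
                + onClasspath("parquet.hadoop.ParquetFileReader"));
    }
}
```

Run this on the same classpath Spark uses (driver and executors); if only
the bare "parquet" package resolves, you're still on a pre-1.7.0 build.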
Cheng
On 8/8/15 6:20 AM, Jerrick Hoang wrote:
Hi all,
I have a partitioned Parquet table (a very small table with only 2
partitions). The Spark version is 1.4.1 and the Parquet version is
1.7.0. I applied the [SPARK-7743] patch to Spark, so I assume Spark can
read Parquet files normally. However, I'm getting the following when
trying to do a simple `select count(*) from table`:
```
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 29 in stage 44.0 failed 15 times, most recent failure: Lost task
29.14 in stage 44.0: java.lang.NullPointerException
    at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
    at parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
    at parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:153)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
    at org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
Has anybody seen this before?
Thanks