I'm trying to upgrade our Spark deployment (currently on 3.2.1), but with Spark 3.3.0 and Spark 3.2.2 we hit the error below when reading a specific Parquet file.
Is anyone else having the same problem? Or should I provide more information to the devs?

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15
	at org.apache.parquet.format.Util.read(Util.java:365)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
	at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
	at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
	at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 15
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
	at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
	at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
	at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
	at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
	at org.apache.parquet.format.Util.read(Util.java:362)
	... 32 more
```

This looks similar to https://issues.apache.org/jira/browse/SPARK-11844, but in our case the error above is reproducible with:

```
df = spark.read.parquet("specific_file.snappy.parquet")
df.filter(df['event_name'] == "event1").dropDuplicates(["user_id"]).count()
```

- With Spark 3.2.1, we don't have this issue.
- With Spark 3.3.0/3.2.2, we can reproduce the error above.
- With Spark 3.3.0/3.2.2 and spark.sql.parquet.filterPushdown=false, the error disappears.

Not sure if it is related to https://issues.apache.org/jira/browse/SPARK-39393.

--
camper42 (曹丰宇)
Douban, Inc.
Mobile: +86 15691996359
E-mail: camper.x...@gmail.com
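P.S. In case it helps with triage, here is a minimal sketch of how we toggle the setting for a test run. It assumes an existing SparkSession named `spark` (as in the repro above); the file and column names are just our repro's, not anything special.

```
# Workaround sketch: disable Parquet filter pushdown at runtime, then re-run
# the failing query. On our side this makes the PageHeader error go away on
# Spark 3.3.0/3.2.2 -- it is a workaround, not a fix.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

df = spark.read.parquet("specific_file.snappy.parquet")

# Same query as the repro; with pushdown off it completes without the
# TProtocolException for us.
df.filter(df['event_name'] == "event1").dropDuplicates(["user_id"]).count()
```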