Hi Fengyu,

Do you still have the Parquet file that caused the error? Could you open a JIRA and attach the file to it? I can take a look.
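In the meantime, disabling Parquet filter pushdown per session should work around it, as you found. A minimal sketch, assuming a plain PySpark session and using the placeholder file path and column names from your report:

```python
# Minimal PySpark sketch of the reported repro plus the workaround.
# "specific_file.snappy.parquet", "event_name" and "user_id" are the
# placeholders from the original report, not real paths/columns.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-pageheader-repro").getOrCreate()

# Workaround from the report: disabling Parquet filter pushdown avoids the
# "don't know what type: 15" PageHeader error on Spark 3.2.2 / 3.3.0.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

df = spark.read.parquet("specific_file.snappy.parquet")
count = df.filter(df["event_name"] == "event1").dropDuplicates(["user_id"]).count()
print(count)
```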
Chao

On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao <camper.x...@gmail.com> wrote:
>
> I'm trying to upgrade our Spark (3.2.1 now),
>
> but with Spark 3.3.0 and Spark 3.2.2 we get an error with a specific Parquet file.
>
> Is anyone else having the same problem as me? Or do I need to provide any
> information to the devs?
>
> ```
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15
>     at org.apache.parquet.format.Util.read(Util.java:365)
>     at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>     at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
>     at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
>     at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
>     at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
>     at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
>     at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
>     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
>     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>     at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 15
>     at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
>     at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
>     at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
>     at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
>     at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
>     at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
>     at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>     at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>     at org.apache.parquet.format.Util.read(Util.java:362)
>     ... 32 more
> ```
>
> Similar to https://issues.apache.org/jira/browse/SPARK-11844, but we can reproduce the error above with:
>
> df = spark.read.parquet("specific_file.snappy.parquet")
> df.filter(df['event_name'] == "event1").dropDuplicates(["user_id",]).count()
>
> With Spark 3.2.1, we don't have this issue.
> With Spark 3.3.0/3.2.2, we can reproduce the error above.
> With Spark 3.3.0/3.2.2 and spark.sql.parquet.filterPushdown=false set, the error disappears.
>
> Not sure if it is related to https://issues.apache.org/jira/browse/SPARK-39393
>
> --
> camper42 (曹丰宇)
> Douban, Inc.
>
> Mobile: +86 15691996359
> E-mail: camper.x...@gmail.com

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org