I will open a JIRA, but since it's our production event log, I can't attach the file to it.
I'll try to set up a debugger to provide more information.

Chao Sun <sunc...@apache.org> wrote on Thu, Sep 1, 2022 at 23:06:
> Hi Fengyu,
>
> Do you still have the Parquet file that caused the error? Could you
> open a JIRA and attach the file to it? I can take a look.
>
> Chao
>
> On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao <camper.x...@gmail.com> wrote:
> >
> > I'm trying to upgrade our Spark (3.2.1 now),
> > but with Spark 3.3.0 and Spark 3.2.2 we got an error with a specific Parquet file.
> >
> > Is anyone else having the same problem as me? Or do I need to provide
> > any information to the devs?
> >
> > ```
> > org.apache.spark.SparkException: Job aborted due to stage failure:
> > Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3
> > in stage 1.0 (TID 7) (10.113.39.118 executor 1): java.io.IOException:
> > can not read class org.apache.parquet.format.PageHeader: don't know what type: 15
> >     at org.apache.parquet.format.Util.read(Util.java:365)
> >     at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
> >     at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
> >     at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
> >     at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
> >     at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
> >     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
> >     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
> >     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> >     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
> >     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> >     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
> >     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> >     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> >     at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> >     at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> >     at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> >     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> >     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> >     at org.apache.spark.scheduler.Task.run(Task.scala:131)
> >     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> >     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
> >     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> >     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> >     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >     at java.base/java.lang.Thread.run(Unknown Source)
> > Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 15
> >     at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
> >     at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
> >     at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
> >     at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> >     at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
> >     at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
> >     at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
> >     at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
> >     at org.apache.parquet.format.Util.read(Util.java:362)
> >     ... 32 more
> > ```
> >
> > similar to https://issues.apache.org/jira/browse/SPARK-11844, but we
> > can reproduce the error above with:
> >
> > df = spark.read.parquet("specific_file.snappy.parquet")
> > df.filter(df['event_name'] == "event1").dropDuplicates(["user_id",]).count()
> >
> > with Spark 3.2.1, we don't have this issue
> > with Spark 3.3.0/3.2.2, we can reproduce the error above
> > with Spark 3.3.0/3.2.2 and spark.sql.parquet.filterPushdown=false set, the error disappears
> >
> > not sure if it is related to https://issues.apache.org/jira/browse/SPARK-39393
> >
> > --
> > camper42 (曹丰宇)
> > Douban, Inc.
> > Mobile: +86 15691996359
> > E-mail: camper.x...@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
camper42 (曹丰宇)
Douban, Inc.
Mobile: +86 15691996359
E-mail: camper.x...@gmail.com
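For context on the "don't know what type: 15" at the bottom of the trace: Thrift's compact protocol packs each field header into a single byte whose low nibble is a type id, and only a small fixed set of ids is defined, so a nibble of 15 means the reader has lost sync with the byte stream and is decoding bytes that are not actually a PageHeader field. The sketch below is illustrative plain Python mimicking the shape of the check in TCompactProtocol.getTType, not Parquet's actual shaded Thrift code; the type table follows the compact-protocol wire format as I understand it.

```python
# Illustrative sketch of the Thrift compact-protocol field-type check
# (the real check lives in the shaded TCompactProtocol.getTType).
# A field header byte is roughly: (field-id delta << 4) | type-nibble.

COMPACT_TYPES = {
    0: "STOP",
    1: "BOOLEAN_TRUE",
    2: "BOOLEAN_FALSE",
    3: "BYTE",
    4: "I16",
    5: "I32",
    6: "I64",
    7: "DOUBLE",
    8: "BINARY",
    9: "LIST",
    10: "SET",
    11: "MAP",
    12: "STRUCT",
}

def get_ttype(header_byte: int) -> str:
    """Map the low nibble of a field-header byte to a compact-protocol type name."""
    type_nibble = header_byte & 0x0F
    if type_nibble not in COMPACT_TYPES:
        # Same situation as in the stack trace: a nibble outside the known ids.
        raise ValueError(f"don't know what type: {type_nibble}")
    return COMPACT_TYPES[type_nibble]

print(get_ttype(0x15))  # low nibble 5 -> I32, a plausible header byte
try:
    get_ttype(0x0F)     # low nibble 15: no such type
except ValueError as err:
    print(err)          # don't know what type: 15
```

If that reading is right, it would be consistent with the filterPushdown observation above: pushdown changes which pages and offsets the reader seeks to, so a bad offset or a page-index mismatch could surface as a nonsense type nibble only on the filtered read path.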