Hi Fengyu,

Do you still have the Parquet file that caused the error? Could you
open a JIRA and attach the file to it? I can take a look.

Chao

On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao <camper.x...@gmail.com> wrote:
>
> I'm trying to upgrade our Spark deployment (currently on 3.2.1),
>
> but with Spark 3.3.0 and Spark 3.2.2 we get an error on a specific Parquet file.
>
> Is anyone else seeing the same problem? Or do I need to provide any
> information to the devs?
>
> ```
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15
> at org.apache.parquet.format.Util.read(Util.java:365)
> at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
> at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
> at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
> at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
> at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
> at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
> at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
> at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.base/java.lang.Thread.run(Unknown Source)
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 15
> at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
> at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
> at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
> at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
> at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
> at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
> at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
> at org.apache.parquet.format.Util.read(Util.java:362)
> ... 32 more
>
>
> ```
>
> Similar to https://issues.apache.org/jira/browse/SPARK-11844, but we can
> reproduce the error above with:
>
> df = spark.read.parquet("specific_file.snappy.parquet")
> df.filter(df['event_name'] == "event1").dropDuplicates(["user_id",]).count()
>
> with Spark 3.2.1, we don't have this issue
> with Spark 3.3.0/3.2.2, we can reproduce the error above
> with Spark 3.3.0/3.2.2 and spark.sql.parquet.filterPushdown=false set, the
> error disappears
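>
> For reference, the workaround can be applied like this (a minimal PySpark
> sketch; the session setup, app name, and file path are placeholders, only
> the spark.sql.parquet.filterPushdown config is the actual knob we changed):
>
> ```
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("parquet-pushdown-workaround").getOrCreate()
>
> # Disable Parquet filter pushdown; with this config set to false, the
> # PageHeader read error above no longer occurs for us.
> spark.conf.set("spark.sql.parquet.filterPushdown", "false")
>
> df = spark.read.parquet("specific_file.snappy.parquet")
> df.filter(df["event_name"] == "event1").dropDuplicates(["user_id"]).count()
> ```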
>
> not sure if it's related to https://issues.apache.org/jira/browse/SPARK-39393
>
> --
> camper42 (曹丰宇)
> Douban, Inc.
>
> Mobile: +86 15691996359
> E-mail:  camper.x...@gmail.com
