I'm trying to upgrade our Spark cluster (currently on 3.2.1),

but with Spark 3.3.0 and Spark 3.2.2 we hit an error on a specific Parquet
file.

Is anyone else seeing the same problem? Or should I provide more
information to the devs?

```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15
        at org.apache.parquet.format.Util.read(Util.java:365)
        at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
        at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
        at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
        at org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
        at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
        at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 15
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
        at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
        at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
        at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
        at shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
        at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
        at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
        at org.apache.parquet.format.Util.read(Util.java:362)
        ... 32 more
```

This looks similar to https://issues.apache.org/jira/browse/SPARK-11844,
but in our case the error above is reproducible:


```
df = spark.read.parquet("specific_file.snappy.parquet")
df.filter(df['event_name'] == "event1").dropDuplicates(["user_id"]).count()
```

With Spark 3.2.1, we don't have this issue.
With Spark 3.3.0/3.2.2, we can reproduce the error above.
With Spark 3.3.0/3.2.2 and spark.sql.parquet.filterPushdown=false, the
error disappears.

Not sure if it's related to https://issues.apache.org/jira/browse/SPARK-39393.
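If it helps with triage, one way to check whether the file itself is
corrupt is to decode it with an independent Parquet reader; a minimal
sketch with pyarrow (an assumption on our side, pyarrow isn't part of
the repro, and the file name is the placeholder from above):

```
import pyarrow.parquet as pq

# Print the footer / row-group metadata; a corrupted footer or page
# header would usually surface here as well.
pf = pq.ParquetFile("specific_file.snappy.parquet")
print(pf.metadata)

# Fully decode the file to exercise every page.
print(pf.read().num_rows)
```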

--
*camper42 (曹丰宇)*
Douban, Inc.

Mobile: +86 15691996359
E-mail:  camper.x...@gmail.com
