I will open a JIRA, but since the file comes from our production event log, I
can't attach it to the ticket.

I'll try to set up a debugger to provide more information; a rough sketch of the plan is below.
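Roughly how I plan to wire it up (a minimal sketch, assuming a small local
reproduction; the JDWP port and the file path are placeholders, not our
production values):

```
# PySpark session whose executor JVM accepts a remote debugger, so we can
# step into ParquetFileReader.readPageHeader when the failing chunk is read.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("parquet-pageheader-debug")
    .config(
        "spark.executor.extraJavaOptions",
        "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005",
    )
    .getOrCreate()
)

df = spark.read.parquet("specific_file.snappy.parquet")
df.filter(df["event_name"] == "event1").dropDuplicates(["user_id"]).count()
```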

Chao Sun <sunc...@apache.org> wrote on Thu, Sep 1, 2022 at 23:06:

> Hi Fengyu,
>
> Do you still have the Parquet file that caused the error? Could you
> open a JIRA and attach the file to it? I can take a look.
>
> Chao
>
> On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao <camper.x...@gmail.com> wrote:
> >
> > I'm trying to upgrade our Spark (currently 3.2.1),
> >
> > but with Spark 3.3.0 and Spark 3.2.2 we hit an error with a specific
> Parquet file.
> >
> > Is anyone else seeing the same problem? Or do I need to provide any more
> information to the devs?
> >
> > ```
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage
> 1.0 (TID 7) (10.113.39.118 executor 1): java.io.IOException: can not read
> class org.apache.parquet.format.PageHeader: don't know what type: 15
> > at org.apache.parquet.format.Util.read(Util.java:365)
> > at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
> > at
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1382)
> > at
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1429)
> > at
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
> > at
> org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
> > at
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
> > at
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:972)
> > at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:338)
> > at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:293)
> > at
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:196)
> > at
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> > at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> > at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:191)
> > at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
> > at
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:522)
> > at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
> Source)
> > at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown
> Source)
> > at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
> > at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> > at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
> > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> > at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> > at
> org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
> > at org.apache.spark.scheduler.Task.run(Task.scala:131)
> > at
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
> > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
> > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> > at java.base/java.lang.Thread.run(Unknown Source)
> > Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException:
> don't know what type: 15
> > at
> shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:894)
> > at
> shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:560)
> > at
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
> > at
> shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> > at
> shaded.parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
> > at
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1100)
> > at
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
> > at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
> > at org.apache.parquet.format.Util.read(Util.java:362)
> > ... 32 more
> >
> >
> > ```
> >
> > This looks similar to https://issues.apache.org/jira/browse/SPARK-11844,
> but we can reliably reproduce the error above with:
> >
> > df = spark.read.parquet("specific_file.snappy.parquet")
> > df.filter(df['event_name'] == "event1").dropDuplicates(["user_id"]).count()
> >
> > With Spark 3.2.1, we don't have this issue.
> > With Spark 3.3.0/3.2.2, we can reproduce the error above.
> > With Spark 3.3.0/3.2.2 and spark.sql.parquet.filterPushdown=false, the
> error disappears (snippet below).
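> >
> > For completeness, a minimal sketch of how we toggle that setting when
> testing (the config call is the standard runtime conf API; nothing else is
> changed):
> >
> > ```
> > # Disabling Parquet filter pushdown avoids the error on 3.2.2 / 3.3.0.
> > spark.conf.set("spark.sql.parquet.filterPushdown", "false")
> >
> > df = spark.read.parquet("specific_file.snappy.parquet")
> > df.filter(df['event_name'] == "event1").dropDuplicates(["user_id"]).count()
> > ```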
> >
> > Not sure if it is related to
> https://issues.apache.org/jira/browse/SPARK-39393.
> >
> > --
> > camper42 (曹丰宇)
> > Douban, Inc.
> >
> > Mobile: +86 15691996359
> > E-mail:  camper.x...@gmail.com
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
*camper42 (曹丰宇)*
Douban, Inc.

Mobile: +86 15691996359
E-mail:  camper.x...@gmail.com
