RushabhK commented on PR #9844: URL: https://github.com/apache/incubator-gluten/pull/9844#issuecomment-2943073257
> > @JkSelf I tested this change on my setup. It's still giving the same exception: `is not a Parquet file. Expected magic number at tail, but found [2, 0, 0, 0]`. This file is ~250 MB in size.
> >
> > This is the complete stack trace:
> >
> > ```
> > Py4JJavaError: An error occurred while calling o135.count.
> > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1161 in stage 1.0 failed 4 times, most recent failure: Lost task 1161.3 in stage 1.0 (TID 1208) (241.130.178.8 executor 2): java.lang.RuntimeException: gs://<some_path>/gluten-part-d0a3b6a4-ccc9-41b3-a44e-34177ab18674.zstd.parquet is not a Parquet file. Expected magic number at tail, but found [2, 0, 0, 0]
> > 	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
> > 	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
> > 	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)
> > 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:219)
> > 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:282)
> > 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
> > 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
> > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
> > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> > 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> > 	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> > 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> > 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> > 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> > 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
> > 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> > 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
> > 	at org.apache.spark.scheduler.Task.run(Task.scala:141)
> > 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
> > 	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
> > 	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
> > 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
> > 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
> > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > 	at java.lang.Thread.run(Thread.java:750)
> >
> > Driver stacktrace:
> > 	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
> > 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
> > 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
> > 	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> > 	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> > 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> > 	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
> > 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
> > 	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
> > 	at scala.Option.foreach(Option.scala:407)
> > 	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
> > 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
> > 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
> > 	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
> > 	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> > Caused by: java.lang.RuntimeException: gs://<some_path>/gluten-part-d0a3b6a4-ccc9-41b3-a44e-34177ab18674.zstd.parquet is not a Parquet file. Expected magic number at tail, but found [2, 0, 0, 0]
> > 	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:565)
> > 	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
> > 	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:85)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:71)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:66)
> > 	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)
> > 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:219)
> > 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:282)
> > 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:131)
> > 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:593)
> > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
> > 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> > 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> > 	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
> > 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
> > 	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
> > 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
> > 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
> > 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
> > 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
> > 	at org.apache.spark.scheduler.Task.run(Task.scala:141)
> > 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
> > 	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
> > 	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
> > 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
> > 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
> > 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > 	at java.lang.Thread.run(Thread.java:750)
> > [Stage 1:======================================> (1168 + 4) / 1627]
> > ```
>
> @RushabhK Ok. Can you help to provide the reproduced code? Thanks.

@JkSelf I can elaborate on how I am testing this in the following steps:

1. I took the Gluten build with these changes and built my new Spark image.
2. I have a Spark job that writes Parquet with 300 tasks, configured with 8 cores per executor.
3. While it is writing from the 300 tasks, I kill one of the executors (8 failed tasks); the job retries the tasks and then finishes.
4. I then try reading the Parquet files and just do a `df.count()` on them to materialize the read. This is when I encounter the exception above.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
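As background for the error discussed above: a valid Parquet file always ends with the 4-byte magic `PAR1`, and the reader's "Expected magic number at tail" message means the file's last 4 bytes were something else (here `[2, 0, 0, 0]`), which is typical of a partially written file left behind by a killed task. A minimal sketch (not from the thread; the function name is hypothetical) for checking a suspect file's tail:

```python
import os

PARQUET_MAGIC = b"PAR1"  # every valid Parquet file ends with these 4 bytes


def has_parquet_tail_magic(path: str) -> bool:
    """Return True if the file at `path` ends with the Parquet magic bytes.

    A False result on a *.parquet file suggests a truncated or partial
    write, e.g. from an executor killed mid-write, matching the
    "Expected magic number at tail" error above.
    """
    size = os.path.getsize(path)
    if size < len(PARQUET_MAGIC):
        return False  # too small to even hold the magic
    with open(path, "rb") as f:
        f.seek(size - len(PARQUET_MAGIC))
        return f.read(len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

Running this over the output directory before the `df.count()` would distinguish genuinely corrupt files from a reader-side bug.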
