[ 
https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-36255.
--------------------------
    Fix Version/s: 3.3.0
                   3.2.0
         Assignee: Chandni Singh
       Resolution: Fixed

Issue resolved by https://github.com/apache/spark/pull/33477

> FileNotFoundException from the shuffle push can cause the executor to 
> terminate
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-36255
>                 URL: https://issues.apache.org/jira/browse/SPARK-36255
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Shuffle
>    Affects Versions: 3.1.0
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>             Fix For: 3.2.0, 3.3.0
>
>
> Once the shuffle is cleaned up by the {{ContextCleaner}}, the shuffle files 
> are deleted by the executors. In this case, the push of the shuffle data by 
> the executors can throw {{FileNotFoundException}}. When this exception is 
> thrown from the {{shuffle-block-push-thread}}, it causes the executor to 
> fail. This is because of the default uncaught exception handler for Spark 
> daemon threads which terminates the executor when there are uncaught 
> exceptions for the daemon threads.
> {code:java}
> 21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[block-push-thread-1,5,main]
> java.lang.Error: java.io.IOException: Error in opening 
> FileSegmentManagedBuffer
> {file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
>  offset=10640, length=190}
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Error in opening 
> FileSegmentManagedBuffer\{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data,
>  offset=10640, length=190}
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
> at 
> org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
> at 
> org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
> at 
> org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
> at 
> org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> ... 2 more
> Caused by: java.io.FileNotFoundException: 
> ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data
>  (No such file or directory)
> at java.io.RandomAccessFile.open0(Native Method)
> at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
> at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
> at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
> {code}
> We can address the issue by handling "FileNotFound" exceptions in the push 
> threads and netty threads by stopping the push when {{FileNotFound}} is 
> encountered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to