[ https://issues.apache.org/jira/browse/SPARK-36255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chandni Singh updated SPARK-36255:
----------------------------------
    Summary: FileNotFound exceptions from the shuffle push can cause the executor to terminate  (was: FileNotFound exceptions in the Shuffle-push-thread can cause the executor to fail)

> FileNotFound exceptions from the shuffle push can cause the executor to terminate
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-36255
>                 URL: https://issues.apache.org/jira/browse/SPARK-36255
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Shuffle
>    Affects Versions: 3.1.0
>            Reporter: Chandni Singh
>            Priority: Major
>
> When the executors clean up shuffle files after a job in a Spark
> application completes, an in-progress push of shuffle data by the executor
> can throw a FileNotFound exception. When this exception is thrown from the
> {{shuffle-block-push-thread}}, it causes the executor to fail. The cause is
> the default uncaught exception handler for Spark daemon threads, which
> terminates the executor whenever a daemon thread throws an uncaught
> exception.
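The mechanism described above can be reproduced in miniature. The following is a minimal Java sketch, not Spark's code: the class name, thread name, and the recording handler are all assumptions made for illustration. A pool thread runs a task via `execute`, the task throws an exception caused by a `FileNotFoundException`, nothing catches it, and the thread's uncaught exception handler is invoked. In the sketch the handler only records the throwable; per the description, Spark's handler terminates the executor at this point. Because a Java `Runnable` cannot throw a checked `IOException`, the sketch wraps it in `UncheckedIOException`.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class; it does not use any Spark API.
class UncaughtPushFailure {

    /** Runs one failing task and returns whatever the uncaught exception handler saw. */
    static Throwable runFailingTask() {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        CountDownLatch handled = new CountDownLatch(1);

        // Every pool thread is a daemon thread with a handler attached.
        // This stand-in handler records the error instead of ending the process.
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, "block-push-thread-demo");
            t.setDaemon(true);
            t.setUncaughtExceptionHandler((thread, err) -> {
                seen.set(err);
                handled.countDown();
            });
            return t;
        };

        ExecutorService pool = Executors.newFixedThreadPool(1, factory);
        try {
            // execute(), unlike submit(), lets the exception escape the worker thread.
            pool.execute(() -> {
                throw new UncheckedIOException(
                    new FileNotFoundException("shuffle file removed before push"));
            });
            handled.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return seen.get();
    }

    public static void main(String[] args) {
        System.out.println("handler saw: " + runFailingTask());
    }
}
```

The point of the sketch is that the task itself is the only place where the failure can be handled before it reaches a process-wide handler.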
> {code:java}
> 21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
> java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer {file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
>         at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
>         at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
>         at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
>         at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
>         at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         ... 2 more
> Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
>         at java.io.RandomAccessFile.open0(Native Method)
>         at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
>         at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
>         at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
> {code}
> We can address the issue by handling {{FileNotFound}} exceptions in the push
> threads and Netty threads, stopping the push when one is encountered.
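The proposed handling can be sketched as follows. This is an illustrative Java example under stated assumptions, not the actual Spark change: `PushStopsOnMissingFile`, `readBlock`, `pushAll`, and `demo` are hypothetical names, and "pushing" a block is reduced to reading its file. The sketch checks the whole cause chain, because in the trace the `FileNotFoundException` arrives wrapped in an `IOException`, and it stops the push instead of letting the exception escape the thread.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it does not use any Spark API.
class PushStopsOnMissingFile {

    /** True if t or any of its causes is a FileNotFoundException. */
    static boolean causedByFileNotFound(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof FileNotFoundException) {
                return true;
            }
        }
        return false;
    }

    /** Stand-in for reading one block; wraps a missing file the way the trace shows. */
    static byte[] readBlock(File f) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            byte[] buf = new byte[(int) raf.length()];
            raf.readFully(buf);
            return buf;
        } catch (FileNotFoundException e) {
            throw new IOException("Error in opening " + f, e);
        }
    }

    /** Pushes blocks in order and returns how many were pushed before stopping. */
    static int pushAll(List<File> blocks) {
        int pushed = 0;
        for (File f : blocks) {
            try {
                readBlock(f);
                pushed++;
            } catch (IOException e) {
                if (causedByFileNotFound(e)) {
                    // The shuffle files are gone: stop pushing rather than rethrow.
                    break;
                }
                throw new UncheckedIOException(e); // other I/O errors still surface
            }
        }
        return pushed;
    }

    /** Creates three temp files, deletes the second, and runs the push over all three. */
    static int demo() {
        List<File> blocks = new ArrayList<>();
        try {
            for (int i = 0; i < 3; i++) {
                Path p = Files.createTempFile("shuffle_demo_" + i + "_", ".data");
                Files.write(p, new byte[] {1, 2, 3});
                blocks.add(p.toFile());
            }
            Files.delete(blocks.get(1).toPath()); // simulate cleanup during the push
            return pushAll(blocks);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (File f : blocks) {
                f.delete();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("blocks pushed before stopping: " + demo());
    }
}
```

Stopping at the first missing file, rather than skipping it and continuing, follows the wording of the proposal; whether remaining blocks should still be attempted is a design decision the sketch does not settle.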