[ https://issues.apache.org/jira/browse/SPARK-26612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liupengcheng updated SPARK-26612:
---------------------------------
    Fix Version/s: 2.3.1
                   2.4.0

> Speculation kill causing finished stage recomputed
> --------------------------------------------------
>
>                 Key: SPARK-26612
>                 URL: https://issues.apache.org/jira/browse/SPARK-26612
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.1.0
>            Reporter: liupengcheng
>            Priority: Major
>             Fix For: 2.2.2, 2.3.1, 2.4.0
>
> In our production Spark cluster, we encountered this issue.
> A more detailed explanation:
> Let's say we have two stages, stage 0 and stage 1, where stage 0 is a ShuffleMapStage, stage 1 depends on stage 0, and spark.speculation is enabled.
> When task 0.0 of stage 1.0 finishes and tries to kill task 0.1 (its speculative copy) of stage 1.0, task 0.1 throws a wrapped FetchFailedException whose root cause is a java.nio.channels.ClosedByInterruptException (caused by the speculation kill).
> Exception stack:
> {code:java}
> at org.apache.spark.shuffle.BlockStoreShuffleReader$ProcessFetchFailedIterator$$anonfun$hasNext$1.apply$mcZ$sp(BlockStoreShuffleReader.scala:148)
> at org.apache.spark.shuffle.BlockStoreShuffleReader$ProcessFetchFailedIterator$$anonfun$hasNext$1.apply(BlockStoreShuffleReader.scala:148)
> at org.apache.spark.shuffle.BlockStoreShuffleReader$ProcessFetchFailedIterator$$anonfun$hasNext$1.apply(BlockStoreShuffleReader.scala:148)
> at org.apache.spark.shuffle.BlockStoreShuffleReader$ProcessFetchFailedIterator.tryThrowFetchFailedException(BlockStoreShuffleReader.scala:127)
> at org.apache.spark.shuffle.BlockStoreShuffleReader$ProcessFetchFailedIterator.hasNext(BlockStoreShuffleReader.scala:148)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=/home/work/hdd6/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1505730831071_76097/blockmgr-bb226ff8-dd5f-4296-b3cc-ce7ff5cc60cc/37/shuffle_1_1182_0.data, offset=17789166, length=35709}
> at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:114)
> at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:371)
> ... 26 more
> Caused by: java.nio.channels.ClosedByInterruptException
> at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
> at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:155)
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
> {code}
> This problem still seems to exist in the latest Spark version!
> A FetchFailedException might be thrown in ShuffleBlockFetcherIterator.next when the task is reading a local shuffle block or encounters stream corruption.
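
For context, a minimal sketch of the kind of two-stage job and speculation settings under which this scenario can arise. The job shape and configuration values below are illustrative assumptions, not taken from the report:

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal two-stage job: reduceByKey introduces a shuffle, so stage 0
// is a ShuffleMapStage and stage 1 reads its output. With speculation
// enabled, a finished stage-1 task may kill its slower speculative
// copy while that copy is still reading shuffle blocks.
val spark = SparkSession.builder()
  .appName("SPARK-26612-sketch")
  .config("spark.speculation", "true")
  .config("spark.speculation.quantile", "0.75")  // illustrative value
  .config("spark.speculation.multiplier", "1.5") // illustrative value
  .getOrCreate()

val sc = spark.sparkContext
sc.parallelize(1 to 1000000, 100) // stage 0: map side
  .map(i => (i % 100, 1))
  .reduceByKey(_ + _)             // shuffle boundary
  .count()                        // stage 1: reduce side
{code}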
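The defect, restated: the interrupt from the speculation kill surfaces as a ClosedByInterruptException, the shuffle reader reports it as a FetchFailedException, and the DAGScheduler then treats stage 0's map output as lost and recomputes the finished stage. Below is a minimal sketch of a guard that distinguishes the two cases; guardAgainstInterrupt is a hypothetical helper for illustration, not Spark's actual fix:

{code:scala}
import java.nio.channels.ClosedByInterruptException

import org.apache.spark.{TaskContext, TaskKilledException}

// Hypothetical guard, sketched for illustration: before a local-read
// failure is converted into a FetchFailedException, check whether the
// task was interrupted (e.g. by a speculation kill). If so, report it
// as a kill rather than a fetch failure, so the scheduler does not
// invalidate the parent stage's map output and rerun it.
def guardAgainstInterrupt(cause: Throwable): Nothing = {
  val ctx = TaskContext.get()
  val interrupted = cause.isInstanceOf[ClosedByInterruptException] ||
    (ctx != null && ctx.isInterrupted())
  if (interrupted) {
    throw new TaskKilledException("killed during shuffle read")
  }
  throw cause // genuine fetch failure: propagate unchanged
}
{code}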