Hi,

We are seeing an issue with Spark 1.6.1: an executor exits when one of its running tasks is cancelled.
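For context, the cancellation is triggered roughly along the lines of the sketch below (the job group name, job body, and timings are illustrative, not our exact code). We set interruptOnCancel, so cancelling the group interrupts the task threads:

    import org.apache.spark.{SparkConf, SparkContext}

    object CancelRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cancel-repro"))

        // Run a long job in a named group on a background thread.
        val worker = new Thread(new Runnable {
          override def run(): Unit = {
            sc.setJobGroup("group-1", "long-running job", interruptOnCancel = true)
            sc.parallelize(1 to 1000000, 8)
              .map { i => Thread.sleep(1); i }
              .count()
          }
        })
        worker.start()

        Thread.sleep(5000)
        // With interruptOnCancel = true, cancellation interrupts the task
        // threads; this is when the executor logs ClosedByInterruptException.
        sc.cancelJobGroup("group-1")
        worker.join()
        sc.stop()
      }
    }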
The executor log shows the error below before the crash:

    16/04/27 16:34:13 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
    java.lang.Error: java.nio.channels.ClosedByInterruptException
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.nio.channels.ClosedByInterruptException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
        at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
        at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
        at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
        at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
        at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
        at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
        at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
        at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
        at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:252)
        at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:195)
        at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
        at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:132)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:288)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        ... 2 more
I have attached the full logs in this gist: https://gist.github.com/kiranchitturi/3bd3a083a7c956cff73040c1a140c88f

On the driver side, the following is logged (same gist: https://gist.github.com/kiranchitturi/3bd3a083a7c956cff73040c1a140c88f). These lines show that the executor exited because of one of the running tasks:

    2016-04-27T16:34:13,723 - WARN [dispatcher-event-loop-1:Logging$class@70] - Lost task 0.0 in stage 89.0 (TID 173, 10.0.0.42): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    2016-04-27T16:34:13,723 - WARN [dispatcher-event-loop-1:Logging$class@70] - Lost task 1.0 in stage 89.0 (TID 174, 10.0.0.42): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Remote RPC client disassociated.

Is it possible for an executor to die when the jobs in the SparkContext are cancelled? Apart from https://issues.apache.org/jira/browse/SPARK-14234, I could not find any JIRAs that report this error.

Sometimes we also see a scenario where the executor dies and the driver does not request a new one, which causes the jobs to hang indefinitely. We are using dynamic allocation for our jobs (relevant settings in the P.S. below).

Thanks,
Kiran Chitturi
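P.S. For reference, the dynamic-allocation-related parts of our spark-defaults.conf look roughly like this (the min/max executor counts are illustrative, not our exact numbers):

    spark.dynamicAllocation.enabled        true
    spark.shuffle.service.enabled          true
    spark.dynamicAllocation.minExecutors   1
    spark.dynamicAllocation.maxExecutors   10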