[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption
[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453414#comment-15453414 ]

Barry Becker commented on SPARK-14234:
--------------------------------------

Is it a lot of work to backport this fix to 1.6.3? We have an app that requires it. We also require job-server, and that does not look like it will be supporting 2.0.0 anytime soon.

> Executor crashes for TaskRunner thread interruption
> ---------------------------------------------------
>
>                 Key: SPARK-14234
>                 URL: https://issues.apache.org/jira/browse/SPARK-14234
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Devaraj K
>            Assignee: Devaraj K
>             Fix For: 2.0.0
>
>
> If the TaskRunner thread gets interrupted while running, due to a task kill or any other reason, the interrupted thread tries to update the task status as part of the exception handling and fails with the exception below. This happens in the statusUpdate calls made from all of the following catch blocks; the corresponding exceptions for these catch cases are shown after the code.
> {code:title=Executor.scala|borderStyle=solid}
>         case _: TaskKilledException | _: InterruptedException if task.killed =>
>           ..
>         case cDE: CommitDeniedException =>
>           ..
>         case t: Throwable =>
>           ..
> {code}
> {code:xml}
> 16/03/29 17:32:33 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
>         at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
>         at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
>         at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>         at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
>         at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
>         at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         ... 2 more
> {code}
> {code:xml}
> 16/03/29 08:00:29 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>         at java.util.concurrent.ThreadPoolExecutor.runWorker
> {code}
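The traces above reflect a general JDK contract rather than anything Spark-specific: CoarseGrainedExecutorBackend.statusUpdate serializes the update through java.nio.channels.Channels$WritableByteChannelImpl, which is an interruptible channel, so any write attempted while the calling thread's interrupt status is still set (as it is right after a task kill) fails with ClosedByInterruptException. Below is a minimal standalone Scala sketch of that contract; it is illustrative only, not the code of the actual fix:
{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.{Channels, ClosedByInterruptException}

// Standalone demo (not Spark code) of the behavior in the stack traces:
// Channels.newChannel wraps the stream in an InterruptibleChannel (the
// Channels$WritableByteChannelImpl frame above), and a write made while
// the thread's interrupt status is set throws ClosedByInterruptException.
object InterruptedWriteDemo {
  def main(args: Array[String]): Unit = {
    def freshChannel() = Channels.newChannel(new ByteArrayOutputStream())

    Thread.currentThread().interrupt() // simulate the task-kill interrupt
    try {
      freshChannel().write(ByteBuffer.wrap(Array[Byte](1, 2, 3)))
    } catch {
      case e: ClosedByInterruptException =>
        println(s"write failed, as in the executor logs: $e")
    }

    // Thread.interrupted() reads and clears the interrupt status, after
    // which a write on a fresh channel succeeds. A channel that was
    // already closed by an interrupt stays closed, however.
    Thread.interrupted()
    val n = freshChannel().write(ByteBuffer.wrap(Array[Byte](1, 2, 3)))
    println(s"wrote $n bytes after clearing the interrupt flag")
  }
}
{code}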
[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption
[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379041#comment-15379041 ]

Josef Lindman Hörnlund commented on SPARK-14234:
------------------------------------------------

+1 for backporting
[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption
[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298333#comment-15298333 ]

Barry Becker commented on SPARK-14234:
--------------------------------------

Will this fix be back-ported to 1.6.x? We are encountering what appears to be this same issue when using Spark 1.6.1 and jobserver 0.6.2. Looking into the logs, we narrowed the problem down to the killing of a task, and we can reliably reproduce it by killing two tasks in a row. It appears that the Mesos slave gets blacklisted after the repeated failure and never comes back up. The first time a task is killed, we see this in the spark-job-server.log file:
{code}
[2016-04-22 10:11:56,919] INFO  k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af killed
[2016-04-22 10:11:56,921] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af
[2016-04-22 10:11:56,920] INFO  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 99
[2016-04-22 10:11:56,920] INFO  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 99 was cancelled
[2016-04-22 10:11:56,924] INFO  he.spark.executor.Executor [] [] - Executor is trying to kill task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,924] INFO  he.spark.executor.Executor [] [] - Executor is trying to kill task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO  he.spark.executor.Executor [] [] - Executor killed task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO  he.spark.executor.Executor [] [] - Executor killed task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,933] ERROR rkUncaughtExceptionHandler [] [] - Uncaught exception in thread Thread[Executor task launch worker-25,5,main]
java.lang.Error: java.nio.channels.ClosedByInterruptException
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedByInterruptException
{code}
A few minutes later, another task gets killed:
{code}
[2016-04-22 10:16:49,890] INFO  k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job cf0c58e9-6496-4d5d-8a6f-0072ca742e33 killed
[2016-04-22 10:16:49,891] INFO  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 101
[2016-04-22 10:16:49,891] INFO  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 101 was cancelled
[2016-04-22 10:16:49,892] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id cf0c58e9-6496-4d5d-8a6f-0072ca742e33
[2016-04-22 10:16:50,254] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Lost executor 20160216-173849-2066065046-5050-48639-S0 on ra.engr.sgi.com: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN  k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 0.0 in stage 101.0 (TID 738, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN  k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 1.0 in stage 101.0 (TID 739, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] INFO  cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Removed TaskSet 101.0, whose tasks have all completed, from pool
[2016-04-22 10:16:50,255] INFO  BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Trying to remove executor 20160216-173849-2066065046-5050-48639-S0 from BlockManagerMaster.
[2016-04-22 10:16:50,255] INFO  BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Removing block manager BlockManagerId(20160216-173849-2066065046-5050-48639-S0, ra.engr.sgi.com, 46374)
[2016-04-22 10:16:50,255] INFO  storage.BlockManagerMaster [] [akka://JobServer/user/context-supervisor/sql-context] - Removed 20160216-173849-2066065046-5050-48639-S0 successf
{code}
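For context, a kill like the one in these logs can also be triggered without job-server by cancelling a job group with task interruption enabled. A rough Scala sketch of such a repro follows; the app name, group id, and workload here are made up for illustration:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical repro: cancel a job group with interruptOnCancel = true so
// that the executors interrupt the running TaskRunner threads, which is
// the code path behind the ClosedByInterruptException in the logs above.
object KillTaskRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kill-task-repro"))

    val job = new Thread(new Runnable {
      override def run(): Unit = {
        // setJobGroup is per-thread; interruptOnCancel = true makes a later
        // cancellation interrupt the task threads instead of just flagging them.
        sc.setJobGroup("repro-group", "job to be killed", interruptOnCancel = true)
        try {
          sc.parallelize(1 to 1000000, 8).map { i => Thread.sleep(1); i }.count()
        } catch {
          case _: Exception => // expected: the job is cancelled below
        }
      }
    })
    job.start()

    Thread.sleep(5000)               // let some tasks start running
    sc.cancelJobGroup("repro-group") // kills the running tasks on the executors
    job.join()
    sc.stop()
  }
}
{code}
Killing two tasks in a row, as described above, is then a matter of running this cancel cycle twice against the same context.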
[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption
[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215679#comment-15215679 ]

Apache Spark commented on SPARK-14234:
--------------------------------------

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12031