[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-08-31 Thread Barry Becker (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453414#comment-15453414 ]

Barry Becker commented on SPARK-14234:
--

Is it a lot of work to backport this fix to 1.6.3?
We have an app that requires it. We also depend on job-server, which does not 
look like it will support 2.0.0 anytime soon.


> Executor crashes for TaskRunner thread interruption
> ---
>
> Key: SPARK-14234
> URL: https://issues.apache.org/jira/browse/SPARK-14234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Devaraj K
>Assignee: Devaraj K
> Fix For: 2.0.0
>
>
> If the TaskRunner thread gets interrupted while running, due to a task kill or 
> any other reason, the interrupted thread tries to update the task status as 
> part of the exception handling and fails with the exception below. This happens 
> in the statusUpdate calls of all of the following catch blocks; the 
> corresponding exceptions for each catch case are shown after the snippet.
> {code:title=Executor.scala|borderStyle=solid}
> case _: TaskKilledException | _: InterruptedException if task.killed =>
>   ..
> case cDE: CommitDeniedException =>
>   ..
> case t: Throwable =>
>   ..
> {code}
> {code:xml}
> 16/03/29 17:32:33 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
>   at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
>   at org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>   at org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
>   at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
>   at org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   ... 2 more
> {code}
> {code:xml}
> 16/03/29 08:00:29 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>   at java.util.concurrent.ThreadPoolExecutor.runWorker
> {code}
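
For readers new to this failure mode, the root cause is JDK NIO semantics rather than anything Spark-specific: if a thread's interrupt flag is set when it performs I/O on an interruptible channel, the JDK closes the channel and throws ClosedByInterruptException. That is exactly what the SerializableBuffer write in the trace above runs into, and the resulting uncaught java.lang.Error is what makes SparkUncaughtExceptionHandler take the executor down. A minimal standalone sketch of the JDK behavior (hypothetical demo code, not Spark source):

{code:title=InterruptDemo.scala|borderStyle=solid}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.{Channels, ClosedByInterruptException}

object InterruptDemo {
  def main(args: Array[String]): Unit = {
    // Channels.newChannel wraps the stream in an interruptible channel,
    // the same kind SerializableBuffer.writeObject writes through.
    val channel = Channels.newChannel(new ByteArrayOutputStream())
    // Simulate the state of a killed task: the interrupt flag is set.
    Thread.currentThread().interrupt()
    try {
      channel.write(ByteBuffer.wrap(Array[Byte](1, 2, 3)))
      println("write succeeded (only happens if the flag was not set)")
    } catch {
      case e: ClosedByInterruptException =>
        // The same exception that propagates out of statusUpdate above.
        println(s"write failed: $e")
    }
  }
}
{code}
Remove the interrupt() call and the write completes normally, which is why statusUpdate only fails for tasks whose threads were interrupted.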

[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-07-15 Thread JIRA

[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379041#comment-15379041 ]

Josef Lindman Hörnlund commented on SPARK-14234:
--

+1 for backporting


[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-05-24 Thread Barry Becker (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298333#comment-15298333 ]

Barry Becker commented on SPARK-14234:
--

Will this fix be back-ported to 1.6.x?
We are encountering what appears to be the same issue when using Spark 1.6.1 
and jobserver 0.6.2.

Looking into the logs, we narrowed the problem down to the killing of a task, 
and we can reliably reproduce it by killing two tasks in a row. It appears that 
the Mesos slave gets blacklisted after the repeated failure and never comes 
back up.
The first time a task is killed, we see this in the spark-job-server.log file:
{code}
[2016-04-22 10:11:56,919] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af killed
[2016-04-22 10:11:56,921] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id 0ecdbe5a-bde1-4818-ba24-b5af0fbee5af
[2016-04-22 10:11:56,920] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 99
[2016-04-22 10:11:56,920] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 99 was cancelled
[2016-04-22 10:11:56,924] INFO he.spark.executor.Executor [] [] - Executor is trying to kill task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,924] INFO he.spark.executor.Executor [] [] - Executor is trying to kill task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO he.spark.executor.Executor [] [] - Executor killed task 1.0 in stage 99.0 (TID 737)
[2016-04-22 10:11:56,925] INFO he.spark.executor.Executor [] [] - Executor killed task 0.0 in stage 99.0 (TID 736)
[2016-04-22 10:11:56,933] ERROR rkUncaughtExceptionHandler [] [] - Uncaught exception in thread Thread[Executor task launch worker-25,5,main]
java.lang.Error: java.nio.channels.ClosedByInterruptException
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedByInterruptException
{code}
A few minutes later, another task gets killed:
{code}
[2016-04-22 10:16:49,890] INFO k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - Job cf0c58e9-6496-4d5d-8a6f-0072ca742e33 killed
[2016-04-22 10:16:49,891] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Cancelling stage 101
[2016-04-22 10:16:49,891] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Stage 101 was cancelled
[2016-04-22 10:16:49,892] ERROR k.jobserver.JobStatusActor [] [akka://JobServer/user/context-supervisor/sql-context/$a] - No such job id cf0c58e9-6496-4d5d-8a6f-0072ca742e33
[2016-04-22 10:16:50,254] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Lost executor 20160216-173849-2066065046-5050-48639-S0 on ra.engr.sgi.com: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 0.0 in stage 101.0 (TID 738, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] WARN k.scheduler.TaskSetManager [] [akka://JobServer/user/context-supervisor/sql-context] - Lost task 1.0 in stage 101.0 (TID 739, ra.engr.sgi.com): ExecutorLostFailure (executor 20160216-173849-2066065046-5050-48639-S0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2016-04-22 10:16:50,254] INFO cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/sql-context] - Removed TaskSet 101.0, whose tasks have all completed, from pool
[2016-04-22 10:16:50,255] INFO BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Trying to remove executor 20160216-173849-2066065046-5050-48639-S0 from BlockManagerMaster.
[2016-04-22 10:16:50,255] INFO BlockManagerMasterEndpoint [] [akka://JobServer/user/context-supervisor/sql-context] - Removing block manager BlockManagerId(20160216-173849-2066065046-5050-48639-S0, ra.engr.sgi.com, 46374)
[2016-04-22 10:16:50,255] INFO storage.BlockManagerMaster [] [akka://JobServer/user/context-supervisor/sql-context] - Removed 20160216-173849-2066065046-5050-48639-S0 successfully
{code}
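
For anyone trying to reproduce the kill path outside of jobserver, here is a hypothetical standalone sketch (illustrative only, not the jobserver code) that triggers the same task-interrupt sequence shown in the logs above:

{code:title=KillRepro.scala|borderStyle=solid}
// Hypothetical reproduction sketch: cancelling a job group with
// interruptOnCancel = true interrupts the running TaskRunner threads,
// which is the kill path the logs above show.
import org.apache.spark.{SparkConf, SparkContext}

object KillRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kill-repro").setMaster("local[2]")
    val sc = new SparkContext(conf)
    sc.setJobGroup("victim", "long-running job", interruptOnCancel = true)

    // Cancel the job group from another thread while the tasks are running.
    new Thread(new Runnable {
      override def run(): Unit = {
        Thread.sleep(2000)
        sc.cancelJobGroup("victim")
      }
    }).start()

    try {
      // Tasks sleep long enough to still be running when the cancel arrives.
      sc.parallelize(1 to 2, 2).foreach(_ => Thread.sleep(60000))
    } catch {
      case e: Exception => println(s"job was cancelled: ${e.getMessage}")
    }
    sc.stop()
  }
}
{code}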

[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-03-29 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215679#comment-15215679 ]

Apache Spark commented on SPARK-14234:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/12031
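
Editorial note: the fix went into 2.0.0 (see the Fix Version above). As I read the issue, the remedy is to keep the thread's interrupt flag from reaching the channel write that backs the status update. A minimal standalone demo (hypothetical code, not the actual patch) of why clearing the flag avoids the crash:

{code:title=ClearInterruptDemo.scala|borderStyle=solid}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.Channels

object ClearInterruptDemo {
  def main(args: Array[String]): Unit = {
    val channel = Channels.newChannel(new ByteArrayOutputStream())
    // The state a killed task's thread is in: interrupt flag set.
    Thread.currentThread().interrupt()
    // Thread.interrupted() clears the flag and returns its previous value,
    // so the following channel write no longer throws
    // ClosedByInterruptException.
    val wasInterrupted = Thread.interrupted()
    val written = channel.write(ByteBuffer.wrap(Array[Byte](1, 2, 3)))
    println(s"wasInterrupted=$wasInterrupted, wrote $written bytes without error")
  }
}
{code}
See the pull request above for how the actual change is scoped in Executor.scala.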
