[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task

Andrew Ash (JIRA) Fri, 28 Jul 2017 13:23:22 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Andrew Ash updated SPARK-21564:
-------------------------------
    Description: 
cc [~robert3005]

I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readUTF(DataInputStream.java:609)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
    at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
    at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
    at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]

  was:
I was seeing an issue where Spark was throwing this exception:

{noformat}
16:16:28.294 [dispatcher-event-loop-14] ERROR org.apache.spark.rpc.netty.Inbox 
- Ignoring error
java.io.EOFException: null
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readUTF(DataInputStream.java:609)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
    at 
org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at 
org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
    at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
    at 
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at 
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

For details on the cause of that exception, see SPARK-21563

We've since changed the application and have a proposed fix in Spark at the 
ticket above, but it was troubling that decoding the TaskDescription wasn't 
failing the tasks.  So the Spark job ended up hanging and making no progress, 
permanently stuck, because the driver thinks the task is running but the thread 
has died in the executor.

We should make a change around 
https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
 so that when that decode throws an exception, the task is marked as failed.

cc [~kayousterhout] [~irashid]


> TaskDescription decoding failure should fail the task
> -----------------------------------------------------
>
>                 Key: SPARK-21564
>                 URL: https://issues.apache.org/jira/browse/SPARK-21564
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Andrew Ash
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at java.io.DataInputStream.readUTF(DataInputStream.java:609)
>     at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>     at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
>     at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
>     at scala.collection.immutable.Range.foreach(Range.scala:160)
>     at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
>     at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
>     at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>     at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> For details on the cause of that exception, see SPARK-21563
> We've since changed the application and have a proposed fix in Spark at the 
> ticket above, but it was troubling that decoding the TaskDescription wasn't 
> failing the tasks.  So the Spark job ended up hanging and making no progress, 
> permanently stuck, because the driver thinks the task is running but the 
> thread has died in the executor.
> We should make a change around 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
>  so that when that decode throws an exception, the task is marked as failed.
> cc [~kayousterhout] [~irashid]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-21564) TaskDescription decoding failure should fail the task

Reply via email to