[ https://issues.apache.org/jira/browse/SPARK-21564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294347#comment-17294347 ]

Sandeep Katta commented on SPARK-21564:
---------------------------------------

Recently I hit this decode error. The irony is that all the other tasks in the same 
taskset deserialized fine; only one task failed.

That points to corrupted data, and most of the time it is very difficult to work 
out why the data got corrupted. For this kind of intermittent issue, exception 
handling should be in place so the job stays fault tolerant.

 

{noformat}
21/02/11 07:53:39 ERROR Inbox: Ignoring error
java.io.UTFDataFormatException: malformed input around byte 5
    at java.io.DataInputStream.readUTF(DataInputStream.java:656)
    at java.io.DataInputStream.readUTF(DataInputStream.java:564)
    at org.apache.spark.scheduler.TaskDescription$$anonfun$deserializeStringLongMap$1.apply$mcVI$sp(TaskDescription.scala:110)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at org.apache.spark.scheduler.TaskDescription$.deserializeStringLongMap(TaskDescription.scala:109)
    at org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:125)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:100)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
{noformat}
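
For context on what "malformed input around byte 5" means: DataInputStream.readUTF 
reads a 2-byte length followed by that many bytes of modified UTF-8, and it gives up 
as soon as a byte claims to start a multi-byte sequence that is not actually there. 
A minimal, self-contained sketch of that failure mode (plain JDK, no Spark classes; 
the string and the flipped byte are made up for illustration, and the matching byte 
offset is just a coincidence of where the corruption lands):

{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object CorruptUtfRepro {
  def main(args: Array[String]): Unit = {
    // DataOutputStream.writeUTF emits a 2-byte length plus modified UTF-8 bytes,
    // which is also how the strings inside a serialized TaskDescription are read back.
    val buf = new ByteArrayOutputStream()
    new DataOutputStream(buf).writeUTF("taskName")
    val bytes = buf.toByteArray

    // Flip one payload byte so it looks like the start of a 3-byte UTF-8 sequence
    // with no continuation bytes -- i.e. a corrupted payload.
    bytes(5) = 0xE0.toByte

    // Throws java.io.UTFDataFormatException: malformed input around byte 5
    new DataInputStream(new ByteArrayInputStream(bytes)).readUTF()
  }
}
{code}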

 

!image-2021-03-03-13-02-31-744.png!

 

So it's better to have fault tolerance in place here. The Spark driver has no idea 
this exception was thrown, so it keeps waiting for the task to complete and the job 
ends up in a zombie state.
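
A rough sketch of the kind of handling this ticket asks for, around the LaunchTask 
case in CoarseGrainedExecutorBackend.receive (the frame that shows up in both stack 
traces). The surrounding names (LaunchTask, TaskDescription.decode, exitExecutor, 
executor.launchTask) are Spark's; the try/catch branch is only an illustration of 
one possible approach, not an actual patch, and it is a fragment of the existing 
receive handler rather than standalone code:

{code:scala}
// Fragment of CoarseGrainedExecutorBackend.receive, sketching one way to stop a
// corrupted TaskDescription payload from being silently swallowed by the Inbox.
override def receive: PartialFunction[Any, Unit] = {
  case LaunchTask(data) =>
    if (executor == null) {
      exitExecutor(1, "Received LaunchTask command but executor was null")
    } else {
      try {
        val taskDesc = TaskDescription.decode(data.value)
        logInfo("Got assigned task " + taskDesc.taskId)
        executor.launchTask(this, taskDesc)
      } catch {
        case scala.util.control.NonFatal(e) =>
          // Illustration only: the payload is corrupt, so even the taskId may not be
          // recoverable. Failing fast makes the driver see a lost executor and
          // reschedule, instead of waiting forever on a task it believes is running.
          // Whether to exit, or to somehow report the single task as FAILED, is
          // exactly the design question this ticket raises.
          exitExecutor(1, "Unable to decode TaskDescription, cannot launch task", e)
      }
    }
  // ... remaining cases of receive unchanged ...
}
{code}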

 

CC [~dongjoon] [~hyukjin.kwon] [~cloud_fan] tagging you guys for more traction

> TaskDescription decoding failure should fail the task
> -----------------------------------------------------
>
>                 Key: SPARK-21564
>                 URL: https://issues.apache.org/jira/browse/SPARK-21564
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Andrew Ash
>            Priority: Major
>              Labels: bulk-closed
>         Attachments: image-2021-03-03-13-02-06-669.png, 
> image-2021-03-03-13-02-31-744.png
>
>
> cc [~robert3005]
> I was seeing an issue where Spark was throwing this exception:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at java.io.DataInputStream.readUTF(DataInputStream.java:609)
>     at java.io.DataInputStream.readUTF(DataInputStream.java:564)
>     at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
>     at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
>     at scala.collection.immutable.Range.foreach(Range.scala:160)
>     at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
>     at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
>     at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>     at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>     at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>     at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> For details on the cause of that exception, see SPARK-21563.
> We've since changed the application and have a proposed fix in Spark at the 
> ticket above, but it was troubling that decoding the TaskDescription wasn't 
> failing the tasks.  So the Spark job ended up hanging and making no progress, 
> permanently stuck, because the driver thinks the task is running but the 
> thread has died in the executor.
> We should make a change around 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L96
>  so that when that decode throws an exception, the task is marked as failed.
> cc [~kayousterhout] [~irashid]


