[ 
https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038288#comment-14038288
 ] 

Raymond Liu commented on SPARK-768:
-----------------------------------

Hi Reynold,

I am trying to figure out this issue. Here is my understanding: when the 
situation you mentioned happens, it means the block is stored at a memory 
storage level without serialization; otherwise, the exception would already 
have been thrown in an earlier step. Under this condition, I can see two cases 
which might run into this problem:

1. The RDD is cached in memory and, as you mentioned, the task gets run on 
another node. In this case, it seems to me that the remote fetch operation of 
the BlockManager will catch the exception in the ConnectionManager and return 
None to the CacheManager, so the task falls back to the compute code path. 
Although this leads to redundant computation and a second copy of the block 
being stored, it does not hang the task, and the job eventually finishes. I 
have written some test cases to verify this. For this case, we might find some 
way to optimize it?

2. You are using BlockRDD in the DStream case, and the storage level is 
memory. Then, upon computing the BlockRDD on another node, the exception is 
thrown, but in this case I think the task executor will catch the exception 
and fail the task?
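To make case 1 concrete, here is a minimal sketch of the fetch-or-recompute fallback I described, assuming simplified stand-ins (fetchRemote, getOrCompute) rather than the actual BlockManager/CacheManager APIs:

```scala
// Simplified sketch of the fetch-or-recompute fallback in case 1.
// fetchRemote and getOrCompute are illustrative names, not real Spark APIs.
object FetchFallbackSketch {
  // Stands in for the BlockManager's remote fetch: on a fetch failure
  // (e.g. the block is not serializable), the exception is caught in the
  // ConnectionManager layer and None is returned instead of hanging.
  def fetchRemote(blockId: String, fetchSucceeds: Boolean): Option[Seq[Int]] =
    if (fetchSucceeds) Some(Seq(1, 2, 3)) else None

  // Stands in for the CacheManager: if the remote fetch returns None,
  // fall back to recomputing the partition locally (a second copy of the
  // block then gets stored, but the task does not hang).
  def getOrCompute(blockId: String, fetchSucceeds: Boolean)
                  (compute: => Seq[Int]): Seq[Int] =
    fetchRemote(blockId, fetchSucceeds).getOrElse(compute)

  def main(args: Array[String]): Unit = {
    // Either path yields the partition data, so the job still finishes.
    val viaFetch     = getOrCompute("rdd_0_0", fetchSucceeds = true)(Seq(1, 2, 3))
    val viaRecompute = getOrCompute("rdd_0_0", fetchSucceeds = false)(Seq(1, 2, 3))
    assert(viaFetch == viaRecompute)
    println(viaRecompute)
  }
}
```

This is only meant to show why case 1 over-computes rather than hangs; the real code paths involve the BlockManager, ConnectionManager, and CacheManager classes.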

So it seems to me that either case will eventually finish the job. I am 
wondering what case I am missing that would lead to the task hanging. Could 
you kindly give me an example?

> Fail a task when the remote block it is fetching is not serializable
> --------------------------------------------------------------------
>
>                 Key: SPARK-768
>                 URL: https://issues.apache.org/jira/browse/SPARK-768
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if 
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once 
> the task fails, eventually it will get scheduled to the local node to be 
> executed successfully. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)