[ https://issues.apache.org/jira/browse/SPARK-768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038288#comment-14038288 ]
Raymond Liu commented on SPARK-768:
-----------------------------------

Hi Reynold,

I am trying to figure out this issue. Here is my understanding: when the situation you mentioned happens, it means the block is stored at a memory storage level without serialization; otherwise, the exception would already have been thrown in previous steps. Under this condition, I can think of two cases that might run into this problem:

1. The RDD is cached in memory and, as you mentioned, the task gets run on another node. In this case, it seems to me that the remote fetch operation of the BlockManager will catch the exception in the ConnectionManager and return None to the CacheManager, and the task then goes down the compute code path. This leads to recomputation and a second copy of the block being stored, but it does not hang the task, and the job eventually completes. I have written some test cases to verify this. For this case, might we find some way to optimize it?

2. You are using BlockRDD in the DStream case, and the storage level is memory. Then, when the BlockRDD is computed on another node, the exception is thrown. In this case, I think the Task Executor will catch the exception and fail the task?

So in either case, it seems to me that the job will eventually finish. I am wondering which case I am missing here that would lead to the task hanging. Can you kindly give me an example?

> Fail a task when the remote block it is fetching is not serializable
> --------------------------------------------------------------------
>
>                 Key: SPARK-768
>                 URL: https://issues.apache.org/jira/browse/SPARK-768
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> When a task is fetching a remote block (e.g. locality wait exceeded), and if
> the block is not serializable, the task would hang.
> The block manager should fail the task instead of hanging the task ... once
> the task fails, eventually it will get scheduled to the local node to be
> executed successfully.

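To make case 1 from the comment concrete, here is a minimal Python sketch of the fallback behaviour as I understand it. It is a toy model, not Spark code: `Node`, `fetch_remote`, and `get_or_compute` are hypothetical names, and `pickle` merely stands in for whatever serializer the remote fetch would use.

```python
import pickle
import threading


class Node:
    """Hypothetical stand-in for an executor holding deserialized blocks in memory."""

    def __init__(self, name):
        self.name = name
        self.memory_store = {}


def fetch_remote(owner, block_id):
    """Try to serialize a block held by `owner`; return None on any failure.

    Assumption: the serialization error is caught on the fetch path and
    reported to the caller as "block not found" rather than re-raised.
    """
    if block_id not in owner.memory_store:
        return None
    try:
        payload = pickle.dumps(owner.memory_store[block_id])
    except (pickle.PicklingError, TypeError, AttributeError):
        return None
    return pickle.loads(payload)


def get_or_compute(local, owner, block_id, compute):
    """Return the block from the local store, a remote fetch, or recomputation."""
    if block_id in local.memory_store:
        return local.memory_store[block_id]
    value = fetch_remote(owner, block_id)
    if value is None:
        # Fallback path: recompute and cache locally, which leaves a
        # second copy of the block in the cluster.
        value = compute()
        local.memory_store[block_id] = value
    return value


if __name__ == "__main__":
    owner, local = Node("A"), Node("B")
    # A lock cannot be pickled, so this block is "not serializable".
    owner.memory_store["rdd_0_0"] = [1, threading.Lock()]
    print(get_or_compute(local, owner, "rdd_0_0", lambda: [1, 2, 3]))
```

Under these assumptions the task on node B does not hang: the fetch fails quietly, the partition is recomputed, and both nodes end up holding a copy of the block.
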
-- This message was sent by Atlassian JIRA (v6.2#6252)