[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165513#comment-14165513
 ] 

Matt Cheah commented on SPARK-3736:
-----------------------------------

Are the two linked cases above different though?

(1) If the worker itself gets locked up, the master sends a heartbeat, the
worker doesn't respond, and the master drops its connection to the worker.
However, the master doesn't send the worker any message indicating this
disconnection, so the worker can't know that it needs to reconnect. To repro
this I set a breakpoint in the Worker's heartbeat reception code and let the
worker time out. After the timeout, the worker never receives a
DisassociatedEvent, nor is Worker.masterDisconnected() ever called.

(2) If the master crashes, the Worker receives a DisassociatedEvent and sits
idle. We can fix this by having the Worker actively attempt to reconnect.
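
For illustration only, here is a minimal Scala sketch of the fixed-interval
retry idea for case (2). It is not Spark's Worker code: ReconnectSketch,
reconnect, tryRegister, intervalMs, maxAttempts and sleep are all hypothetical
names, and tryRegister stands in for whatever the Worker would actually do to
re-register with the master.

```scala
import scala.annotation.tailrec

// Hypothetical helper, not part of Spark: calls `tryRegister` at a fixed
// interval until it reports success or `maxAttempts` attempts have been made.
object ReconnectSketch {

  /** Returns Some(attempts used) on success, or None if every attempt failed.
    * At least one attempt is always made. `sleep` is injectable so the loop
    * can be exercised without real delays. */
  def reconnect(
      tryRegister: () => Boolean,
      intervalMs: Long,
      maxAttempts: Int,
      sleep: Long => Unit = ms => Thread.sleep(ms)): Option[Int] = {
    @tailrec
    def loop(attempt: Int): Option[Int] =
      if (tryRegister()) Some(attempt)       // registered, stop retrying
      else if (attempt >= maxAttempts) None  // give up, no trailing sleep
      else {
        sleep(intervalMs)                    // wait before the next attempt
        loop(attempt + 1)
      }
    loop(1)
  }
}
```

Passing Int.MaxValue as maxAttempts approximates retrying indefinitely. A real
fix would presumably schedule the retries on the Worker's actor system instead
of blocking a thread like this.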

Clearly we can address the second case by having the Worker actively try to
reconnect. But how can we address the first case?

> Workers should reconnect to Master if disconnected
> --------------------------------------------------
>
>                 Key: SPARK-3736
>                 URL: https://issues.apache.org/jira/browse/SPARK-3736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Matthew Cheah
>            Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some
> reason, it never attempts to reconnect.  In this situation you have to bounce
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a
> disconnect, attempt to reconnect at a particular interval until successful (I
> think it repeats indefinitely every 10 seconds).
> This has been observed by:
> - [~pkolaczk] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]



