[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-16 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174753#comment-14174753
 ] 

Andrew Ash commented on SPARK-3736:
---

The configuration for Hadoop's retry policy was added in HDFS-3504:

{quote}
+   * Return the default retry policy used in RPC.
+   * 
+   * If dfs.client.retry.policy.enabled == false, use TRY_ONCE_THEN_FAIL.
+   * 
+   * Otherwise, first unwrap ServiceException if possible, and then 
+   * (1) use multipleLinearRandomRetry for
+   * - SafeModeException, or
+   * - IOException other than RemoteException, or
+   * - ServiceException; and
+   * (2) use TRY_ONCE_THEN_FAIL for
+   * - non-SafeMode RemoteException, or
+   * - non-IOException.
+   * 
+   * Note that dfs.client.retry.max < 0 is not allowed.
{quote}

From
https://github.com/apache/hadoop/commit/45fafc2b8fc1aab0a082600b0d50ad693491ea70#diff-36b19e9d8816002ed9dff8580055d3fbR44
it looks like the default policy is to retry every 10 seconds for 6 attempts,
and then every 60 seconds for 10 attempts.
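
A minimal sketch of how such a two-stage schedule could be expanded, assuming
the policy is written as comma-separated (sleep-millis, retries) pairs such as
"10000,6,60000,10". The object and method names are hypothetical, and this is
not Hadoop's implementation, which may also randomize each sleep around the
nominal value.

```scala
import scala.annotation.tailrec

object RetrySchedule {
  // One stage of the schedule: wait `sleepMs` before each of `retries` retries.
  final case class Step(sleepMs: Long, retries: Int)

  // Parse a spec such as "10000,6,60000,10" into stages.
  // The "t1,n1,t2,n2,..." layout is an assumption made for this sketch.
  def parse(spec: String): List[Step] = {
    val parts = spec.split(",").map(_.trim).toList
    require(parts.length % 2 == 0, s"expected sleep,retries pairs but got: $spec")
    parts.grouped(2).map(p => Step(p(0).toLong, p(1).toInt)).toList
  }

  // Nominal delay before retry number `attempt` (0-based),
  // or None once every stage has been used up.
  def delayFor(steps: List[Step], attempt: Int): Option[Long] = {
    @tailrec
    def loop(rest: List[Step], remaining: Int): Option[Long] = rest match {
      case Nil => None
      case Step(sleep, n) :: tail =>
        if (remaining < n) Some(sleep) else loop(tail, remaining - n)
    }
    if (attempt < 0) None else loop(steps, attempt)
  }

  // Total nominal waiting time before the policy gives up.
  def totalWaitMs(steps: List[Step]): Long =
    steps.map(s => s.sleepMs * s.retries).sum
}
```

With the schedule described above, totalWaitMs comes to 660000 ms, i.e. about
11 minutes of nominal waiting before the client gives up.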

> Workers should reconnect to Master if disconnected
> --
>
> Key: SPARK-3736
> URL: https://issues.apache.org/jira/browse/SPARK-3736
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.2, 1.1.0
> Reporter: Andrew Ash
> Assignee: Matthew Cheah
> Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some 
> reason it never attempts to reconnect.  In this situation you have to bounce 
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a 
> disconnect, attempt to reconnect at a particular interval until successful (I 
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174049#comment-14174049
 ] 

Apache Spark commented on SPARK-3736:
-

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/2828




[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-14 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171360#comment-14171360
 ] 

Nan Zhu commented on SPARK-3736:


BTW, the Master does not proactively send heartbeats to the Worker.




[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-14 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171358#comment-14171358
 ] 

Nan Zhu commented on SPARK-3736:


If the worker itself times out, the Master removes the worker from
idToWorker.

When the worker resumes later and sends a heartbeat to the Master again, the
Master detects this when it fails to find the worker in idToWorker (search for
"logWarning("Got heartbeat from unregistered worker " + workerId)" in
Master.scala).

You can simply replace the logWarning with logic that sends a message to the
worker asking it to re-register.
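
A minimal sketch of that idea, assuming a simplified Master with no actor
system. Only idToWorker comes from the comment above; ReregisterRequest and
the method names are hypothetical, and the real Master.scala stores richer
worker state than a timestamp.

```scala
import scala.collection.mutable

object MasterSketch {
  // Hypothetical reply that asks a worker to register again.
  final case class ReregisterRequest(masterUrl: String)

  final class Master(masterUrl: String, workerTimeoutMs: Long) {
    // workerId -> time of the last heartbeat; a stand-in for idToWorker.
    private val idToWorker = mutable.HashMap.empty[String, Long]

    def register(workerId: String, nowMs: Long): Unit =
      idToWorker(workerId) = nowMs

    def isRegistered(workerId: String): Boolean =
      idToWorker.contains(workerId)

    // Drop every worker whose last heartbeat is older than the timeout.
    def removeTimedOutWorkers(nowMs: Long): Unit = {
      val dead = idToWorker
        .filter { case (_, last) => nowMs - last > workerTimeoutMs }
        .keys
        .toList
      dead.foreach(id => idToWorker.remove(id))
    }

    // A heartbeat from a known worker refreshes its timestamp.
    // A heartbeat from an unknown worker yields a re-register request
    // instead of only a warning in the log.
    def handleHeartbeat(workerId: String, nowMs: Long): Option[ReregisterRequest] =
      if (idToWorker.contains(workerId)) {
        idToWorker(workerId) = nowMs
        None
      } else {
        Some(ReregisterRequest(masterUrl))
      }
  }
}
```

The worker would treat the returned request as a signal to run its normal
registration path again.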





[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-14 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171331#comment-14171331
 ] 

Matt Cheah commented on SPARK-3736:
---

I was curious whether anyone had feedback on my comment above.




[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-09 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165513#comment-14165513
 ] 

Matt Cheah commented on SPARK-3736:
---

Are the two linked cases above different though?

(1) If the worker itself gets locked up, the master sends a heartbeat but the
worker doesn't respond, and the master drops the connection with the worker.
However, the master doesn't send a message to the worker indicating this
disconnection, so the worker can't know to reconnect. To repro this I set a
breakpoint in the Worker's heartbeat reception code and let the worker time
out. After the worker times out it never receives a DisassociatedEvent,
nor is Worker.masterDisconnected() ever called.

(2) If the master crashes, the Worker receives a DisassociatedEvent and sits
idle. We can fix this by having the Worker actively attempt to reconnect.

Clearly we can address the second case by having the Worker actively try to
reconnect. But how can we address the first case?
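
For the second case, a minimal sketch of a fixed-interval reconnect loop.
It assumes two hypothetical hooks: tryRegister, which returns true once the
master accepts the registration, and sleep, which waits between attempts. A
real worker would more likely schedule retries on its actor system than block
a thread. This does not cover the first case, where the worker never learns
that it was dropped.

```scala
import scala.annotation.tailrec

object WorkerReconnect {
  // Calls tryRegister until it returns true, waiting intervalMs after each
  // failed attempt. Returns the number of attempts that were made.
  def reconnectUntilSuccessful(
      tryRegister: () => Boolean,
      intervalMs: Long,
      sleep: Long => Unit = ms => Thread.sleep(ms)): Int = {
    @tailrec
    def loop(attempt: Int): Int =
      if (tryRegister()) attempt
      else {
        sleep(intervalMs)
        loop(attempt + 1)
      }
    loop(1)
  }
}
```

Passing sleep as a parameter keeps the loop testable without real waiting; the
default simply blocks the calling thread.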




[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-10-08 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164578#comment-14164578
 ] 

Patrick Wendell commented on SPARK-3736:


I spoke a bit offline with [~ilikerps] about this. I think the solution here is
pretty simple: if the worker disconnects, it should just try to re-initialize
the connection to all drivers. It might need some slight refactoring so that on
re-connect it will retry for an infinite number of attempts, plus some checking
to make sure there aren't races.

A good first step would be to get a grasp of how the general fault tolerance
code works around connections (there is a bit of complexity around failover
between masters). Check out the documentation on the Spark website about
standalone fault tolerance. Right now the worker will simply hang out and do
nothing when it loses the connection to the master, because it's expecting
another master to re-connect to it. But this won't occur in the case where
there is master failure.
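
One way the race concern could be handled is a guard that lets only one
reconnect pass run at a time, even if several disconnect notifications arrive.
A minimal sketch, assuming a hypothetical Reconnector class and a hypothetical
tryRegister hook that returns true when a given master accepts the worker. It
makes a single pass over the known masters; retrying indefinitely would be
layered on top by calling it again after a delay.

```scala
import java.util.concurrent.atomic.AtomicBoolean

object ReconnectGuard {
  final class Reconnector(masterUrls: Seq[String], tryRegister: String => Boolean) {
    require(masterUrls.nonEmpty, "at least one master URL is required")

    // True while a reconnect pass is in progress.
    private val reconnecting = new AtomicBoolean(false)

    // Tries each known master in order and returns the first that accepts.
    // Returns None when every master refused, and also when another
    // reconnect pass is already running, so duplicate events are ignored.
    def onDisconnected(): Option[String] =
      if (!reconnecting.compareAndSet(false, true)) {
        None
      } else {
        try {
          masterUrls.find(tryRegister)
        } finally {
          reconnecting.set(false)
        }
      }
  }
}
```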




[jira] [Commented] (SPARK-3736) Workers should reconnect to Master if disconnected

2014-09-29 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152594#comment-14152594
 ] 

Andrew Ash commented on SPARK-3736:
---

I can't tell for sure, but this is possibly related to SPARK-704 or SPARK-1771.
