[ 
https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174753#comment-14174753
 ] 

Andrew Ash commented on SPARK-3736:
-----------------------------------

The configuration for Hadoop's retry policy was added in HDFS-3504

{quote}
+   * Return the default retry policy used in RPC.
+   * 
+   * If dfs.client.retry.policy.enabled == false, use TRY_ONCE_THEN_FAIL.
+   * 
+   * Otherwise, first unwrap ServiceException if possible, and then 
+   * (1) use multipleLinearRandomRetry for
+   *     - SafeModeException, or
+   *     - IOException other than RemoteException, or
+   *     - ServiceException; and
+   * (2) use TRY_ONCE_THEN_FAIL for
+   *     - non-SafeMode RemoteException, or
+   *     - non-IOException.
+   *     
+   * Note that dfs.client.retry.max < 0 is not allowed.
{quote}

>From 
>https://github.com/apache/hadoop/commit/45fafc2b8fc1aab0a082600b0d50ad693491ea70#diff-36b19e9d8816002ed9dff8580055d3fbR44
> it looks like the default policy is to retry every 10 seconds for 6 attempts, 
>and then every 60 seconds for 10 attempts.

> Workers should reconnect to Master if disconnected
> --------------------------------------------------
>
>                 Key: SPARK-3736
>                 URL: https://issues.apache.org/jira/browse/SPARK-3736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Matthew Cheah
>            Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some 
> reason it never attempts to reconnect.  In this situation you have to bounce 
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a 
> disconnect, attempt to reconnect at a particular interval until successful (I 
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in 
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to