[ https://issues.apache.org/jira/browse/SPARK-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14174753#comment-14174753 ]
Andrew Ash commented on SPARK-3736:
-----------------------------------

The configuration for Hadoop's retry policy was added in HDFS-3504:

{quote}
+   * Return the default retry policy used in RPC.
+   *
+   * If dfs.client.retry.policy.enabled == false, use TRY_ONCE_THEN_FAIL.
+   *
+   * Otherwise, first unwrap ServiceException if possible, and then
+   * (1) use multipleLinearRandomRetry for
+   *     - SafeModeException, or
+   *     - IOException other than RemoteException, or
+   *     - ServiceException; and
+   * (2) use TRY_ONCE_THEN_FAIL for
+   *     - non-SafeMode RemoteException, or
+   *     - non-IOException.
+   *
+   * Note that dfs.client.retry.max < 0 is not allowed.
{quote}

From https://github.com/apache/hadoop/commit/45fafc2b8fc1aab0a082600b0d50ad693491ea70#diff-36b19e9d8816002ed9dff8580055d3fbR44 it looks like the default policy is to retry every 10 seconds for 6 attempts, and then every 60 seconds for 10 attempts.

> Workers should reconnect to Master if disconnected
> --------------------------------------------------
>
>                 Key: SPARK-3736
>                 URL: https://issues.apache.org/jira/browse/SPARK-3736
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Andrew Ash
>            Assignee: Matthew Cheah
>            Priority: Critical
>
> In standalone mode, when a worker gets disconnected from the master for some
> reason it never attempts to reconnect.  In this situation you have to bounce
> the worker before it will reconnect to the master.
> The preferred alternative is to follow what Hadoop does -- when there's a
> disconnect, attempt to reconnect at a particular interval until successful (I
> think it repeats indefinitely every 10sec).
> This has been observed by:
> - [~pkolaczk] in http://apache-spark-user-list.1001560.n3.nabble.com/Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td6240.html
> - [~romi-totango] in http://apache-spark-user-list.1001560.n3.nabble.com/Re-Workers-disconnected-from-master-sometimes-and-never-reconnect-back-td15335.html
> - [~aash]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
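
For illustration, the default schedule mentioned in the comment (every 10 seconds for 6 attempts, then every 60 seconds for 10 attempts) can be sketched as a small standalone Java class. LinearRetrySchedule and its methods are hypothetical names made up for this sketch; it is not Hadoop's retry code and not anything in Spark, and it leaves out whatever randomization of the sleep time the name multipleLinearRandomRetry suggests.

```java
import java.util.OptionalLong;

/**
 * Hypothetical helper (an assumption for illustration, not a Hadoop or Spark API):
 * a multi-stage retry schedule where each stage has a fixed sleep time and a
 * fixed number of retries.
 */
final class LinearRetrySchedule {
    private final long[] sleepMillis; // sleep time used in each stage
    private final int[] retries;      // number of retries allowed in each stage

    /** Arguments come in pairs: sleepMillis1, retries1, sleepMillis2, retries2, ... */
    LinearRetrySchedule(long... pairs) {
        if (pairs.length == 0 || pairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected (sleepMillis, retries) pairs");
        }
        int stages = pairs.length / 2;
        sleepMillis = new long[stages];
        retries = new int[stages];
        for (int i = 0; i < stages; i++) {
            long sleep = pairs[2 * i];
            long count = pairs[2 * i + 1];
            if (sleep <= 0 || count <= 0 || count > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("sleep time and retry count must be positive");
            }
            sleepMillis[i] = sleep;
            retries[i] = (int) count;
        }
    }

    /** Delay before the given retry (0-based), or empty once every stage is used up. */
    OptionalLong delayMillis(int attempt) {
        if (attempt < 0) {
            throw new IllegalArgumentException("attempt must be >= 0");
        }
        int remaining = attempt;
        for (int i = 0; i < retries.length; i++) {
            if (remaining < retries[i]) {
                return OptionalLong.of(sleepMillis[i]);
            }
            remaining -= retries[i];
        }
        return OptionalLong.empty();
    }

    /** Total time spent sleeping if every retry in every stage is used. */
    long totalWaitMillis() {
        long total = 0;
        for (int i = 0; i < retries.length; i++) {
            total += sleepMillis[i] * retries[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Values taken from the comment: 10 s x 6 attempts, then 60 s x 10 attempts.
        LinearRetrySchedule schedule = new LinearRetrySchedule(10_000, 6, 60_000, 10);
        for (int attempt = 0; schedule.delayMillis(attempt).isPresent(); attempt++) {
            System.out.println("retry " + attempt + ": wait "
                    + schedule.delayMillis(attempt).getAsLong() + " ms");
        }
        System.out.println("total wait: " + schedule.totalWaitMillis() + " ms");
    }
}
```

With those values the schedule gives up after 16 retries and 660 seconds (11 minutes) of waiting in total. The issue asks for workers to keep trying until they reconnect, so a worker-side version would presumably repeat its last stage indefinitely instead of returning empty.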