[ https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544973#comment-16544973 ]
Ecaterina commented on SPARK-24794:
-----------------------------------

Yes, I also face this problem. It would be nice if somebody could answer this.

> DriverWrapper should have both master addresses in -Dspark.master
> -----------------------------------------------------------------
>
>                 Key: SPARK-24794
>                 URL: https://issues.apache.org/jira/browse/SPARK-24794
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 2.2.1
>            Reporter: Behroz Sikander
>            Priority: Major
>
> In standalone cluster mode, a driver can be launched with supervise mode
> enabled. Spark launches the driver with the JVM argument -Dspark.master,
> which is set to the [host and port of the current
> master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].
>
> During the lifetime of the context, the active Spark master can change for
> any reason, e.g. a failover. If the driver then dies unexpectedly and is
> restarted by the supervisor, it tries to connect to the master that was set
> initially via -Dspark.master, but that master is now in STANDBY mode. The
> context retries the connection to the standby master several times and then
> kills itself.
>
> *Suggestion:*
> While launching the driver process, the Spark master should use the
> [spark.master passed as
> input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124],
> which can list both masters, instead of the host and port of the current
> master.
>
> Log messages that we observe:
>
> {code:java}
> 2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:03:21,806 INFO netty-rpc-connection-0
> org.apache.spark.network.client.TransportClientFactory []: Successfully
> created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in
> bootstraps)
> .....
> 2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application
> has been killed. Reason: All masters are unresponsive! Giving up.{code}
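For illustration, here is a minimal Scala sketch of the suggested behaviour. It is not the actual Spark patch: the object MultiMasterDriverConf, the method buildDriverConf, its parameters, and the second master address 10.100.100.23 are hypothetical stand-ins for the values available inside StandaloneRestServer.buildDriverDescription. The idea is to keep the client-submitted spark.master, which in an HA setup lists every master, instead of overwriting it with the single address that happened to be ACTIVE at submission time.

{code:scala}
import org.apache.spark.SparkConf

object MultiMasterDriverConf {
  // Hypothetical stand-in for the conf construction inside
  // StandaloneRestServer.buildDriverDescription: sparkProperties are the
  // properties the client submitted; activeMasterUrl is the address of the
  // master that is ACTIVE right now.
  def buildDriverConf(sparkProperties: Map[String, String],
                      activeMasterUrl: String): SparkConf = {
    // Prefer the client-submitted spark.master, which in an HA setup lists
    // every master ("spark://host1:7077,host2:7077"), so a supervised driver
    // restarted after a failover can still reach the new ACTIVE master.
    val masters = sparkProperties.getOrElse("spark.master", activeMasterUrl)
    new SparkConf(false)
      .setAll(sparkProperties)
      .set("spark.master", masters)
  }

  def main(args: Array[String]): Unit = {
    val conf = buildDriverConf(
      Map("spark.master" -> "spark://10.100.100.22:7077,10.100.100.23:7077"),
      activeMasterUrl = "spark://10.100.100.22:7077")
    // Prints both masters, not just the currently active one:
    // spark://10.100.100.22:7077,10.100.100.23:7077
    println(conf.get("spark.master"))
  }
}
{code}

The multi-master URL form (spark://host1:port1,host2:port2) is already accepted by the standalone client and is what lets an application fail over to whichever master is currently ACTIVE; whether the submitted properties reach buildDriverDescription unchanged is an assumption of this sketch.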