[ https://issues.apache.org/jira/browse/SPARK-24794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16544973#comment-16544973 ]

Ecaterina commented on SPARK-24794:
-----------------------------------

Yes, I am also facing this problem. It would be nice if somebody could answer this.

> DriverWrapper should have both master addresses in -Dspark.master
> -----------------------------------------------------------------
>
>                 Key: SPARK-24794
>                 URL: https://issues.apache.org/jira/browse/SPARK-24794
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 2.2.1
>            Reporter: Behroz Sikander
>            Priority: Major
>
> In standalone cluster mode, one can launch a driver with supervise mode 
> enabled. Spark launches the driver with a JVM argument -Dspark.master, which 
> is set to the [host and port of the current 
> master|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L149].
>  
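> For illustration, a minimal sketch of the mismatch (the host names and the 
> explicit supervise property below are made up for this example): the 
> application can be submitted against an HA pair of masters, yet the 
> relaunched driver only sees the single master that was active at submission 
> time.
> {code:scala}
> import org.apache.spark.SparkConf
> 
> // What the application submits: a multi-master URL for an HA pair
> // (hypothetical hosts), with supervision enabled so the cluster
> // relaunches the driver if it dies.
> val conf = new SparkConf()
>   .setMaster("spark://master-1:7077,master-2:7077")
>   .set("spark.driver.supervise", "true")
> 
> // What the driver JVM is actually launched with today: only the master
> // that handled the submission, e.g.
> //   -Dspark.master=spark://master-1:7077
> // so after a failover the relaunched driver keeps pointing at a STANDBY master.
> {code}
> 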
> During the life of the context, the Spark masters can fail over for any 
> reason. If the driver then dies unexpectedly and is relaunched, it tries to 
> connect to the master that was initially set via -Dspark.master, but that 
> master is now in STANDBY mode. The context tries multiple times to connect to 
> the standby and then kills itself.
>  
> *Suggestion:*
> While launching the driver process, the Spark master should use the 
> [spark.master passed as 
> input|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/StandaloneRestServer.scala#L124]
>  instead of the host and port of the current master.
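> A rough sketch of the idea (this is not the actual StandaloneRestServer 
> code; the helper name and parameters are illustrative only): prefer the 
> spark.master that came with the submission, and fall back to the current 
> master's URL only when none was supplied.
> {code:scala}
> // Hypothetical helper: keep the full (possibly multi-master) URL from the
> // submitted properties instead of pinning the driver to the master that
> // happened to accept the request.
> def effectiveMaster(sparkProperties: Map[String, String], currentMasterUrl: String): String =
>   sparkProperties.getOrElse("spark.master", currentMasterUrl)
> 
> // effectiveMaster(
> //   Map("spark.master" -> "spark://master-1:7077,master-2:7077"),
> //   "spark://master-1:7077")
> // returns "spark://master-1:7077,master-2:7077", so -Dspark.master keeps both masters.
> {code}
> 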
> Log messages that we observe:
>  
> {code:java}
> 2018-07-11 13:03:21,801 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077..
> .....
> 2018-07-11 13:03:21,806 INFO netty-rpc-connection-0 
> org.apache.spark.network.client.TransportClientFactory []: Successfully 
> created connection to /10.100.100.22:7077 after 1 ms (0 ms spent in 
> bootstraps)
> .....
> 2018-07-11 13:03:41,802 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:04:01,802 INFO appclient-register-master-threadpool-0 
> org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []: 
> Connecting to master spark://10.100.100.22:7077...
> .....
> 2018-07-11 13:04:21,806 ERROR appclient-registration-retry-thread 
> org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []: Application 
> has been killed. Reason: All masters are unresponsive! Giving up.{code}


