[ https://issues.apache.org/jira/browse/SPARK-5497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-5497:
--------------------------------
    Labels: Configuration Deployment Spark bulk-closed start  (was: 
Configuration Deployment Spark start)

> start-all script not working properly on Standalone HA cluster (with 
> Zookeeper)
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-5497
>                 URL: https://issues.apache.org/jira/browse/SPARK-5497
>             Project: Spark
>          Issue Type: Bug
>          Components: Deploy
>    Affects Versions: 1.2.0
>            Reporter: Roque Vassal'lo
>            Priority: Major
>              Labels: Configuration, Deployment, Spark, bulk-closed, start
>
> I have configured a Standalone HA cluster with Zookeeper with:
> - 3 Zookeeper nodes
> - 2 Spark master nodes (1 alive and 1 in standby mode)
> - 2 Spark slave nodes
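> For context, a minimal sketch of the ZooKeeper recovery settings such a setup
> typically relies on in spark-env.sh on both masters (the zk1/zk2/zk3 hostnames
> and the /spark znode path are illustrative placeholders, not values from this
> report):
>
>     # spark-env.sh on both master nodes: enable standalone HA recovery
>     # through ZooKeeper (ZooKeeper hostnames below are placeholders).
>     export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
>       -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
>       -Dspark.deploy.zookeeper.dir=/spark"
>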
> Executing start-all.sh on each master starts that master and starts a worker 
> on each configured slave.
> If the active master goes down, those workers are supposed to reconfigure 
> themselves to use the new active master automatically.
> I have noticed that the spark-env.sh property SPARK_MASTER_IP is used by both 
> of the scripts it calls, start-master.sh and start-slaves.sh.
> The problem is that if you configure SPARK_MASTER_IP with the active master's 
> IP, the workers do not reassign themselves to the new active master when that 
> master goes down. And if you configure SPARK_MASTER_IP with the master cluster 
> address (an approximation of it, because you have to append the master port to 
> all-but-the-last entries, i.e. "master1:7077,master2", to make it work), the 
> slaves start properly but the master does not.
> So the start-master.sh script needs the SPARK_MASTER_IP property to contain 
> the master's own IP in order to start the master properly, while the 
> start-slaves.sh script needs SPARK_MASTER_IP to contain the master cluster's 
> IPs (that is, "master1:7077,master2").
> To test that idea, I have modified the start-slaves.sh and spark-env.sh 
> scripts on the master nodes (see the sketch below):
> - In spark-env.sh, I have set the SPARK_MASTER_IP property to the master's own 
> IP on each master node (that is, SPARK_MASTER_IP=master1 on master node 1 and 
> SPARK_MASTER_IP=master2 on master node 2).
> - In spark-env.sh, I have also added a new property, SPARK_MASTER_CLUSTER_IP, 
> with the pseudo-master-cluster IPs (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) 
> on both masters.
> - In start-slaves.sh, I have changed all references to SPARK_MASTER_IP to 
> SPARK_MASTER_CLUSTER_IP.
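> A minimal sketch of those changes, assuming hostnames master1/master2 and the 
> default port 7077 (the exact line in start-slaves.sh may differ between Spark 
> versions, so treat the before/after below as approximate):
>
>     # spark-env.sh on master node 1 (use SPARK_MASTER_IP=master2 on master node 2)
>     export SPARK_MASTER_IP=master1
>     export SPARK_MASTER_CLUSTER_IP=master1:7077,master2
>
>     # start-slaves.sh builds the master URL handed to each worker roughly like
>     # this; it appends ":$SPARK_MASTER_PORT" to the variable, which is why the
>     # cluster value above omits the port on its last entry.
>     # before:
>     #   "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT"
>     # after:
>     #   "$sbin/slaves.sh" cd "$SPARK_HOME" \; "$sbin/start-slave.sh" 1 "spark://$SPARK_MASTER_CLUSTER_IP:$SPARK_MASTER_PORT"
>
> With those values, the URL handed to each worker becomes 
> spark://master1:7077,master2:7077, the multi-master form described in the 
> standalone HA documentation.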
> I have tried this and it works great! When the active master node goes down, 
> all workers reassign themselves to the new active master.
> Maybe there is a better fix for this issue.
> Hope this quick-fix idea can help.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
