Roque Vassal'lo created SPARK-5497:
--------------------------------------

             Summary: start-all script not working properly on Standalone HA cluster (with ZooKeeper)
                 Key: SPARK-5497
                 URL: https://issues.apache.org/jira/browse/SPARK-5497
             Project: Spark
          Issue Type: Bug
          Components: Deploy
    Affects Versions: 1.2.0
            Reporter: Roque Vassal'lo


I have configured a Standalone HA cluster with ZooKeeper, consisting of:
- 3 Zookeeper nodes
- 2 Spark master nodes (1 alive and 1 in standby mode)
- 2 Spark slave nodes
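
For context, the HA-related part of spark-env.sh on each master looks roughly like this (the ZooKeeper hostnames are illustrative, not my actual ones):

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"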

Executing start-all.sh on a master starts that master and a worker on each configured slave.
If the active master goes down, those workers are supposed to reconfigure themselves to use the new active master automatically.

I have noticed that the spark-env.sh property SPARK_MASTER_IP is used by both scripts that start-all.sh calls: start-master.sh and start-slaves.sh.

The problem is that if you set SPARK_MASTER_IP to the active master's IP, the workers don't reassign themselves to the new active master when it goes down.
And if you set SPARK_MASTER_IP to the master cluster address list (well, an approximation of it, because you have to write the master port into all-but-the-last entries, i.e. "master1:7077,master2", to make it work), the slaves start properly but the master doesn't.

So start-master.sh needs SPARK_MASTER_IP to contain the local master's own IP in order to start the master properly, while start-slaves.sh needs SPARK_MASTER_IP to contain the master cluster addresses (that is, "master1:7077,master2").
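
To make the conflict concrete, this is roughly what each script wants to see (master1/master2 and port 7077 are just the illustrative values used above):

    # What start-master.sh needs on master node 1 (its own address):
    SPARK_MASTER_IP=master1

    # What start-slaves.sh needs, so workers register against the whole master cluster:
    SPARK_MASTER_IP=master1:7077,master2

One variable, two incompatible values.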

To test that idea, I modified the start-slaves.sh and spark-env.sh scripts on the master nodes (sketched below):
- In spark-env.sh, I set SPARK_MASTER_IP to the master's own IP on each master node (that is, SPARK_MASTER_IP=master1 on master node 1 and SPARK_MASTER_IP=master2 on master node 2).
- In spark-env.sh, I added a new property, SPARK_MASTER_CLUSTER_IP, with the master cluster addresses (SPARK_MASTER_CLUSTER_IP=master1:7077,master2) on both masters.
- In start-slaves.sh, I replaced all references to SPARK_MASTER_IP with SPARK_MASTER_CLUSTER_IP.
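
Roughly, the resulting configuration looks like this (a sketch, not a patch; the exact line in start-slaves.sh that builds the spark:// URL may differ slightly between versions):

    # spark-env.sh on master node 1
    SPARK_MASTER_IP=master1
    SPARK_MASTER_CLUSTER_IP=master1:7077,master2

    # spark-env.sh on master node 2
    SPARK_MASTER_IP=master2
    SPARK_MASTER_CLUSTER_IP=master1:7077,master2

    # start-slaves.sh: every use of SPARK_MASTER_IP is replaced by SPARK_MASTER_CLUSTER_IP,
    # so the URL handed to the workers becomes
    #   spark://master1:7077,master2:7077
    # instead of
    #   spark://master1:7077
    # (the script only appends the port at the end of the value, which is why the
    # port has to be written explicitly for all-but-the-last entries above)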
I have tried this and it works great! When the active master node goes down, all workers reassign themselves to the new active master.

Maybe there is a better fix for this issue, but I hope this quick-fix idea helps.


