Nicholas Chammas created SPARK-5189:
---------------------------------------

             Summary: Reorganize EC2 scripts so that nodes can be provisioned 
independent of Spark master
                 Key: SPARK-5189
                 URL: https://issues.apache.org/jira/browse/SPARK-5189
             Project: Spark
          Issue Type: Improvement
          Components: EC2
            Reporter: Nicholas Chammas


As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
# It's more complicated to add slaves to an existing cluster (à la [SPARK-2008]), since slaves are configured by the master only as part of the master's own setup.
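To make the first problem concrete, here is a back-of-the-envelope cost model (with made-up, illustrative numbers; not measurements) contrasting a serial master-to-slave copy with each slave fetching its own files in parallel:

```python
# Illustrative cost model only; the per-node times are invented, not measured.

def serial_broadcast_time(num_slaves, seconds_per_slave=5.0):
    """Master copies files to each slave in turn: wall time grows with cluster size."""
    return num_slaves * seconds_per_slave

def parallel_pull_time(num_slaves, seconds_per_fetch=5.0):
    """Every slave fetches its own files concurrently: wall time is ~one fetch."""
    return seconds_per_fetch

for n in (10, 100, 500):
    print(n, serial_broadcast_time(n), parallel_pull_time(n))
```

The point of the model is just that the master-driven copy is linear in the number of slaves, while slave-side fetching is roughly flat.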

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster
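The three operations above could map onto a small API in {{spark_ec2.py}}. The sketch below uses a hypothetical in-memory cluster model purely to pin down the intended semantics; none of these names exist in the codebase today:

```python
# Hypothetical sketch of the three cluster operations; not actual spark-ec2 code.

class Cluster:
    """Minimal stand-in for a running cluster: one master, many slaves."""
    def __init__(self):
        self.master = None
        self.slaves = []

def provision_node(name):
    """Stand up a generic Spark node that can later serve as master OR slave."""
    return {"name": name, "provisioned": True}

def join_to_cluster(cluster, node, role):
    """Join a provisioned node to a cluster (possibly empty) as master or slave."""
    if role == "master":
        cluster.master = node
    elif role == "slave":
        cluster.slaves.append(node)
    else:
        raise ValueError("role must be 'master' or 'slave'")

def remove_from_cluster(cluster, node):
    """Remove a node from the cluster, whichever role it holds."""
    if cluster.master is node:
        cluster.master = None
    else:
        cluster.slaves.remove(node)
```

The key property is that {{provision_node}} is role-agnostic: the master/slave decision is deferred to the join step.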

Our scripts need to be reorganized roughly along the lines of these operations. The goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
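Goal 1 amounts to replacing the serial, master-driven setup with per-node provisioning that runs concurrently. A minimal sketch of how {{spark_ec2.py}} might fan the work out with a thread pool; {{provision_node}} is a hypothetical placeholder for the real per-node work:

```python
from concurrent.futures import ThreadPoolExecutor

def provision_node(host):
    # Placeholder for per-node work (install packages, fetch Spark, write config).
    return f"{host}: provisioned"

def provision_cluster(hosts, max_workers=32):
    """Provision every node concurrently; no master-to-slave broadcast involved."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order while running tasks in parallel.
        return list(pool.map(provision_node, hosts))

results = provision_cluster(["master", "slave-1", "slave-2"])
```

A thread pool is enough here because provisioning is I/O-bound (SSH and downloads), so the GIL is not a bottleneck.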

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
that script.
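For the first bullet, replacing {{copy-dir}} means each node computes and fetches what it needs itself instead of waiting for the master to push it. A hedged sketch of what a slave-side download plan might look like; the manifest layout and helper name are invented for illustration:

```python
# Hypothetical slave-side fetch step; the manifest format is invented.

MANIFEST = {
    "spark": "https://archive.apache.org/dist/spark/spark-1.2.0/spark-1.2.0-bin-hadoop2.4.tgz",
}

def plan_downloads(manifest, dest="/root"):
    """Each node derives its own download list rather than receiving an rsync."""
    return [(name, url, f"{dest}/{name}.tgz") for name, url in manifest.items()]

plan = plan_downloads(MANIFEST)
```

Because every node runs this against the same manifest, adding a slave later (SPARK-2008) needs no participation from the master at all.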



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
