[ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119220#comment-15119220
 ] 

Nicholas Chammas commented on SPARK-5189:
-----------------------------------------

FWIW, I found this issue to be practically unsolvable without rewriting most of 
spark-ec2, so I started a new project that aims to replace spark-ec2 for most 
of its use cases: [Flintrock|https://github.com/nchammas/flintrock]

> Reorganize EC2 scripts so that nodes can be provisioned independent of Spark 
> master
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-5189
>                 URL: https://issues.apache.org/jira/browse/SPARK-5189
>             Project: Spark
>          Issue Type: Improvement
>          Components: EC2
>            Reporter: Nicholas Chammas
>
> As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
> then setting up all the slaves together. This includes broadcasting files 
> from the lonely master to potentially hundreds of slaves.
> There are 2 main problems with this approach:
> # Broadcasting files from the master to all slaves using 
> [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
> (e.g. during [ephemeral-hdfs 
> init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
>  or during [Spark 
> setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
>  takes a long time. This time increases as the number of slaves increases.
>  I did some testing in {{us-east-1}}. This is, concretely, what the problem 
> looks like:
>  || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
>  Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, 
> but I think the point is clear enough.
>  We can extrapolate from this data that *every additional slave adds roughly 
> 35 seconds to the launch time* (so a cluster with 100 slaves would take 1h 6m 
> 5s to launch).
> # It's more complicated to add slaves to an existing cluster (a la 
> [SPARK-2008]), since slaves are only configured through the master during the 
> setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or 
> a slave
> * Remove a node from a cluster
> We need our scripts to roughly be organized to match the above operations. 
> The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in 
> parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like 
> [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
> internals and perhaps even allow us to build [one tool that launches Spark 
> clusters on several different cloud 
> platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are:
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
> equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure 
> it fully creates a node that can be used as either a master or slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
> configures it as a master or slave, and joins it to a cluster.
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
> that script.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to