That is a good idea, though I am not sure how much it will help, as the time
to rsync also depends on the amount of data being copied. The other problem
is that we sometimes have dependencies across packages, so the first needs
to be running before the second can start, etc.
However, I agree that it takes too long to launch, say, a 100-node cluster
right now. If you want to take a shot at trying out some changes, you can
fork the spark-ec2 repo at https://github.com/mesos/spark-ec2/tree/v2 and
reduce the number of rsync calls (each call to /root/spark-ec2/copy-dir
currently launches a separate rsync).
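
For illustration, here is a minimal Python sketch of what batching could look
like. The module names, paths, and rsync flags below are hypothetical, not
spark-ec2's actual code; the point is just that rsync's --relative (-R) flag
lets a single invocation fan several top-level directories out to one slave:

# Hypothetical sketch: replace one rsync per module with a single rsync
# that syncs all module directories to a slave at once.
import subprocess

MODULES = ["spark", "ephemeral-hdfs", "mapreduce"]  # illustrative names

def copy_all_dirs(slave_host):
    """Sync every module directory to one slave in one rsync call."""
    # --relative (-R) replicates the full source path on the destination,
    # so /root/spark and /root/ephemeral-hdfs both land in the right place
    # even though the destination is just "/".
    sources = ["/root/%s" % m for m in MODULES]
    cmd = (["rsync", "-aR", "-e", "ssh -o StrictHostKeyChecking=no"]
           + sources + ["root@%s:/" % slave_host])
    subprocess.check_call(cmd)

Note that the per-package dependency ordering mentioned above would still need
separate handling, since one batched copy leaves no point in between at which
to start the first service before the second is deployed.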

Thanks
Shivaram


On Sun, Mar 30, 2014 at 3:12 PM, Aureliano Buendia <buendia...@gmail.com> wrote:

> Hi,
>
> Spark-ec2 uses rsync to deploy many applications. It seems that over time
> more and more applications have been added to the script, which has
> significantly slowed down the setup time.
>
> Perhaps the script could be restructured this way: instead of rsyncing
> once per application (N rsync calls for N applications), we could have a
> single rsync that deploys all N applications.
>
> This should noticeably speed up the setup phase, especially for clusters
> with many nodes.
>
