On the ec2 machines, you can update the slaves from the master using something like "~/spark-ec2/copy-dir ~/spark".
Spark's TorrentBroadcast relies on the Block Manager to distribute blocks, making it relatively hard to extract.

On Mon, May 19, 2014 at 12:36 AM, Daniel Mahler <dmah...@gmail.com> wrote:

> btw is there a command or script to update the slaves from the master?
>
> thanks
> Daniel
>
>
> On Mon, May 19, 2014 at 1:48 AM, Andrew Ash <and...@andrewash.com> wrote:
>
>> If the codebase for Spark's broadcast is pretty self-contained, you could
>> consider creating a small bootstrap sent out via the doubling rsync
>> strategy that Mosharaf outlined above (called "Tree D=2" in the paper) that
>> then pulled the larger
>>
>> Mosharaf, do you have a sense of whether the gains from using Cornet vs
>> Tree D=2 with rsync outweighs the overhead of using a 2-phase broadcast
>> mechanism?
>>
>> Andrew
>>
>>
>> On Sun, May 18, 2014 at 11:32 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>>> One issue with using Spark itself is that this rsync is required to get
>>> Spark to work...
>>>
>>> Also note that a similar strategy is used for *updating* the spark
>>> cluster on ec2, where the "diff" aspect is much more important, as you
>>> might only make a small change on the driver node (recompile or
>>> reconfigure) and can get a fast sync.
>>>
>>>
>>> On Sun, May 18, 2014 at 11:22 PM, Mosharaf Chowdhury <mosharafka...@gmail.com> wrote:
>>>
>>>> What twitter calls murder, unless it has changed since then, is just a
>>>> BitTornado wrapper. In 2011, We did some comparison on the performance of
>>>> murder and the TorrentBroadcast we have right now for Spark's own broadcast
>>>> (Section 7.1 in
>>>> http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf).
>>>> Spark's implementation was 4.5X faster than murder.
>>>>
>>>> The only issue with using TorrentBroadcast to deploy code/VM is writing
>>>> a wrapper around it to read from disk, but it shouldn't be too complicated.
>>>> If someone picks it up, I can give some pointers on how to proceed (I've
>>>> thought about doing it myself forever, but never ended up actually taking
>>>> the time; right now I don't have enough free cycles either)
>>>>
>>>> Otherwise, murder/BitTornado would be better than the current strategy
>>>> we have.
>>>>
>>>> A third option would be to use rsync; but instead of rsync-ing to every
>>>> slave from the master, one can simply rsync from the master first to one
>>>> slave; then use the two sources (master and the first slave) to rsync to
>>>> two more; then four and so on. Might be a simpler solution without many
>>>> changes.
>>>>
>>>> --
>>>> Mosharaf Chowdhury
>>>> http://www.mosharaf.com/
>>>>
>>>>
>>>> On Sun, May 18, 2014 at 11:07 PM, Andrew Ash <and...@andrewash.com> wrote:
>>>>
>>>>> My first thought would be to use libtorrent for this setup, and it
>>>>> turns out that both Twitter and Facebook do code deploys with a bittorrent
>>>>> setup. Twitter even released their code as open source:
>>>>>
>>>>> https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>>>>>
>>>>> http://arstechnica.com/business/2012/04/exclusive-a-behind-the-scenes-look-at-facebook-release-engineering/
>>>>>
>>>>>
>>>>> On Sun, May 18, 2014 at 10:44 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>
>>>>>> I am not an expert in this space either. I thought the initial rsync
>>>>>> during launch is really just a straight copy that did not need the tree
>>>>>> diff. So it seemed like having the slaves do the copying among each
>>>>>> other would be better than having the master copy to everyone directly.
>>>>>> That made me think of bittorrent, though there may well be other systems
>>>>>> that do this.
>>>>>> From the launches I did today it seems that it is taking around 1
>>>>>> minute per slave to launch a cluster, which can be a problem for clusters
>>>>>> with 10s or 100s of slaves, particularly since on ec2 that time has to
>>>>>> be paid for.
>>>>>>
>>>>>>
>>>>>> On Sun, May 18, 2014 at 11:54 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>>>
>>>>>>> Out of curiosity, do you have a library in mind that would make it
>>>>>>> easy to setup a bit torrent network and distribute files in an rsync
>>>>>>> (i.e., apply a diff to a tree, ideally) fashion? I'm not familiar with this
>>>>>>> space, but we do want to minimize the complexity of our standard ec2 launch
>>>>>>> scripts to reduce the chance of something breaking.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, May 18, 2014 at 9:22 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am launching a rather large cluster on ec2.
>>>>>>>> It seems like the launch is taking forever on
>>>>>>>> ....
>>>>>>>> Setting up spark
>>>>>>>> RSYNC'ing /root/spark to slaves...
>>>>>>>> ...
>>>>>>>>
>>>>>>>> It seems that bittorrent might be a faster way to replicate
>>>>>>>> the sizeable spark directory to the slaves
>>>>>>>> particularly if there is a lot of not very powerful slaves.
>>>>>>>>
>>>>>>>> Just a thought ...
>>>>>>>>
>>>>>>>> cheers
>>>>>>>> Daniel
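
For anyone who wants to try the doubling rsync idea from the thread (master to one slave, then both sources to two more, then four, and so on), here is a rough sketch of how the copy rounds could be scheduled. It is only an illustration, not what spark-ec2 does: `doubling_schedule` and `rsync_command` are hypothetical helper names, the host names are made up, and the ssh/rsync invocation assumes passwordless ssh between nodes and the /root/spark path shown in the launch log.

```python
from typing import List, Tuple


def doubling_schedule(master: str, slaves: List[str]) -> List[List[Tuple[str, str]]]:
    """Plan copy rounds in which every host that already has the data
    sends it to one host that does not (the "Tree D=2" idea from the thread).

    Returns a list of rounds; each round is a list of (source, dest) pairs
    that could run in parallel.
    """
    have = [master]          # hosts that already hold the directory
    pending = list(slaves)   # hosts still waiting for it
    rounds = []
    while pending:
        # Each current holder serves at most one new host this round.
        batch = pending[:len(have)]
        pending = pending[len(have):]
        rounds.append(list(zip(have, batch)))
        have.extend(batch)
    return rounds


def rsync_command(src: str, dst: str, path: str = "/root/spark") -> List[str]:
    """Build (but do not run) an argv list for one copy.

    Assumption: the copy is started on `src` over ssh and pushes to `dst`.
    """
    return ["ssh", src, "rsync", "-az", f"{path}/", f"{dst}:{path}/"]


if __name__ == "__main__":
    hosts = [f"slave{i}" for i in range(1, 8)]  # made-up host names
    for n, pairs in enumerate(doubling_schedule("master", hosts), start=1):
        print(f"round {n}:")
        for src, dst in pairs:
            print("  " + " ".join(rsync_command(src, dst)))
```

Under this scheme the number of rounds grows with log2 of the cluster size, so 100 slaves would need 7 rounds, assuming each copy takes roughly the same time and the copies within a round really do run in parallel.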