Re: EC2 clusters ready in launch time + 30 seconds

Nicholas Chammas Mon, 06 Oct 2014 14:15:46 -0700

FYI: I've created SPARK-3821: Develop an automated way of creating Spark
images (AMI, Docker, and others)
<https://issues.apache.org/jira/browse/SPARK-3821>
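
To make the idea concrete, here is a rough sketch of what one Packer
template covering both an AMI and a Docker image could look like, written
as a small Python helper that emits the legacy JSON template format. This
is only an illustration, not something taken from SPARK-3821: the function
name, the placeholder AMI id, the instance type, the SSH user, the Docker
base image, the default Spark version, and the reuse of create_image.sh as
the provisioning script are all assumptions.

```python
import json


def spark_image_template(source_ami, region="us-east-1", spark_version="1.1.0"):
    """Return a Packer (legacy JSON) template as a dict.

    All concrete values are illustrative assumptions, not project settings.
    """
    return {
        "builders": [
            {
                # Builds an EBS-backed AMI from an existing base AMI.
                "type": "amazon-ebs",
                "region": region,
                "source_ami": source_ami,      # placeholder id supplied by caller
                "instance_type": "m3.large",   # assumed build instance type
                "ssh_username": "ec2-user",    # assumed login user on the base AMI
                "ami_name": "spark-%s-{{timestamp}}" % spark_version,
            },
            {
                # Builds a Docker image from the same provisioning steps.
                "type": "docker",
                "image": "centos:6",           # assumed base image
                "commit": True,
            },
        ],
        "provisioners": [
            # Assumes the existing image script can run unattended on both targets.
            {"type": "shell", "script": "create_image.sh"},
        ],
    }


if __name__ == "__main__":
    # "ami-xxxxxxxx" is a placeholder, not a real AMI id.
    print(json.dumps(spark_image_template("ami-xxxxxxxx"), indent=2))
```

The printed JSON could be saved to a file and passed to `packer build`;
whether create_image.sh works unchanged inside a container would need
checking.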


On Mon, Oct 6, 2014 at 4:48 PM, Daniil Osipov <[email protected]>
wrote:

> I've also been looking at this. Basically, the Spark EC2 script is
> excellent for small development clusters of several nodes, but isn't
> suitable for production. It handles instance setup in a single-threaded
> manner, although it could easily be parallelized. It also doesn't handle
> failure well, e.g. when an instance fails to start or takes too long to respond.
>
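
A minimal sketch of the parallel setup with failure handling described
above, using only the Python standard library. It is not spark-ec2 code:
setup_cluster, its parameters, and the setup_fn callback (standing in for
whatever per-node work the script does over SSH) are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def setup_cluster(hosts, setup_fn, timeout_s=60.0, max_workers=16):
    """Run setup_fn(host) for every host in parallel.

    Returns (ready, failed): a sorted list of hosts that finished, and a
    dict mapping each failed host to a short reason.
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(setup_fn, host): host for host in hosts}

    # Wait for all hosts, but give up on stragglers after timeout_s seconds.
    done, not_done = wait(futures, timeout=timeout_s)

    ready, failed = [], {}
    for fut in done:
        host = futures[fut]
        exc = fut.exception()
        if exc is None:
            ready.append(host)
        else:
            failed[host] = str(exc)

    for fut in not_done:
        # cancel() only stops tasks that have not started yet; a running
        # task keeps going, so setup_fn should enforce its own timeouts.
        fut.cancel()
        failed[futures[fut]] = "timed out"

    pool.shutdown(wait=False)
    return sorted(ready), failed
```

The caller can then decide whether to retry, replace, or drop the hosts in
`failed` instead of blocking the whole launch on one slow instance.
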
> Our desire was to have an equivalent to the Amazon EMR[1] API that would
> trigger Spark jobs, including specified cluster setup. I've done some work
> towards that end, and it would benefit greatly from an updated AMI.
>
> Dan
>
> [1]
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-commands.html
>
> On Sat, Oct 4, 2014 at 7:28 AM, Nicholas Chammas <
> [email protected]> wrote:
>
>> Thanks for posting that script, Patrick. It looks like a good place to
>> start.
>>
>> Regarding Docker vs. Packer, as I understand it you can use Packer to
>> create Docker containers at the same time as AMIs and other image types.
>>
>> Nick
>>
>>
>> On Sat, Oct 4, 2014 at 2:49 AM, Patrick Wendell <[email protected]>
>> wrote:
>>
>> > Hey All,
>> >
>> > Just a couple notes. I recently posted a shell script for creating the
>> > AMI's from a clean Amazon Linux AMI.
>> >
>> > https://github.com/mesos/spark-ec2/blob/v3/create_image.sh
>> >
>> > I think I will update the AMI's soon to get the most recent security
>> > updates. For spark-ec2's purpose this is probably sufficient (we'll
>> > only need to re-create them every few months).
>> >
>> > However, it would be cool if someone wanted to tackle providing a more
>> > general mechanism for defining Spark-friendly "images" that can be
>> > used more generally. I had thought that docker might be a good way to
>> > go for something like this - but maybe this packer thing is good too.
>> >
>> > For one thing, if we had a standard image we could use it to create
>> > containers for running Spark's unit tests, which would be really cool.
>> > This would help a lot with random issues around port and filesystem
>> > contention we have for unit tests.
>> >
>> > I'm not sure if the long term place for this would be inside the spark
>> > codebase or a community library or what. But it would definitely be
>> > very valuable to have if someone wanted to take it on.
>> >
>> > - Patrick
>> >
>> > On Fri, Oct 3, 2014 at 5:20 PM, Nicholas Chammas
>> > <[email protected]> wrote:
>> > > FYI: There is an existing issue -- SPARK-3314
>> > > <https://issues.apache.org/jira/browse/SPARK-3314> -- about scripting
>> > > the creation of Spark AMIs.
>> > >
>> > > With Packer, it looks like we may be able to script the creation of
>> > > multiple image types (VMWare, GCE, AMI, Docker, etc...) at once from a
>> > > single Packer template. That's very cool.
>> > >
>> > > I'll be looking into this.
>> > >
>> > > Nick
>> > >
>> > >
>> > > On Thu, Oct 2, 2014 at 8:23 PM, Nicholas Chammas <[email protected]>
>> > > wrote:
>> > >
>> > >> Thanks for the update, Nate. I'm looking forward to seeing how these
>> > >> projects turn out.
>> > >>
>> > >> David, Packer looks very, very interesting. I'm gonna look into it
>> > >> more next week.
>> > >>
>> > >> Nick
>> > >>
>> > >>
>> > >> On Thu, Oct 2, 2014 at 8:00 PM, Nate D'Amico <[email protected]>
>> > >> wrote:
>> > >>
>> > >>> Bit of progress on our end, bit of lagging as well.  Our guy leading
>> > >>> effort got little bogged down on client project to update hive/sql
>> > >>> testbed to latest spark/sparkSQL, also launching public service so we
>> > >>> have been bit scattered recently.
>> > >>>
>> > >>> Will have some more updates probably after next week.  We are planning
>> > >>> on taking our client work around hive/spark, plus taking over the
>> > >>> bigtop automation work to modernize and get that fit for human
>> > >>> consumption outside our org.  All our work and puppet modules will be
>> > >>> open sourced, documented, hopefully start to rally some other folks
>> > >>> around effort that find it useful
>> > >>>
>> > >>> Side note, another effort we are looking into is gradle tests/support.
>> > >>> We have been leveraging serverspec for some basic infrastructure tests,
>> > >>> but with bigtop switching over to gradle builds/testing setup in 0.8 we
>> > >>> want to include support for that in our own efforts, probably some
>> > >>> stuff that can be learned and leveraged in spark world for
>> > >>> repeatable/tested infrastructure
>> > >>>
>> > >>> If anyone has any specific automation questions to your environment you
>> > >>> can drop me a line directly, will try to help out best I can.  Else
>> > >>> will post update to dev list once we get on top of our own product
>> > >>> release and the bigtop work
>> > >>>
>> > >>> Nate
>> > >>>
>> > >>>
>> > >>> -----Original Message-----
>> > >>> From: David Rowe [mailto:[email protected]]
>> > >>> Sent: Thursday, October 02, 2014 4:44 PM
>> > >>> To: Nicholas Chammas
>> > >>> Cc: dev; Shivaram Venkataraman
>> > >>> Subject: Re: EC2 clusters ready in launch time + 30 seconds
>> > >>>
>> > >>> I think this is exactly what packer is for. See e.g.
>> > >>> http://www.packer.io/intro/getting-started/build-image.html
>> > >>>
>> > >>> On a related note, the current AMI for hvm systems (e.g. m3.*, r3.*)
>> > >>> has a bad package for httpd, which causes ganglia not to start. For
>> > >>> some reason I can't get access to the raw AMI to fix it.
>> > >>>
>> > >>> On Fri, Oct 3, 2014 at 9:30 AM, Nicholas Chammas <
>> > >>> [email protected]
>> > >>> > wrote:
>> > >>>
>> > >>> > Is there perhaps a way to define an AMI programmatically? Like, a
>> > >>> > collection of base AMI id + list of required stuff to be installed +
>> > >>> > list of required configuration changes. I'm guessing that's what
>> > >>> > people use things like Puppet, Ansible, or maybe also AWS
>> > >>> > CloudFormation for, right?
>> > >>> >
>> > >>> > If we could do something like that, then with every new release of
>> > >>> > Spark we could quickly and easily create new AMIs that have
>> > >>> > everything we need. spark-ec2 would only have to bring up the
>> > >>> > instances and do a minimal amount of configuration, and the only
>> > >>> > thing we'd need to track in the Spark repo is the code that defines
>> > >>> > what goes on the AMI, as well as a list of the AMI ids specific to
>> > >>> > each release.
>> > >>> >
>> > >>> > I'm just thinking out loud here. Does this make sense?
>> > >>> >
>> > >>> > Nate,
>> > >>> >
>> > >>> > Any progress on your end with this work?
>> > >>> >
>> > >>> > Nick
>> > >>> >
>> > >>> >
>> > >>> > On Sun, Jul 13, 2014 at 8:53 PM, Shivaram Venkataraman <
>> > >>> > [email protected]> wrote:
>> > >>> >
>> > >>> > > It should be possible to improve cluster launch time if we are
>> > >>> > > careful about what commands we run during setup. One way to do this
>> > >>> > > would be to walk down the list of things we do for cluster
>> > >>> > > initialization and see if there is anything we can do to make
>> > >>> > > things faster. Unfortunately this might be pretty time consuming,
>> > >>> > > but I don't know of a better strategy. The place to start would be
>> > >>> > > the setup.sh file at
>> > >>> > > https://github.com/mesos/spark-ec2/blob/v3/setup.sh
>> > >>> > >
>> > >>> > > Here are some things that take a lot of time and could be improved:
>> > >>> > > 1. Creating swap partitions on all machines. We could check if there
>> > >>> > >    is a way to get EC2 to always mount a swap partition.
>> > >>> > > 2. Copying / syncing things across slaves. The copy-dir script is
>> > >>> > >    called too many times right now and each time it pauses for a few
>> > >>> > >    milliseconds between slaves [1]. This could be improved by
>> > >>> > >    removing unnecessary copies.
>> > >>> > > 3. We could make less frequently used modules like Tachyon,
>> > >>> > >    persistent hdfs not a part of the default setup.
>> > >>> > >
>> > >>> > > [1] https://github.com/mesos/spark-ec2/blob/v3/copy-dir.sh#L42
>> > >>> > >
>> > >>> > > Thanks
>> > >>> > > Shivaram
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > >
>> > >>> > > On Sat, Jul 12, 2014 at 7:02 PM, Nicholas Chammas <
>> > >>> > > [email protected]> wrote:
>> > >>> > >
>> > >>> > > > On Thu, Jul 10, 2014 at 8:10 PM, Nate D'Amico <[email protected]>
>> > >>> > > > wrote:
>> > >>> > > >
>> > >>> > > > > Starting to work through some automation/config stuff for spark
>> > >>> > > > > stack on EC2 with a project, will be focusing the work through
>> > >>> > > > > the apache bigtop effort to start, can then share with spark
>> > >>> > > > > community directly as things progress if people are interested
>> > >>> > > >
>> > >>> > > >
>> > >>> > > > Let us know how that goes. I'm definitely interested in hearing
>> > >>> > > > more.
>> > >>> > > >
>> > >>> > > > Nick
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>>
>> > >>>
>> > >>
>> >
>>
>
>
