The code can continue to be a good reference implementation, no matter
where it lives. In fact, it can be a better, more complete one, and
easier to update.

I agree that ec2/ needs to retain some kind of pointer to the new
location. Yes, maybe a script as well that does the checkout as you
say. We have to be careful that the effect here isn't to make people
think this code is still part of the blessed bits of a Spark release,
since it isn't. But I suppose the point is that it isn't quite that
now either (it isn't tested, and isn't fully contained in
apache/spark), and that's what we're fixing.
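The pointer script could be as small as build/mvn. Here is a minimal sketch; the amplab/spark-ec2 URL and the spark_ec2.py entry point are assumptions for illustration, not settled decisions, and the stub "checkout" at the top exists only so the sketch runs without touching the network:

```shell
#!/usr/bin/env bash
# Stub checkout so this sketch is self-contained; a real run would not
# have this block and would fall through to the git clone below.
DEMO_DIR="$(mktemp -d)"
printf '#!/bin/sh\necho "spark-ec2 invoked with: $*"\n' > "$DEMO_DIR/spark_ec2.py"
chmod +x "$DEMO_DIR/spark_ec2.py"

# --- the wrapper that would live at ec2/spark-ec2 (location assumed) ---
SPARK_EC2_REPO="https://github.com/amplab/spark-ec2"
SPARK_EC2_DIR="${SPARK_EC2_DIR:-$DEMO_DIR}"

# First run: fetch the relocated code, the way build/mvn bootstraps Maven.
if [ ! -d "$SPARK_EC2_DIR" ]; then
  git clone --depth 1 "$SPARK_EC2_REPO" "$SPARK_EC2_DIR"
fi

# Demo invocation; a real wrapper would forward its own arguments
# with: exec "$SPARK_EC2_DIR/spark_ec2.py" "$@"
RESULT="$("$SPARK_EC2_DIR/spark_ec2.py" launch my-cluster)"
echo "$RESULT"
```

The point is just that users keep typing ec2/spark-ec2 exactly as they do today, and the wrapper owns the indirection to wherever the code ends up.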

I still don't like the idea of using the ASF JIRA for Spark to track
issues in a separate project, as these kinds of splits are what we're
trying to get rid of. I think it's a plus to be able to only bother
with the GitHub PR/issue system, and not parallel JIRAs as well. I
also worry that this blurs the line between code that is formally
tested and blessed in a Spark release, and that which is not. If you
fix an issue in this separate repo and mark it "fixed in Spark 1.5" --
what does that imply?

I think the issue is people don't like the sense this is getting
pushed outside the wall, or 'removed' from Spark. On the one hand I
argue it hasn't really properly been part of Spark -- that's why we
need this change to happen. But, I also think this is easy to resolve
other ways: spark-packages.org, the pointer in the repo, prominent
notes in the wiki, etc.

I suggest Shivaram owns this, and that amplab/spark-ec2 is used to
host. I'm not qualified to make the new copy or administer the repo,
but I'd be happy to help with the rest, like triaging, if you can
give me rights to work on issues.


On Wed, Jul 15, 2015 at 5:35 AM, Matt Goodman <meawo...@gmail.com> wrote:
> I concur with the things Sean said about keeping the same JIRA.  Frankly,
> it's a pretty small part of Spark, and, as Nicholas mentioned, a reference
> implementation for getting Spark running on EC2.
>
> I can see wanting to grow it into a somewhat more general tool that
> implements launchers for other compute platforms.  Porting it over to
> Google/M$/Rackspace offerings would not be too far out of reach.
>
> --Matthew Goodman
>
> =====================
> Check Out My Website: http://craneium.net
> Find me on LinkedIn: http://tinyurl.com/d6wlch
>
> On Mon, Jul 13, 2015 at 2:46 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
>>
>> > At a high level I see the spark-ec2 scripts as an effort to provide a
>> > reference implementation for launching EC2 clusters with Apache Spark
>>
>> On a side note, this is precisely how I used spark-ec2 for a personal
>> project that does something similar: reference implementation.
>>
>> Nick
>> On Mon, Jul 13, 2015 at 1:27 PM, Shivaram Venkataraman
>> <shiva...@eecs.berkeley.edu> wrote:
>>>
>>> I think moving the repo location and re-organizing the Python code to
>>> handle dependencies, testing, etc. sounds good to me. However, there are
>>> a couple of things I am not sure about.
>>>
>>> 1. I strongly believe that we should preserve the existing command-line
>>> interface in ec2/spark-ec2 (i.e. the shell script, not the Python file).
>>> This could be a thin wrapper script that just checks out or downloads the
>>> new repository (similar to, say, build/mvn). Mainly, I see no reason to
>>> break the workflow that users are used to right now.
>>>
>>> 2. I am also not sure that moving the issue tracker is necessarily a
>>> good idea. I don't think we get a large number of issues due to the EC2
>>> stuff, and if we do have a workflow for launching EC2 clusters, the Spark
>>> JIRA would still be the natural place to report issues related to it.
>>>
>>> At a high level I see the spark-ec2 scripts as an effort to provide a
>>> reference implementation for launching EC2 clusters with Apache Spark --
>>> Given this view I am not sure it makes sense to completely decouple this
>>> from the Apache project.
>>>
>>> Thanks
>>> Shivaram
>>>
>>> On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> I agree with these points. The ec2 support is substantially a separate
>>>> project, and would likely be better managed as one. People can much
>>>> more rapidly iterate on it and release it.
>>>>
>>>> I suggest:
>>>>
>>>> 1. Pick a new repo location. amplab/spark-ec2 ? spark-ec2/spark-ec2 ?
>>>> 2. Add interested parties as owners/contributors
>>>> 3. Reassemble a working clone of the current code from spark/ec2 and
>>>> mesos/spark-ec2 and check it in
>>>> 4. Announce the new location on user@, dev@
>>>> 5. Triage open JIRAs to the new repo's issue tracker and close them
>>>> elsewhere
>>>> 6. Remove the old copies of the code and leave a pointer to the new
>>>> location in their place
>>>>
>>>> I'd also like to hear a few more nods before pulling the trigger though.
>>>>
>>>> On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman <meawo...@gmail.com>
>>>> wrote:
>>>> > I wanted to revive the conversation about the spark-ec2 tools, as it
>>>> > seems
>>>> > to have been lost in the 1.4.1 release voting spree.
>>>> >
>>>> > I think that splitting it into its own repository is a really good
>>>> > move, and
>>>> > I would also be happy to help with this transition, as well as help
>>>> > maintain
>>>> > the resulting repository.  Here is my justification for why we ought
>>>> > to do
>>>> > this split.
>>>> >
>>>> > User Facing:
>>>> >
>>>> > - The spark-ec2 launcher doesn't use anything in the parent Spark
>>>> > repository.
>>>> > - The spark-ec2 version is disjoint from the parent repo. I consider
>>>> > it confusing that the spark-ec2 script doesn't launch the version of
>>>> > Spark it is checked out with.
>>>> > - Someone interested in setting up spark-ec2 with anything but the
>>>> > default configuration will have to clone at least 2 repositories at
>>>> > present, and probably fork and push changes to 1.
>>>> > - spark-ec2 has mismatched dependencies wrt. Spark itself. This
>>>> > includes a confusing shim in the spark-ec2 script to install boto,
>>>> > which frankly should just be a dependency of the script.
>>>> >
>>>> > Developer Facing:
>>>> >
>>>> > - Support across 2 repos will be worse than across 1. It's unclear
>>>> > where to file issues/PRs, and even fairly trivial stuff requires
>>>> > extra communication.
>>>> > - spark-ec2 also depends on a number of binary blobs being in the
>>>> > right place; currently the responsibility for these is
>>>> > decentralized, and likely prone to various flavors of dumb.
>>>> > - The current flow of booting a spark-ec2 cluster is _complicated_.
>>>> > I spent the better part of a couple of days figuring out how to
>>>> > integrate our custom tools into this stack. This is very hard to fix
>>>> > when commits/PRs need to span groups/repositories/buckets-o-binary,
>>>> > and I am sure there are several other problems languishing under
>>>> > similar roadblocks.
>>>> > - It makes testing possible. The spark-ec2 script is a great case
>>>> > for CI given the number of permutations of launch criteria. I
>>>> > suspect AWS would be happy to foot the bill on spark-ec2 testing
>>>> > (probably ~20 bucks a month based on some envelope sketches), as it
>>>> > is a piece of software that directly impacts other people giving
>>>> > them money. I have some contacts there, and I am pretty sure this
>>>> > would be an easy conversation, particularly since the repo is
>>>> > directly concerned with EC2. Think also of being able to assemble
>>>> > the binary blobs into an S3 bucket dedicated to spark-ec2.
>>>> >
>>>> > Any other thoughts/voices appreciated here.  spark-ec2 is a
>>>> > super-power tool
>>>> > and deserves a fair bit of attention!
>>>> > --Matthew Goodman
>>>> >
>>>> > =====================
>>>> > Check Out My Website: http://craneium.net
>>>> > Find me on LinkedIn: http://tinyurl.com/d6wlch
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>>
>>>
>
