So I think I need to clarify a few things here - particularly since this mail went to the wrong mailing list and a much wider audience than I intended it for :-)
Most of the issues I mentioned are internal implementation details of spark core, which means we can enhance them in future without disruption to our userbase (e.g. the ability to support a large number of input/output partitions. Note: this is on the order of 100k input and output partitions with a uniform spread of keys - very rarely seen outside of some crazy jobs).

Some of the issues I mentioned would require DeveloperApi changes - which are not user exposed: they would impact developer use of these APIs, which are mostly internally provided by spark (for example, fixing blocks > 2G would require a change to the Serializer api).

A smaller fraction might require interface changes - note, I am referring specifically to configuration changes (removing/deprecating some) and possibly newer options to submit/env, etc. - I don't envision any programming api change itself. The only api change we did was from Seq -> Iterable, which is actually to address some of the issues I mentioned (join/cogroup).

Remaining are bugs which need to be addressed, or the feature removed/enhanced - like shuffle consolidation.

There might be a semantic extension of some things like the OFF_HEAP storage level to address other computation models - but that would not have an impact on the end user, since other options would be pluggable with the default set to Tachyon, so that there is no user expectation change.

So will the interface possibly change? Sure, though we will try to keep it backward compatible (as we did with 1.0). Will the api change - other than backward compatible enhancements - probably not.

(To make a few of these points concrete - the Iterable change, OFF_HEAP, and the 2G block limit - I have appended some rough sketches at the very end of this mail, below the quoted thread.)

Regards,
Mridul


On Sun, May 18, 2014 at 12:11 PM, Mridul Muralidharan <[email protected]> wrote:
>
> On 18-May-2014 5:05 am, "Mark Hamstra" <[email protected]> wrote:
>>
>> I don't understand. We never said that interfaces wouldn't change from 0.9
>
> Agreed.
>
>> to 1.0. What we are committing to is stability going forward from the 1.0.0 baseline. Nobody is disputing that backward-incompatible behavior or interface changes would be an issue post-1.0.0. The question is whether
>
> The point is, how confident are we that these are the right set of interface definitions. We think it is, but we could also have gone through a 0.10 to vet the proposed 1.0 changes to stabilize them.
>
> To give examples for which we don't have solutions currently (which we are facing internally here btw, so not an academic exercise):
>
> - Current spark shuffle model breaks very badly as the number of partitions increases (input and output).
>
> - As the number of nodes increases, the overhead per node keeps going up. Spark currently is more geared towards large-memory machines; when the RAM per node is modest (8 to 16 gig) but a large number of them are available, it does not do too well.
>
> - Current block abstraction breaks as data per block goes beyond 2 gig.
>
> - Cogroup/join currently breaks when the value per key or the number of keys (or both) is high.
>
> - Shuffle consolidation is so badly broken it is not funny.
>
> - Currently there is no way of effectively leveraging accelerator cards/coprocessors/gpus from spark - to do so, I suspect we will need to redefine OFF_HEAP.
>
> - Effectively leveraging ssd is still an open question IMO when you have a mix of both available.
>
> We have resolved some of these and are looking at the rest. These are not unique to our internal usage profile; I have seen most of these asked elsewhere too.
>
> Thankfully, some of the 1.0 changes actually are geared towards helping to alleviate some of the above (the Iterable change, for example); most of the rest are internal impl details of spark core, which helps a lot - but there are cases where this is not so.
>
> Unfortunately I don't know yet if the unresolved/uninvestigated issues will require more changes or not.
>
> Given this, I am very skeptical of expecting current spark interfaces to be sufficient for the next 1 year (forget 3).
>
> I understand this is an argument which can be made to never release 1.0 :-) Which is why I was ok with a 1.0 instead of a 0.10 release in spite of my preference.
>
> This is a good problem to have IMO ... People are using spark extensively and in circumstances that we did not envision, necessitating changes even to spark core.
>
> But the claim that 1.0 interfaces are stable is not something I buy - they are not, we will need to break them soon, and the cost of maintaining backward compatibility will be high.
>
> We just need to make an informed decision to live with that cost, not hand wave it away.
>
> Regards
> Mridul
>
>> there is anything apparent now that is expected to require such disruptive changes if we were to commit to the current release candidate as our guaranteed 1.0.0 baseline.
>>
>>
>> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan <[email protected]> wrote:
>>
>> > I would make the case for interface stability, not just api stability. Particularly given that we have significantly changed some of our interfaces, I want to ensure developers/users are not seeing red flags.
>> >
>> > Bugs and code stability can be addressed in minor releases if found, but behavioral change and/or interface changes would be a much more invasive issue for our users.
>> >
>> > Regards
>> > Mridul
>> > On 18-May-2014 2:19 am, "Matei Zaharia" <[email protected]> wrote:
>> >
>> > > As others have said, the 1.0 milestone is about API stability, not about saying "we've eliminated all bugs". The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I've seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development.
>> > >
>> > > With a project as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors over time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; "commits" changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project instead of having people come in and randomly change the API.
>> > >
>> > > I'll also note that the issues marked "blocker" were marked so by their reporters, since the reporter can set the priority.
>> > > I don't consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker -- it's a bug, it would be good to fix it, but it only affects a small number of use cases. Of course if we find a real blocker (in particular a regression from a previous version, or a feature that's just completely broken), we will delay the release for that, but at some point you have to say "okay, this fix will go into the next maintenance release". Maybe we need to write a clear policy for what the issue priorities mean.
>> > >
>> > > Finally, I believe it's much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much larger projects than ours, like Linux, do, and it's the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants an immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you'd like to improve, just add a new one and maybe deprecate the old one -- at some point we have to respect our users and let them know that the code they write today will still run tomorrow.
>> > >
>> > > Matei
>> > >
>> > > On May 17, 2014, at 10:32 AM, Kan Zhang <[email protected]> wrote:
>> > >
>> > > > +1 on the running commentary here, non-binding of course :-)
>> > > >
>> > > >
>> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <[email protected]> wrote:
>> > > >
>> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
>> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <[email protected]> wrote:
>> > > >>
>> > > >>> I had echoed similar sentiments a while back when there was a discussion around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api changes, add missing functionality, and go through a hardening release before 1.0.
>> > > >>>
>> > > >>> But the community preferred a 1.0 :-)
>> > > >>>
>> > > >>> Regards,
>> > > >>> Mridul
>> > > >>>
>> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <[email protected]> wrote:
>> > > >>>>
>> > > >>>> On this note, non-binding commentary:
>> > > >>>>
>> > > >>>> Releases happen in local minima of change, usually created by an internally enforced code freeze. Spark is incredibly busy now due to external factors -- recently a TLP, recently discovered by a large new audience, ease of contribution enabled by Github. It's getting like the first year of mainstream battle-testing in a month. It's been very hard to freeze anything! I see a number of non-trivial issues being reported, and I don't think it has been possible to triage all of them, even.
>> > > >>>>
>> > > >>>> Given the high rate of change, my instinct would have been to release 0.10.0 now. But won't it always be very busy? I do think the rate of significant issues will slow down.
>> > > >>>>
>> > > >>>> Version ain't nothing but a number, but if it has any meaning it's the semantic versioning meaning. 1.0 imposes extra handicaps around striving to maintain backwards-compatibility. That may end up being bent to fit in important changes that are going to be required in this continuing period of change. Hadoop does this all the time unfortunately and gets away with it, I suppose -- minor version releases are really major. (On the other extreme, HBase is at 0.98 and quite production-ready.)
>> > > >>>>
>> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x rather than new features and 1.x. I think there are a few steps that could streamline triage of this flood of contributions, and make all of this easier, but that's for another thread.
>> > > >>>>
>> > > >>>>
>> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <[email protected]> wrote:
>> > > >>>>> +1, but just barely. We've got quite a number of outstanding bugs identified, and many of them have fixes in progress. I'd hate to see those efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high priority relative to 1.1.0.
>> > > >>>>>
>> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any of the identified bugs are show-stoppers or strictly regressions (although I will note that one that I have in progress, SPARK-1749, is a bug that we introduced with recent work -- it's not strictly a regression because we had equally bad but different behavior when the DAGScheduler exceptions weren't previously being handled at all vs. being slightly mis-handled now), so I'm not currently seeing a reason not to release.
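
PS: to make the Seq -> Iterable point above concrete, here is a rough sketch of what the change looks like from user code (typed from memory, not lifted from the Spark source; it assumes a live SparkContext named sc, e.g. the spark-shell):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // 0.9.x: grouped is RDD[(String, Seq[Int])]
    // 1.0.0: grouped is RDD[(String, Iterable[Int])]
    val grouped = pairs.groupByKey()

    // Code written against the Iterable contract works with either version,
    // and the looser type leaves room to stream/spill grouped values later
    // instead of materializing them all - which is the join/cogroup issue above.
    grouped.mapValues(vs => vs.sum).collect()   // Array((a,3), (b,3))

cogroup/join change the same way: the grouped values go from (Seq[V], Seq[W]) to (Iterable[V], Iterable[W]).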
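
Similarly for OFF_HEAP: what a user writes today is below (again a rough sketch assuming a live SparkContext named sc and a reachable Tachyon master; the config key is the 1.0-era name as I remember it, so treat it as an assumption). The point is that this code stays exactly the same if the backing store becomes pluggable with Tachyon as the default.

    import org.apache.spark.storage.StorageLevel

    // e.g. with spark.tachyonStore.url=tachyon://host:19998 in spark-defaults.conf
    val data = sc.parallelize(1 to 1000000)
    data.persist(StorageLevel.OFF_HEAP)   // in 1.0 this is served by Tachyon
    data.count()                          // materializes the blocks off-heap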
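
And a rough illustration of the 2G block limit I mentioned (my own sketch of the Int-capacity constraint, not Spark code): block contents end up in Int-indexed structures such as Array[Byte] and java.nio.ByteBuffer, which is why lifting the limit reaches into the Serializer and block-transfer interfaces rather than being a one-line fix.

    import java.nio.ByteBuffer

    val wanted: Long = 3L * 1024 * 1024 * 1024                        // a 3 GB block
    println(s"largest addressable buffer: ${Int.MaxValue} bytes")     // ~2 GB

    // ByteBuffer.allocate takes an Int capacity; 3 GB truncated to Int is
    // negative, so the call below would throw IllegalArgumentException:
    // ByteBuffer.allocate(wanted.toInt)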
