On 18-May-2014 5:05 am, "Mark Hamstra" <m...@clearstorydata.com> wrote:
>
> I don't understand.  We never said that interfaces wouldn't change from 0.9

Agreed.

> to 1.0.  What we are committing to is stability going forward from the
> 1.0.0 baseline.  Nobody is disputing that backward-incompatible behavior or
> interface changes would be an issue post-1.0.0.  The question is whether

The point is, how confident are we that these are the right set of
interface definitions? We think they are, but we could also have gone
through a 0.10 release to vet the proposed 1.0 changes and stabilize them.

To give examples of issues for which we don't have solutions currently
(which we are facing internally here btw, so this is not an academic
exercise) - a small illustrative sketch follows the list:

- The current Spark shuffle model breaks very badly as the number of
partitions (input and output) increases.

- As the number of nodes increases, the overhead per node keeps going up.
Spark is currently geared more towards large-memory machines; when the RAM
per node is modest (8 to 16 GB) but a large number of nodes are available,
it does not do too well.

- The current block abstraction breaks as the data per block goes beyond 2 GB.

- Cogroup/join currently breaks when the number of values per key or the
number of keys (or both) is high.

- Shuffle consolidation is so badly broken it is not funny.

- Currently there is no way of effectively leveraging accelerator
cards/coprocessors/GPUs from Spark - to do so, I suspect we will need to
redefine OFF_HEAP.

- Effectively leveraging SSDs is still an open question IMO when you have a
mix of both available.
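
To make the cogroup/shuffle points above concrete, here is a rough,
hypothetical sketch (names and sizes are made up, not from our internal
jobs) of the kind of skewed job that hits them - scale n up and the hot
key's buffered values and its single shuffle block grow without bound:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object SkewedCogroupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[4]", "skewed-cogroup-sketch")
    val n = 10000000L  // scale this up to push per-key buffers and shuffle blocks past their limits

    // Left side: almost every record lands under one hot key.
    val left = sc.parallelize(0L until n, 200)
      .map(i => (if (i % 100 != 0) "hot" else "k-" + (i % 1000), i))

    // Right side: keys spread evenly.
    val right = sc.parallelize(0L until n / 10, 200)
      .map(i => ("k-" + (i % 1000), i))

    // All values for "hot" are shuffled to a single reducer and buffered for that key;
    // with enough data, that one shuffle block also heads toward the 2 GB limit.
    println(left.cogroup(right).count())

    sc.stop()
  }
}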

We have resolved some of these and are looking at the rest. These are not
unique to our internal usage profile; I have seen most of them raised
elsewhere too.

Thankfully, some of the 1.0 changes are actually geared towards helping to
alleviate some of the above (the Iterable change, for example), and most of
the rest are internal implementation details of Spark core, which helps a
lot - but there are cases where this is not so.
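
For reference, the Iterable change I mean is (as I understand the 1.0 API)
the grouped-value results of groupByKey/cogroup moving from Seq[V] to
Iterable[V], which at least leaves room for implementations that do not
materialize every value for a key at once. A minimal sketch, assuming the
1.0 signatures:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object IterableChangeSketch {
  // 0.9: groupByKey(): RDD[(K, Seq[V])]      -- all values for a key handed back as one in-memory Seq.
  // 1.0: groupByKey(): RDD[(K, Iterable[V])] -- callers only iterate, so an implementation is
  // free (in principle) to spill or stream the values instead of buffering them all.
  def sumPerKey(pairs: RDD[(String, Long)]): RDD[(String, Long)] =
    pairs.groupByKey().mapValues(vs => vs.foldLeft(0L)(_ + _))
}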

Unfortunately I don't know yet if the unresolved/uninvestigated issues will
require more changes or not.

Given this, I am very skeptical that the current Spark interfaces will be
sufficient for the next 1 year (forget 3).

I understand this is an argument that can be made to never release a 1.0 :-)
which is why I was OK with a 1.0 instead of a 0.10 release in spite of my
preference.

This is a good problem to have IMO ... people are using Spark extensively
and in circumstances that we did not envision, necessitating changes even
to Spark core.

But the claim that the 1.0 interfaces are stable is not something I buy -
they are not; we will need to break them soon, and the cost of maintaining
backward compatibility will be high.

We just need to make an informed decision to live with that cost, not hand
wave it away.

Regards
Mridul

> there is anything apparent now that is expected to require such disruptive
> changes if we were to commit to the current release candidate as our
> guaranteed 1.0.0 baseline.
>
>
> > On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>
> > I would make the case for interface stability not just api stability.
> > Particularly given that we have significantly changed some of our
> > interfaces, I want to ensure developers/users are not seeing red flags.
> >
> > Bugs and code stability can be addressed in minor releases if found, but
> > behavioral change and/or interface changes would be a much more invasive
> > issue for our users.
> >
> > Regards
> > Mridul
> > On 18-May-2014 2:19 am, "Matei Zaharia" <matei.zaha...@gmail.com> wrote:
> >
> > > As others have said, the 1.0 milestone is about API stability, not about
> > > saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner
> > > users can confidently build on Spark, knowing that the application they
> > > build today will still run on Spark 1.9.9 three years from now. This is
> > > something that I’ve seen done badly (and experienced the effects thereof)
> > > in other big data projects, such as MapReduce and even YARN. The result is
> > > that you annoy users, you end up with a fragmented userbase where everyone
> > > is building against a different version, and you drastically slow down
> > > development.
> > >
> > > With a project as fast-growing as Spark in particular, there will be new
> > > bugs discovered and reported continuously, especially in the non-core
> > > components. Look at the graph of # of contributors over time to Spark:
> > > https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed
> > > when we started merging each patch as a single commit). This is not
> > > slowing down, and we need to have the culture now that we treat API
> > > stability and release numbers at the level expected for a 1.0 project
> > > instead of having people come in and randomly change the API.
> > >
> > > I’ll also note that the issues marked “blocker” were marked so by their
> > > reporters, since the reporter can set the priority. I don’t consider stuff
> > > like parallelize() not partitioning ranges in the same way as other
> > > collections a blocker — it’s a bug, it would be good to fix it, but it only
> > > affects a small number of use cases. Of course if we find a real blocker
> > > (in particular a regression from a previous version, or a feature that’s
> > > just completely broken), we will delay the release for that, but at some
> > > point you have to say “okay, this fix will go into the next maintenance
> > > release”. Maybe we need to write a clear policy for what the issue
> > > priorities mean.
> > >
> > > Finally, I believe it’s much better to have a culture where you can make
> > > releases on a regular schedule, and have the option to make a maintenance
> > > release in 3-4 days if you find new bugs, than one where you pile up stuff
> > > into each release. This is what much larger projects than us, like Linux,
> > > do, and it’s the only way to avoid indefinite stalling with a large
> > > contributor base. In the worst case, if you find a new bug that warrants
> > > immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on
> > > 1.0.1 in three days with just your bug fix in it). And if you find an API
> > > that you’d like to improve, just add a new one and maybe deprecate the old
> > > one — at some point we have to respect our users and let them know that
> > > code they write today will still run tomorrow.
> > >
> > > Matei
> > >
> > > On May 17, 2014, at 10:32 AM, Kan Zhang <kzh...@apache.org> wrote:
> > >
> > > > +1 on the running commentary here, non-binding of course :-)
> > > >
> > > >
> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <and...@andrewash.com> wrote:
> > > >
> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mri...@gmail.com> wrote:
> > > >>
> > >>> I had echoed similar sentiments a while back when there was a discussion
> > >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> > >>> changes, add missing functionality, go through a hardening release before
> > >>> 1.0
> > > >>>
> > > >>> But the community preferred a 1.0 :-)
> > > >>>
> > > >>> Regards,
> > > >>> Mridul
> > > >>>
> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > > >>>>
> > > >>>> On this note, non-binding commentary:
> > > >>>>
> > > >>>> Releases happen in local minima of change, usually created by
> > > >>>> internally enforced code freeze. Spark is incredibly busy now due to
> > > >>>> external factors -- recently a TLP, recently discovered by a large new
> > > >>>> audience, ease of contribution enabled by Github. It's getting like
> > > >>>> the first year of mainstream battle-testing in a month. It's been very
> > > >>>> hard to freeze anything! I see a number of non-trivial issues being
> > > >>>> reported, and I don't think it has been possible to triage all of
> > > >>>> them, even.
> > > >>>>
> > > >>>> Given the high rate of change, my instinct would have been to release
> > > >>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
> > > >>>> significant issues will slow down.
> > > >>>>
> > > >>>> Version ain't nothing but a number, but if it has any meaning it's the
> > > >>>> semantic versioning meaning. 1.0 imposes extra handicaps around
> > > >>>> striving to maintain backwards-compatibility. That may end up being
> > > >>>> bent to fit in important changes that are going to be required in this
> > > >>>> continuing period of change. Hadoop does this all the time
> > > >>>> unfortunately and gets away with it, I suppose -- minor version
> > > >>>> releases are really major. (On the other extreme, HBase is at 0.98 and
> > > >>>> quite production-ready.)
> > > >>>>
> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x rather
> > > >>>> than new features and 1.x. I think there are a few steps that could
> > > >>>> streamline triage of this flood of contributions, and make all of this
> > > >>>> easier, but that's for another thread.
> > > >>>>
> > > >>>>
> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <m...@clearstorydata.com>
> > > >>>> wrote:
> > > >>>>> +1, but just barely.  We've got quite a number of outstanding bugs
> > > >>>>> identified, and many of them have fixes in progress.  I'd hate to see
> > > >>>>> those efforts get lost in a post-1.0.0 flood of new features targeted
> > > >>>>> at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high
> > > >>>>> priority relative to 1.1.0.
> > > >>>>>
> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any of the
> > > >>>>> identified bugs are show-stoppers or strictly regressions (although I
> > > >>>>> will note that one that I have in progress, SPARK-1749, is a bug that
> > > >>>>> we introduced with recent work -- it's not strictly a regression
> > > >>>>> because we had equally bad but different behavior when the
> > > >>>>> DAGScheduler exceptions weren't previously being handled at all vs.
> > > >>>>> being slightly mis-handled now), so I'm not currently seeing a reason
> > > >>>>> not to release.
> > > >>>
> > > >>
> > >
> > >
> >
