While developers may appreciate "1.0 == API stability," I'm not sure that will 
be the understanding of the VP who gives the green light to a Spark-based 
development effort.

I fear a bug that silently produces erroneous results will be perceived like 
the FDIV bug, but in this case without the momentum of an existing large 
installed base and with a number of "competitors" (GridGain, H2O, 
Stratosphere). Despite the stated intention of API stability, the perception 
(which becomes the reality) of "1.0" is that it's ready for production use -- 
not bullet-proof, but also not shipping with known bugs that silently generate 
erroneous results. Exceptions and crashes are tolerated far more readily than 
silent corruption of data. The result may be a reputation for the Spark team 
of being unconcerned about data integrity.

I ran into (and submitted) https://issues.apache.org/jira/browse/SPARK-1817 due 
to the lack of zipWithIndex(). zip() against a self-created, parallelized range 
was how I was trying to assign IDs to a collection of nodes in preparation for 
the GraphX constructor. For the record, it was a frequent Spark committer who 
escalated it to "blocker"; I did not submit it as such. Partitioning a Scala 
range isn't just a toy example; it has real-life uses.
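
For concreteness, this is roughly the workaround I was attempting -- a sketch 
only, run in spark-shell where sc is the usual SparkContext; the node names 
and partition counts are made up for illustration:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Hypothetical node attributes that need numeric IDs for GraphX.
    val names: RDD[String] = sc.parallelize(Seq("a", "b", "c", "d"), 2)
    val n = names.count()

    // Hand-built ID range, parallelized into the same number of partitions
    // so that zip() can pair the two RDDs element-for-element.
    val ids: RDD[Long] = sc.parallelize(0L until n, names.partitions.length)

    // zip() assumes both RDDs have the same number of partitions and the
    // same number of elements in each partition; if a Range is sliced
    // differently than other collections, the pairing silently goes wrong.
    val vertices: RDD[(VertexId, String)] = ids.zip(names)

    // The edges referencing these IDs would then feed the GraphX constructor:
    // val graph = Graph(vertices, edges)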

I also wonder about the REPL. Cloudera, for example, touts it as key to making 
Spark a "crossover tool" that Data Scientists can also use. The REPL can be 
considered an API of sorts -- not a traditional Scala or Java API, of course, 
but the "API" that a human data analyst would use. With the Scala REPL 
exhibiting some of the same bad behaviors as the Spark REPL, there is a 
question of whether the Spark REPL can even be fixed. If the Spark REPL has to 
be eliminated after 1.0 due to an inability to repair it, that would constitute 
API instability.


 
On Saturday, May 17, 2014 2:49 PM, Matei Zaharia <matei.zaha...@gmail.com> 
wrote:
 
As others have said, the 1.0 milestone is about API stability, not about saying 
“we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can 
confidently build on Spark, knowing that the application they build today will 
still run on Spark 1.9.9 three years from now. This is something that I’ve seen 
done badly (and experienced the effects thereof) in other big data projects, 
such as MapReduce and even YARN. The result is that you annoy users, you end up 
with a fragmented userbase where everyone is building against a different 
version, and you drastically slow down development.

With a project as fast-growing as Spark in particular, there 
will be new bugs discovered and reported continuously, especially in the 
non-core components. Look at the graph of the number of contributors to Spark over time: 
https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when 
we started merging each patch as a single commit). This is not slowing down, 
and we need to have the culture now that we treat API stability and release 
numbers at the level expected for a 1.0 project instead of having people come 
in and randomly change the API.

I’ll also note that the issues marked “blocker” were marked so by their 
reporters, since the reporter can set the priority. I don’t consider stuff like 
parallelize() not partitioning ranges in the same way as other collections a 
blocker — it’s a bug, it would be good to fix it, but it only affects a small 
number of use cases. Of course if we find a real blocker (in particular a 
regression from a previous version, or a feature that’s just completely 
broken), we will delay the release for that, but at some point you have to say 
“okay, this fix will go into the next maintenance release”. Maybe we need to 
write a clear policy for what the issue priorities mean.
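
As a minimal sketch of the range-partitioning discrepancy in question (assuming 
a running spark-shell with the usual sc; the element count and slice count are 
arbitrary):

    // Compare how parallelize() slices a Range versus an equivalent
    // materialized Seq with the same contents and slice count.
    val fromRange = sc.parallelize(1 to 7, 3).glom().collect().map(_.toList)
    val fromSeq   = sc.parallelize((1 to 7).toList, 3).glom().collect().map(_.toList)

    // If the per-partition layouts differ, operations that assume matching
    // partitioning (e.g. zip()) can silently mis-pair data.
    println(fromRange.mkString(" | "))
    println(fromSeq.mkString(" | "))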

Finally, I believe it’s much better to have a culture where you can make 
releases on a regular schedule, and have the option to make a maintenance 
release in 3-4 days if you find new bugs, than one where you pile up stuff into 
each release. This is what much larger projects than ours, like Linux, do, and it’s 
the only way to avoid indefinite stalling with a large contributor base. In the 
worst case, if you find a new bug that warrants immediate release, it goes into 
1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug 
fix in it). And if you find an API that you’d like to improve, just add a new 
one and maybe deprecate the old one — at some point we have to respect our 
users and let them know that code they write today will still run tomorrow.
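
As a minimal sketch of that add-and-deprecate pattern (the class and method 
names here are hypothetical, not actual Spark APIs):

    // Hypothetical library class; names are illustrative only.
    class WordCounter {
      // The old API stays so that existing user code keeps compiling and
      // running; the deprecation message points at the replacement.
      @deprecated("use countWords(text, ignoreCase) instead", "1.1.0")
      def countWords(text: String): Int = countWords(text, ignoreCase = false)

      // The improved API is added alongside the old one rather than
      // replacing it.
      def countWords(text: String, ignoreCase: Boolean): Int = {
        val t = if (ignoreCase) text.toLowerCase else text
        t.split("\\s+").count(_.nonEmpty)
      }
    }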

Matei


On May 17, 2014, at 10:32 AM, Kan Zhang <kzh...@apache.org> wrote:

> +1 on the running commentary here, non-binding of course :-)
> 
> 
> On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <and...@andrewash.com> wrote:
> 
>> +1 on the next release feeling more like a 0.10 than a 1.0
>> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mri...@gmail.com> wrote:
>> 
>>> I had echoed similar sentiments a while back when there was a discussion
>>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
>>> changes, add missing functionality, go through a hardening release before
>>> 1.0
>>> 
>>> But the community preferred a 1.0 :-)
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>>>> 
>>>> On this note, non-binding commentary:
>>>> 
>>>> Releases happen in local minima of change, usually created by
>>>> internally enforced code freeze. Spark is incredibly busy now due to
>>>> external factors -- recently a TLP, recently discovered by a large new
>>>> audience, ease of contribution enabled by Github. It's getting like
>>>> the first year of mainstream battle-testing in a month. It's been very
>>>> hard to freeze anything! I see a number of non-trivial issues being
>>>> reported, and I don't think it has been possible to triage all of
>>>> them, even.
>>>> 
>>>> Given the high rate of change, my instinct would have been to release
>>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
>>>> significant issues will slow down.
>>>> 
>>>> Version ain't nothing but a number, but if it has any meaning it's the
>>>> semantic versioning meaning. 1.0 imposes extra handicaps around
>>>> striving to maintain backwards-compatibility. That may end up being
>>>> bent to fit in important changes that are going to be required in this
>>>> continuing period of change. Hadoop does this all the time
>>>> unfortunately and gets away with it, I suppose -- minor version
>>>> releases are really major. (On the other extreme, HBase is at 0.98 and
>>>> quite production-ready.)
>>>> 
>>>> Just consider this a second vote for focus on fixes and 1.0.x rather
>>>> than new features and 1.x. I think there are a few steps that could
>>>> streamline triage of this flood of contributions, and make all of this
>>>> easier, but that's for another thread.
>>>> 
>>>> 
>>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <m...@clearstorydata.com>
>>>> wrote:
>>>>> +1, but just barely.  We've got quite a number of outstanding bugs
>>>>> identified, and many of them have fixes in progress.  I'd hate to see
>>>>> those efforts get lost in a post-1.0.0 flood of new features targeted
>>>>> at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high
>>>>> priority relative to 1.1.0.
>>>>> 
>>>>> Looking through the unresolved JIRAs, it doesn't look like any of the
>>>>> identified bugs are show-stoppers or strictly regressions (although I
>>>>> will note that one that I have in progress, SPARK-1749, is a bug that
>>>>> we introduced with recent work -- it's not strictly a regression
>>>>> because we had equally bad but different behavior when the
>>>>> DAGScheduler exceptions weren't previously being handled at all vs.
>>>>> being slightly mis-handled now), so I'm not currently seeing a reason
>>>>> not to release.
>>> 
>> 
