So I think I need to clarify a few things here - particularly since this mail went to the wrong mailing list and a much wider audience than I intended it for :-)
Most of the issues I mentioned are internal implementation details of spark core, which means we can enhance them in future without disruption to our userbase (e.g. the ability to support a large number of input/output partitions. Note: this is on the order of 100k input and output partitions with a uniform spread of keys - very rarely seen outside of some crazy jobs).

Some of the issues I mentioned would require DeveloperApi changes - which are not user exposed: they would impact developer use of these APIs, which are mostly internally provided by spark (for example, fixing blocks > 2G would require a change to the Serializer api).

A smaller fraction might require interface changes - note, I am referring specifically to configuration changes (removing/deprecating some) and possibly newer options to submit/env, etc. - I don't envision any programming api change itself. The only api change we did was from Seq -> Iterable, which is actually to address some of the issues I mentioned (join/cogroup).

Remaining are bugs which need to be addressed, or the feature removed/enhanced - like shuffle consolidation.

There might be a semantic extension of some things like the OFF_HEAP storage level to address other computation models - but that would not have an impact on the end user, since other options would be pluggable with the default set to Tachyon, so that there is no user expectation change.

So will the interface possibly change? Sure, though we will try to keep it backward compatible (as we did with 1.0). Will the api change - other than backward compatible enhancements - probably not.

(To make a few of these points concrete - the Iterable change, OFF_HEAP, and the 2G block limit - I have appended some rough sketches at the very end of this mail, below the quoted thread.)

Regards,
Mridul


On Sun, May 18, 2014 at 12:11 PM, Mridul Muralidharan <[email protected]> wrote:
>
> On 18-May-2014 5:05 am, "Mark Hamstra" <[email protected]> wrote:
>>
>> I don't understand. We never said that interfaces wouldn't change from 0.9
>
> Agreed.
>
>> to 1.0. What we are committing to is stability going forward from the 1.0.0 baseline. Nobody is disputing that backward-incompatible behavior or interface changes would be an issue post-1.0.0. The question is whether
>
> The point is, how confident are we that these are the right set of interface definitions. We think it is, but we could also have gone through a 0.10 to vet the proposed 1.0 changes to stabilize them.
>
> To give examples for which we don't have solutions currently (which we are facing internally here btw, so not an academic exercise):
>
> - Current spark shuffle model breaks very badly as the number of partitions increases (input and output).
>
> - As the number of nodes increases, the overhead per node keeps going up. Spark currently is more geared towards large-memory machines; when the RAM per node is modest (8 to 16 gig) but a large number of them are available, it does not do too well.
>
> - Current block abstraction breaks as data per block goes beyond 2 gig.
>
> - Cogroup/join currently breaks when the value per key or the number of keys (or both) is high.
>
> - Shuffle consolidation is so badly broken it is not funny.
>
> - Currently there is no way of effectively leveraging accelerator cards/coprocessors/gpus from spark - to do so, I suspect we will need to redefine OFF_HEAP.
>
> - Effectively leveraging ssd is still an open question IMO when you have a mix of both available.
>
> We have resolved some of these and are looking at the rest. These are not unique to our internal usage profile; I have seen most of these asked elsewhere too.
>
> Thankfully, some of the 1.0 changes actually are geared towards helping to alleviate some of the above (the Iterable change, for example); most of the rest are internal impl details of spark core, which helps a lot - but there are cases where this is not so.
>
> Unfortunately I don't know yet if the unresolved/uninvestigated issues will require more changes or not.
>
> Given this, I am very skeptical of expecting current spark interfaces to be sufficient for the next 1 year (forget 3).
>
> I understand this is an argument which can be made to never release 1.0 :-) Which is why I was ok with a 1.0 instead of a 0.10 release in spite of my preference.
>
> This is a good problem to have IMO ... People are using spark extensively and in circumstances that we did not envision, necessitating changes even to spark core.
>
> But the claim that 1.0 interfaces are stable is not something I buy - they are not, we will need to break them soon, and the cost of maintaining backward compatibility will be high.
>
> We just need to make an informed decision to live with that cost, not hand wave it away.
>
> Regards
> Mridul
>
>> there is anything apparent now that is expected to require such disruptive changes if we were to commit to the current release candidate as our guaranteed 1.0.0 baseline.
>>
>>
>> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan <[email protected]> wrote:
>>
>> > I would make the case for interface stability, not just api stability. Particularly given that we have significantly changed some of our interfaces, I want to ensure developers/users are not seeing red flags.
>> >
>> > Bugs and code stability can be addressed in minor releases if found, but behavioral change and/or interface changes would be a much more invasive issue for our users.
>> >
>> > Regards
>> > Mridul
>> > On 18-May-2014 2:19 am, "Matei Zaharia" <[email protected]> wrote:
>> >
>> > > As others have said, the 1.0 milestone is about API stability, not about saying "we've eliminated all bugs". The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I've seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development.
>> > >
>> > > With a project as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors over time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; "commits" changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project instead of having people come in and randomly change the API.
>> > >
>> > > I'll also note that the issues marked "blocker" were marked so by their reporters, since the reporter can set the priority.
>> > > I don't consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker -- it's a bug, it would be good to fix it, but it only affects a small number of use cases. Of course if we find a real blocker (in particular a regression from a previous version, or a feature that's just completely broken), we will delay the release for that, but at some point you have to say "okay, this fix will go into the next maintenance release". Maybe we need to write a clear policy for what the issue priorities mean.
>> > >
>> > > Finally, I believe it's much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much larger projects than ours, like Linux, do, and it's the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants an immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you'd like to improve, just add a new one and maybe deprecate the old one -- at some point we have to respect our users and let them know that the code they write today will still run tomorrow.
>> > >
>> > > Matei
>> > >
>> > > On May 17, 2014, at 10:32 AM, Kan Zhang <[email protected]> wrote:
>> > >
>> > > > +1 on the running commentary here, non-binding of course :-)
>> > > >
>> > > >
>> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <[email protected]> wrote:
>> > > >
>> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
>> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <[email protected]> wrote:
>> > > >>
>> > > >>> I had echoed similar sentiments a while back when there was a discussion around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api changes, add missing functionality, and go through a hardening release before 1.0.
>> > > >>>
>> > > >>> But the community preferred a 1.0 :-)
>> > > >>>
>> > > >>> Regards,
>> > > >>> Mridul
>> > > >>>
>> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <[email protected]> wrote:
>> > > >>>>
>> > > >>>> On this note, non-binding commentary:
>> > > >>>>
>> > > >>>> Releases happen in local minima of change, usually created by an internally enforced code freeze. Spark is incredibly busy now due to external factors -- recently a TLP, recently discovered by a large new audience, ease of contribution enabled by Github. It's getting like the first year of mainstream battle-testing in a month. It's been very hard to freeze anything! I see a number of non-trivial issues being reported, and I don't think it has been possible to triage all of them, even.
>> > > >>>>
>> > > >>>> Given the high rate of change, my instinct would have been to release 0.10.0 now. But won't it always be very busy? I do think the rate of significant issues will slow down.
>> > > >>>>
>> > > >>>> Version ain't nothing but a number, but if it has any meaning it's the semantic versioning meaning. 1.0 imposes extra handicaps around striving to maintain backwards-compatibility. That may end up being bent to fit in important changes that are going to be required in this continuing period of change. Hadoop does this all the time unfortunately and gets away with it, I suppose -- minor version releases are really major. (On the other extreme, HBase is at 0.98 and quite production-ready.)
>> > > >>>>
>> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x rather than new features and 1.x. I think there are a few steps that could streamline triage of this flood of contributions, and make all of this easier, but that's for another thread.
>> > > >>>>
>> > > >>>>
>> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <[email protected]> wrote:
>> > > >>>>> +1, but just barely. We've got quite a number of outstanding bugs identified, and many of them have fixes in progress. I'd hate to see those efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 -- in other words, I'd like to see 1.0.1 retain a high priority relative to 1.1.0.
>> > > >>>>>
>> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any of the identified bugs are show-stoppers or strictly regressions (although I will note that one that I have in progress, SPARK-1749, is a bug that we introduced with recent work -- it's not strictly a regression because we had equally bad but different behavior when the DAGScheduler exceptions weren't previously being handled at all vs. being slightly mis-handled now), so I'm not currently seeing a reason not to release.
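
PS: to make the Seq -> Iterable point above concrete, here is a rough sketch of what the change looks like from user code (typed from memory, not lifted from the Spark source; it assumes a live SparkContext named sc, e.g. the spark-shell):

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // 0.9.x: grouped is RDD[(String, Seq[Int])]
    // 1.0.0: grouped is RDD[(String, Iterable[Int])]
    val grouped = pairs.groupByKey()

    // Code written against the Iterable contract works with either version,
    // and the looser type leaves room to stream/spill grouped values later
    // instead of materializing them all - which is the join/cogroup issue above.
    grouped.mapValues(vs => vs.sum).collect()   // Array((a,3), (b,3))

cogroup/join change the same way: the grouped values go from (Seq[V], Seq[W]) to (Iterable[V], Iterable[W]).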
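
Similarly for OFF_HEAP: what a user writes today is below (again a rough sketch assuming a live SparkContext named sc and a reachable Tachyon master; the config key is the 1.0-era name as I remember it, so treat it as an assumption). The point is that this code stays exactly the same if the backing store becomes pluggable with Tachyon as the default.

    import org.apache.spark.storage.StorageLevel

    // e.g. with spark.tachyonStore.url=tachyon://host:19998 in spark-defaults.conf
    val data = sc.parallelize(1 to 1000000)
    data.persist(StorageLevel.OFF_HEAP)   // in 1.0 this is served by Tachyon
    data.count()                          // materializes the blocks off-heap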
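
And a rough illustration of the 2G block limit I mentioned (my own sketch of the Int-capacity constraint, not Spark code): block contents end up in Int-indexed structures such as Array[Byte] and java.nio.ByteBuffer, which is why lifting the limit reaches into the Serializer and block-transfer interfaces rather than being a one-line fix.

    import java.nio.ByteBuffer

    val wanted: Long = 3L * 1024 * 1024 * 1024                        // a 3 GB block
    println(s"largest addressable buffer: ${Int.MaxValue} bytes")     // ~2 GB

    // ByteBuffer.allocate takes an Int capacity; 3 GB truncated to Int is
    // negative, so the call below would throw IllegalArgumentException:
    // ByteBuffer.allocate(wanted.toInt)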
