Here's a new draft that incorporated most of the feedback: https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
I added a specific role for SPIP Author and another one for SPIP Shepherd. On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <[email protected]> wrote: > During the summit, I also had a lot of discussions over similar topics > with multiple Committers and active users. I heard many fantastic ideas. I > believe Spark improvement proposals are good channels to collect > requirements and designs. > > > IMO, we also need to consider priority when working on these items. > Even if a proposal is accepted, it does not mean it will be implemented > and merged immediately. It is not a FIFO queue. > > > Even if some PRs are merged, sometimes we still have to revert them > if the design and implementation were not reviewed carefully. We have to > ensure our quality. Spark is not application software. It is > infrastructure software that is being used by many, many companies. We have > to be very careful in the design and implementation, especially when > adding/changing external APIs. > > > When I developed mainframe infrastructure/middleware software over the > past 6 years, I was involved in discussions with external/internal > customers. The to-do feature list was always above 100 items. Sometimes the > customers felt frustrated when we were unable to deliver on time > due to resource limits and other constraints. Even if they paid us billions, we > still needed to do it phase by phase, or sometimes they had to accept > workarounds. That is the reality everyone has to face, I think. > > > Thanks, > > > Xiao Li > > 2017-02-11 7:57 GMT-08:00 Cody Koeninger <[email protected]>: > >> At the Spark Summit this week, everyone from PMC members to users I had >> never met before was asking me about the Spark improvement proposals >> idea. It's clear that it's a real community need. >> >> But it's been almost half a year, and nothing visible has been done. >> >> Reynold, are you going to do this? >> >> If so, when? >> >> If not, why? 
>> >> You already did the right thing by including long-deserved committers. >> Please keep doing the right thing for the community. >> >> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <[email protected]> wrote: >> >>> +1 on all counts (consensus, time bound, define roles) >>> >>> I can update the doc in the next few days and share back. Then maybe we >>> can just officially vote on this. As Tim suggested, we might not get it >>> 100% right the first time and would need to re-iterate. But that's fine. >>> >>> >>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <[email protected]> >>> wrote: >>> >>>> Hi Cody, >>>> thank you for bringing up this topic, I agree it is very important to >>>> keep a cohesive community around some common, fluid goals. Here are a few >>>> comments about the current document: >>>> >>>> 1. name: it should not overlap with an existing one such as SIP. Can >>>> you imagine someone trying to discuss a scala spore proposal for spark? >>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP >>>> sounds great. >>>> >>>> 2. roles: at a high level, SPIPs are meant to reach consensus for >>>> technical decisions with a lasting impact. As such, the template should >>>> emphasize the role of the various parties during this process: >>>> >>>> - the SPIP author is responsible for building consensus. She is the >>>> champion driving the process forward and is responsible for ensuring that >>>> the SPIP follows the general guidelines. The author should be identified in >>>> the SPIP. The authorship of a SPIP can be transferred if the current author >>>> is not interested and someone else wants to move the SPIP forward. There >>>> should probably be 2-3 authors at most for each SPIP. 
>>>> >>>> - someone with voting power should probably shepherd the SPIP (and be >>>> recorded as such): ensuring that the final decision over the SPIP is >>>> recorded (rejected, accepted, etc.), and advising about the technical >>>> quality of the SPIP: this person need not be a champion for the SPIP or >>>> contribute to it, but rather makes sure it stands a chance of being >>>> approved when the vote happens. Also, if the author cannot find anyone who >>>> would want to take this role, this proposal is likely to be rejected >>>> anyway. >>>> >>>> - users, committers, contributors have the roles already outlined in >>>> the document >>>> >>>> 3. timeline: ideally, once a SPIP has been offered for voting, it >>>> should move swiftly into either being accepted or rejected, so that we do >>>> not end up with a distracting long tail of half-hearted proposals. >>>> >>>> These rules are meant to be flexible, but the current document should >>>> be clear about who is in charge of a SPIP, and the state it is currently >>>> in. >>>> >>>> We have had long discussions over some very important questions such as >>>> approval. I do not have an opinion on these, but why not make a pick and >>>> reevaluate this decision later? This is not a binding process at this >>>> point. >>>> >>>> Tim >>>> >>>> >>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <[email protected]> >>>> wrote: >>>> >>>>> I don't have a concern about voting vs consensus. >>>>> >>>>> I have a concern that whatever the decision making process is, it is >>>>> explicitly announced on the ticket for the given proposal, with an >>>>> explicit >>>>> deadline, and an explicit outcome. >>>>> >>>>> >>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <[email protected]> >>>>> wrote: >>>>> >>>>>> I'm also in favor of this. Thanks for your persistence Cody. >>>>>> >>>>>> My take on the specific issues Joseph mentioned: >>>>>> >>>>>> 1) voting vs. 
consensus -- I agree with the argument Ryan Blue made >>>>>> earlier for consensus: >>>>>> >>>>>> > Majority vs consensus: My rationale is that I don't think we want >>>>>> to consider a proposal approved if it had objections serious enough that >>>>>> committers down-voted (or PMC depending on who gets a vote). If these >>>>>> proposals are like PEPs, then they represent a significant amount of >>>>>> community effort and I wouldn't want to move forward if up to half of the >>>>>> community thinks it's an untenable idea. >>>>>> >>>>>> 2) Design doc template -- agree this would be useful, but also seems >>>>>> totally orthogonal to moving forward on the SIP proposal. >>>>>> >>>>>> 3) agree w/ Joseph's proposal for updating the template. >>>>>> >>>>>> One small addition: >>>>>> >>>>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating >>>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP". At >>>>>> least, >>>>>> no one has objected. (I don't care enough that I'd object to anything >>>>>> else, >>>>>> though.) >>>>>> >>>>>> >>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <[email protected] >>>>>> > wrote: >>>>>> >>>>>>> Hi Cody, >>>>>>> >>>>>>> Thanks for being persistent about this. I too would like to see >>>>>>> this happen. Reviewing the thread, it sounds like the main things >>>>>>> remaining are: >>>>>>> * Decide about a few issues >>>>>>> * Finalize the doc(s) >>>>>>> * Vote on this proposal >>>>>>> >>>>>>> Issues & TODOs: >>>>>>> >>>>>>> (1) The main issue I see above is voting vs. consensus. I have >>>>>>> little preference here. It sounds like something which could be >>>>>>> tailored >>>>>>> based on whether we see too many or too few SIPs being approved. >>>>>>> >>>>>>> (2) Design doc template (This would be great to have for Spark >>>>>>> regardless of this SIP discussion.) >>>>>>> * Reynold, are you still putting this together? >>>>>>> >>>>>>> (3) Template cleanups. 
Listing some items mentioned above + a new >>>>>>> one w.r.t. Reynold's draft >>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#> >>>>>>> : >>>>>>> * Reinstate the "Where" section with links to current and past SIPs >>>>>>> * Add field for stating explicit deadlines for approval >>>>>>> * Add field for stating Author & Committer shepherd >>>>>>> >>>>>>> Thanks all! >>>>>>> Joseph >>>>>>> >>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> I'm bumping this one more time for the new year, and then I'm >>>>>>>> giving up. >>>>>>>> >>>>>>>> Please, fix your process, even if it isn't exactly the way I >>>>>>>> suggested. >>>>>>>> >>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <[email protected]> >>>>>>>> wrote: >>>>>>>> > On lazy consensus as opposed to voting: >>>>>>>> > >>>>>>>> > First, why lazy consensus? The proposal was for consensus, which >>>>>>>> is at least >>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it >>>>>>>> requires >>>>>>>> > getting to a point where there is agreement. Isn't that agreement >>>>>>>> what we >>>>>>>> > want to achieve with these proposals? >>>>>>>> > >>>>>>>> > Second, lazy consensus only removes the requirement for three +1 >>>>>>>> votes. Why >>>>>>>> > would we not want at least three committers to think something is >>>>>>>> a good >>>>>>>> > idea before adopting the proposal? >>>>>>>> > >>>>>>>> > rb >>>>>>>> > >>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger < >>>>>>>> [email protected]> wrote: >>>>>>>> >> >>>>>>>> >> So there are some minor things (the Where section heading >>>>>>>> appears to >>>>>>>> >> be dropped; wherever this document is posted it needs to >>>>>>>> actually link >>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't >>>>>>>> look like >>>>>>>> >> I can comment on the google doc. 
>>>>>>>> >> >>>>>>>> >> The major substantive issue that I have is that this version is >>>>>>>> >> significantly less clear as to the outcome of an SIP. >>>>>>>> >> >>>>>>>> >> The apache example of lazy consensus at >>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an >>>>>>>> >> explicit announcement of an explicit deadline, both of which I think are >>>>>>>> >> necessary for clarity. >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <[email protected]> >>>>>>>> wrote: >>>>>>>> >> > It turned out suggested edits (trackable) don't show up for non-owners, >>>>>>>> >> > so >>>>>>>> >> > I've just merged all the edits in place. It should be visible >>>>>>>> now. >>>>>>>> >> > >>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin < >>>>>>>> [email protected]> >>>>>>>> >> > wrote: >>>>>>>> >> >> >>>>>>>> >> >> Oops. Let me try to figure that out. >>>>>>>> >> >> >>>>>>>> >> >> >>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger < >>>>>>>> [email protected]> wrote: >>>>>>>> >> >>> >>>>>>>> >> >>> Thanks for picking up on this. >>>>>>>> >> >>> >>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the document >>>>>>>> >> >>> you linked. >>>>>>>> >> >>> >>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue >>>>>>>> >> >>> with that, sure. As long as it is clearly announced, lasts at least >>>>>>>> >> >>> 72 hours, and has a clear outcome. >>>>>>>> >> >>> >>>>>>>> >> >>> The other points are hard to comment on without being able to see the >>>>>>>> >> >>> text in question. >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin < >>>>>>>> [email protected]> >>>>>>>> >> >>> wrote: >>>>>>>> >> >>> > I just looked through the entire thread again tonight - there are a >>>>>>>> >> >>> > lot >>>>>>>> >> >>> > of >>>>>>>> >> >>> > great ideas being discussed. 
Thanks Cody for taking the >>>>>>>> first crack >>>>>>>> >> >>> > at >>>>>>>> >> >>> > the >>>>>>>> >> >>> > proposal. >>>>>>>> >> >>> > >>>>>>>> >> >>> > I want to first comment on the context. Spark is one of the most >>>>>>>> >> >>> > innovative >>>>>>>> >> >>> > and important projects in (big) data -- overall, the technical decisions >>>>>>>> >> >>> > made in >>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large and active >>>>>>>> >> >>> > as >>>>>>>> >> >>> > Spark always has room for improvement, and we as a community should >>>>>>>> >> >>> > strive >>>>>>>> >> >>> > to take it to the next level. >>>>>>>> >> >>> > >>>>>>>> >> >>> > To that end, the two biggest areas for improvement in my opinion >>>>>>>> >> >>> > are: >>>>>>>> >> >>> > >>>>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to >>>>>>>> >> >>> > know >>>>>>>> >> >>> > what >>>>>>>> >> >>> > really is going on. For people that don't follow closely, it is >>>>>>>> >> >>> > difficult to >>>>>>>> >> >>> > know what the important initiatives are. Even for people that do >>>>>>>> >> >>> > follow, it >>>>>>>> >> >>> > is difficult to know what specific things require their attention, >>>>>>>> >> >>> > since the >>>>>>>> >> >>> > number of pull requests and JIRA tickets is high and it's difficult >>>>>>>> >> >>> > to >>>>>>>> >> >>> > extract signal from noise. >>>>>>>> >> >>> > >>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers themselves) >>>>>>>> >> >>> > input >>>>>>>> >> >>> > more proactively: At the end of the day the project provides value >>>>>>>> >> >>> > because >>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build, but it is >>>>>>>> >> >>> > important >>>>>>>> >> >>> > to get their input. 
>>>>>>>> >> >>> > >>>>>>>> >> >>> > >>>>>>>> >> >>> > I've taken Cody's doc and edited it: >>>>>>>> >> >>> > >>>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b >>>>>>>> >> >>> > (I've made all my modifications trackable) >>>>>>>> >> >>> > >>>>>>>> >> >>> > There are a couple of high-level changes I made: >>>>>>>> >> >>> > >>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus >>>>>>>> >> >>> > as >>>>>>>> >> >>> > opposed to voting. The reason being that in voting there can easily be a >>>>>>>> >> >>> > "loser" >>>>>>>> >> >>> > that gets outvoted. >>>>>>>> >> >>> > >>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional >>>>>>>> >> >>> > design >>>>>>>> >> >>> > sketch". Echoing one of the earlier emails: "IMHO so far aside from >>>>>>>> >> >>> > tagging >>>>>>>> >> >>> > things and linking them elsewhere simply having design docs and >>>>>>>> >> >>> > prototype >>>>>>>> >> >>> > implementations in PRs is not something that has not worked so far". >>>>>>>> >> >>> > >>>>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For >>>>>>>> >> >>> > example, >>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than just >>>>>>>> >> >>> > "involve". SIPs should also have at least two emails that go to >>>>>>>> >> >>> > dev@. >>>>>>>> >> >>> > >>>>>>>> >> >>> > >>>>>>>> >> >>> > While I was editing this, I thought we really needed a suggested >>>>>>>> >> >>> > template >>>>>>>> >> >>> > for design docs too. I will get to that too ... 
>>>>>>>> >> >>> > >>>>>>>> >> >>> > >>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin < >>>>>>>> [email protected]> >>>>>>>> >> >>> > wrote: >>>>>>>> >> >>> >> >>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a >>>>>>>> >> >>> >> closer >>>>>>>> >> >>> >> look >>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1. >>>>>>>> >> >>> >> >>>>>>>> >> >>> >> >>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin >>>>>>>> >> >>> >> <[email protected]> >>>>>>>> >> >>> >> wrote: >>>>>>>> >> >>> >>> >>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not >>>>>>>> >> >>> >>> explicitly >>>>>>>> >> >>> >>> called out, that voting would happen by e-mail? A template for the >>>>>>>> >> >>> >>> proposal document (instead of just a bullet list) would also be >>>>>>>> >> >>> >>> nice, >>>>>>>> >> >>> >>> but that can be done at any time. >>>>>>>> >> >>> >>> >>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a >>>>>>>> >> >>> >>> candidate >>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document attached even >>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to try >>>>>>>> >> >>> >>> out >>>>>>>> >> >>> >>> the process... >>>>>>>> >> >>> >>> >>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger >>>>>>>> >> >>> >>> <[email protected]> >>>>>>>> >> >>> >>> wrote: >>>>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers >>>>>>>> >> >>> >>> > interested >>>>>>>> >> >>> >>> > in >>>>>>>> >> >>> >>> > moving forward with this? 
>>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine? >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda >>>>>>>> >> >>> >>> > <[email protected]> wrote: >>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough. >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other >>>>>>>> >> >>> >>> >> framework. >>>>>>>> >> >>> >>> >> The >>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things: >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> - how - in an easy way - we can change it and show that Spark is >>>>>>>> >> >>> >>> >> still on >>>>>>>> >> >>> >>> >> the >>>>>>>> >> >>> >>> >> top >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think >>>>>>>> >> >>> >>> >> they're the >>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page there >>>>>>>> >> >>> >>> >> is >>>>>>>> >> >>> >>> >> still the >>>>>>>> >> >>> >>> >> chart >>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that the framework is >>>>>>>> >> >>> >>> >> not >>>>>>>> >> >>> >>> >> the >>>>>>>> >> >>> >>> >> same >>>>>>>> >> >>> >>> >> Spark with another API, but much faster and more optimized, comparable >>>>>>>> >> >>> >>> >> to or >>>>>>>> >> >>> >>> >> even >>>>>>>> >> >>> >>> >> faster than other frameworks. 
>>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> About real-time streaming, I think it would just be good to see >>>>>>>> >> >>> >>> >> it >>>>>>>> >> >>> >>> >> in >>>>>>>> >> >>> >>> >> Spark. >>>>>>>> >> >>> >>> >> I really like the current Spark model, but there are many voices saying "we >>>>>>>> >> >>> >>> >> need >>>>>>>> >> >>> >>> >> more" - >>>>>>>> >> >>> >>> >> the community should also listen to them and try to help them. With >>>>>>>> >> >>> >>> >> SIPs >>>>>>>> >> >>> >>> >> it >>>>>>>> >> >>> >>> >> would >>>>>>>> >> >>> >>> >> be easier; I've just posted this example as a "thing that may be >>>>>>>> >> >>> >>> >> changed >>>>>>>> >> >>> >>> >> with a >>>>>>>> >> >>> >>> >> SIP". >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> I really like the unification via Datasets, but there are a lot of >>>>>>>> >> >>> >>> >> algorithms >>>>>>>> >> >>> >>> >> inside - let's make an easy API, but with a strong >>>>>>>> background >>>>>>>> >> >>> >>> >> (articles, >>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc.) that shows that Spark is still a >>>>>>>> >> >>> >>> >> modern >>>>>>>> >> >>> >>> >> framework. 
>>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, >>>>>>>> >> >>> >>> >> organizational >>>>>>>> >> >>> >>> >> ideas >>>>>>>> >> >>> >>> >> were already mentioned and I agree with them; my mail was just >>>>>>>> >> >>> >>> >> to >>>>>>>> >> >>> >>> >> show >>>>>>>> >> >>> >>> >> some >>>>>>>> >> >>> >>> >> aspects from my side, so from the side of a developer and a person >>>>>>>> >> >>> >>> >> who >>>>>>>> >> >>> >>> >> is >>>>>>>> >> >>> >>> >> trying >>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other ways) >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards, >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> Tomasz >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> ________________________________ >>>>>>>> >> >>> >>> >> From: Cody Koeninger <[email protected]> >>>>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46 >>>>>>>> >> >>> >>> >> To: Debasish Das >>>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; [email protected] >>>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my >>>>>>>> >> >>> >>> >> point. >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> My point is evolve or die. Spark's governance and organization >>>>>>>> >> >>> >>> >> are >>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and it needs >>>>>>>> >> >>> >>> >> to >>>>>>>> >> >>> >>> >> change. 
>>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das >>>>>>>> >> >>> >>> >> <[email protected]> >>>>>>>> >> >>> >>> >> wrote: >>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark >>>>>>>> >> >>> >>> >>> in >>>>>>>> >> >>> >>> >>> 2014 >>>>>>>> >> >>> >>> >>> as >>>>>>>> >> >>> >>> >>> soon as I looked into it since, compared to writing Java >>>>>>>> >> >>> >>> >>> map-reduce >>>>>>>> >> >>> >>> >>> and >>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code fun...But >>>>>>>> >> >>> >>> >>> now >>>>>>>> >> >>> >>> >>> as >>>>>>>> >> >>> >>> >>> we >>>>>>>> >> >>> >>> >>> went >>>>>>>> >> >>> >>> >>> deeper with Spark and the real-time streaming use-case gets more >>>>>>>> >> >>> >>> >>> prominent, I >>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in conjunction >>>>>>>> >> >>> >>> >>> with >>>>>>>> >> >>> >>> >>> the >>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams' >>>>>>>> >> >>> >>> >>> close >>>>>>>> >> >>> >>> >>> integration with Spark micro-batching APIs looks like a great >>>>>>>> >> >>> >>> >>> direction to >>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated >>>>>>>> >> >>> >>> >>> streaming >>>>>>>> >> >>> >>> >>> with >>>>>>>> >> >>> >>> >>> batch with the assumption that micro-batching is sufficient >>>>>>>> >> >>> >>> >>> to >>>>>>>> >> >>> >>> >>> run >>>>>>>> >> >>> >>> >>> SQL >>>>>>>> >> >>> >>> >>> commands on a stream, but do we really have time to do SQL >>>>>>>> >> >>> >>> >>> processing on >>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds? 
>>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the Flink >>>>>>>> >> >>> >>> >>> documentation >>>>>>>> >> >>> >>> >>> and if you compare it with the Spark documentation, I think we >>>>>>>> >> >>> >>> >>> have >>>>>>>> >> >>> >>> >>> major >>>>>>>> >> >>> >>> >>> work >>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more people from the >>>>>>>> >> >>> >>> >>> community >>>>>>>> >> >>> >>> >>> start >>>>>>>> >> >>> >>> >>> to take an active role in addressing the issues so that Spark >>>>>>>> >> >>> >>> >>> stays >>>>>>>> >> >>> >>> >>> strong >>>>>>>> >> >>> >>> >>> compared to Flink. >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works just for micro-batch and >>>>>>>> >> >>> >>> >>> batch...We >>>>>>>> >> >>> >>> >>> (and >>>>>>>> >> >>> >>> >>> I am sure many others) are pushing Spark as an engine for >>>>>>>> >> >>> >>> >>> stream >>>>>>>> >> >>> >>> >>> and >>>>>>>> >> >>> >>> >>> query >>>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art engine >>>>>>>> >> >>> >>> >>> for >>>>>>>> >> >>> >>> >>> high >>>>>>>> >> >>> >>> >>> speed >>>>>>>> >> >>> >>> >>> streaming data and user queries as well! 
>>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda >>>>>>>> >> >>> >>> >>> <[email protected]> >>>>>>>> >> >>> >>> >>> wrote: >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Hi everyone, >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may >>>>>>>> >> >>> >>> >>>> help a >>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational topics were >>>>>>>> >> >>> >>> >>>> mentioned, >>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about Spark and >>>>>>>> >> >>> >>> >>>> about >>>>>>>> >> >>> >>> >>>> "haters" >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community >>>>>>>> >> >>> >>> >>>> - >>>>>>>> >> >>> >>> >>>> it's >>>>>>>> >> >>> >>> >>>> all here. But every project has to "fight" on the >>>>>>>> >> >>> >>> >>>> "framework >>>>>>>> >> >>> >>> >>>> market" >>>>>>>> >> >>> >>> >>>> to stay no. 1. I'm following many Spark and Big Data >>>>>>>> >> >>> >>> >>>> communities; >>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :) >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time >>>>>>>> >> >>> >>> >>>> to >>>>>>>> >> >>> >>> >>>> join in >>>>>>>> >> >>> >>> >>>> contributing to Spark) have done an excellent job. So why are >>>>>>>> >> >>> >>> >>>> some >>>>>>>> >> >>> >>> >>>> people >>>>>>>> >> >>> >>> >>>> saying that Flink (or another framework) is better, as was >>>>>>>> >> >>> >>> >>>> posted >>>>>>>> >> >>> >>> >>>> in >>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is better >>>>>>>> >> >>> >>> >>>> in >>>>>>>> >> >>> >>> >>>> all >>>>>>>> >> >>> >>> >>>> cases. 
In my opinion, many of these discussions were >>>>>>>> >> >>> >>> >>>> started >>>>>>>> >> >>> >>> >>>> after >>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at the StackOverflow >>>>>>>> >> >>> >>> >>>> "Flink >>>>>>>> >> >>> >>> >>>> vs >>>>>>>> >> >>> >>> >>>> ...." >>>>>>>> >> >>> >>> >>>> posts; almost every post is "won" by Flink. Answers are >>>>>>>> >> >>> >>> >>>> sometimes >>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks; Flink's users (often >>>>>>>> >> >>> >>> >>>> PMCs) >>>>>>>> >> >>> >>> >>>> are >>>>>>>> >> >>> >>> >>>> just posting the same information about real-time streaming, >>>>>>>> >> >>> >>> >>>> about >>>>>>>> >> >>> >>> >>>> delta >>>>>>>> >> >>> >>> >>>> iterations, etc. It looks smart and very often it is marked as >>>>>>>> >> >>> >>> >>>> the >>>>>>>> >> >>> >>> >>>> answer, >>>>>>>> >> >>> >>> >>>> even if - in my opinion - the whole truth wasn't told. >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to >>>>>>>> >> >>> >>> >>>> perform a >>>>>>>> >> >>> >>> >>>> huge >>>>>>>> >> >>> >>> >>>> performance test. Maybe some company that supports Spark >>>>>>>> >> >>> >>> >>>> (Databricks, >>>>>>>> >> >>> >>> >>>> Cloudera? 
- just saying you're the most visible in the >>>>>>>> community :) ) >>>>>>>> >> >>> >>> >>>> could >>>>>>>> >> >>> >>> >>>> perform performance tests of: >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will lose because of the >>>>>>>> >> >>> >>> >>>> mini-batch >>>>>>>> >> >>> >>> >>>> model, however currently the difference should be much lower >>>>>>>> >> >>> >>> >>>> than in >>>>>>>> >> >>> >>> >>>> previous versions >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - Machine Learning models >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - batch jobs >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - Graph jobs >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - SQL queries >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a modern >>>>>>>> >> >>> >>> >>>> framework, >>>>>>>> >> >>> >>> >>>> because after reading the posts mentioned above people may think >>>>>>>> >> >>> >>> >>>> "it >>>>>>>> >> >>> >>> >>>> is >>>>>>>> >> >>> >>> >>>> outdated, the future is in framework X". >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark >>>>>>>> >> >>> >>> >>>> Structured >>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of ease-of-use >>>>>>>> >> >>> >>> >>>> and >>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various environments >>>>>>>> >> >>> >>> >>>> (for >>>>>>>> >> >>> >>> >>>> example: a laptop, a small 2-node cluster, a 10-node cluster, a >>>>>>>> >> >>> >>> >>>> 20-node >>>>>>>> >> >>> >>> >>>> cluster), could also be very good marketing material to say >>>>>>>> >> >>> >>> >>>> "hey, >>>>>>>> >> >>> >>> >>>> you're >>>>>>>> >> >>> >>> >>>> telling us that you're better, but Spark is still faster and is >>>>>>>> >> >>> >>> >>>> still >>>>>>>> >> >>> >>> >>>> getting even faster!". 
This would be based on >>>>>>>> facts (just >>>>>>>> >> >>> >>> >>>> numbers), >>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for marketing >>>>>>>> >> >>> >>> >>>> purposes, >>>>>>>> >> >>> >>> >>>> and >>>>>>>> >> >>> >>> >>>> for every Spark developer >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago about >>>>>>>> >> >>> >>> >>>> real-time >>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work >>>>>>>> >> >>> >>> >>>> should be >>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's possible. >>>>>>>> >> >>> >>> >>>> Maybe >>>>>>>> >> >>> >>> >>>> Spark could look at Gearpump, which is also built on top of >>>>>>>> >> >>> >>> >>>> Akka? >>>>>>>> >> >>> >>> >>>> I >>>>>>>> >> >>> >>> >>>> don't >>>>>>>> >> >>> >>> >>>> know yet; it is a good topic for a SIP. However, I think that >>>>>>>> >> >>> >>> >>>> Spark >>>>>>>> >> >>> >>> >>>> should >>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see many >>>>>>>> >> >>> >>> >>>> posts/comments >>>>>>>> >> >>> >>> >>>> saying that "Spark has too big latency". Spark Streaming is doing >>>>>>>> >> >>> >>> >>>> a very >>>>>>>> >> >>> >>> >>>> good >>>>>>>> >> >>> >>> >>>> job with micro-batches; however, I think it is possible to >>>>>>>> >> >>> >>> >>>> also add >>>>>>>> >> >>> >>> >>>> more >>>>>>>> >> >>> >>> >>>> real-time processing. >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Other people said much more, and I agree with the SIP proposal. >>>>>>>> >> >>> >>> >>>> I'm >>>>>>>> >> >>> >>> >>>> also >>>>>>>> >> >>> >>> >>>> happy that the PMCs are not saying that they will not listen to >>>>>>>> >> >>> >>> >>>> users, >>>>>>>> >> >>> >>> >>>> but >>>>>>>> >> >>> >>> >>>> that they really want to make Spark better for every user. 
>>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm especially >>>>>>>> >> >>> >>> >>>> looking >>>>>>>> >> >>> >>> >>>> at >>>>>>>> >> >>> >>> >>>> Cody >>>>>>>> >> >>> >>> >>>> (who started this topic) and the PMCs :) >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards, >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Tomasz >>>>>>>> >> --------------------------------------------------------------------- >>>>>>>> >> To unsubscribe e-mail: [email protected] >>>>>>>> > -- >>>>>>>> > Ryan Blue >>>>>>>> > Software Engineer >>>>>>>> > Netflix >>>>>>> -- >>>>>>> Joseph Bradley >>>>>>> Software Engineer - Machine Learning >>>>>>> Databricks, Inc. >>>>>>> <http://databricks.com/>
