Here's a new draft that incorporated most of the feedback: https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
I added a specific role for SPIP Author and another one for SPIP Shepherd. On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <[email protected]> wrote: > During the summit, I also had a lot of discussions over similar topics > with multiple Committers and active users. I heard many fantastic ideas. I > believe Spark improvement proposals are good channels to collect > requirements and designs. > > > IMO, we also need to consider priority when working on these items. > Even if a proposal is accepted, it does not mean it will be implemented > and merged immediately. It is not a FIFO queue. > > > Even if some PRs are merged, sometimes we still have to revert them > if the design and implementation were not reviewed carefully. We have to > ensure our quality. Spark is not application software. It is > infrastructure software that is being used by many, many companies. We have > to be very careful in the design and implementation, especially when > adding/changing external APIs. > > > When I developed mainframe infrastructure/middleware software over the > past 6 years, I was involved in discussions with external/internal > customers. The to-do feature list was always above 100 items. Sometimes the > customers felt frustrated when we were unable to deliver on time > due to resource limits and other constraints. Even if they paid us billions, we > still needed to do it phase by phase, or sometimes they had to accept > workarounds. That is the reality everyone has to face, I think. > > > Thanks, > > > Xiao Li > > 2017-02-11 7:57 GMT-08:00 Cody Koeninger <[email protected]>: > >> At the Spark Summit this week, everyone from PMC members to users I had >> never met before was asking me about the Spark improvement proposals >> idea. It's clear that it's a real community need. >> >> But it's been almost half a year, and nothing visible has been done. >> >> Reynold, are you going to do this? >> >> If so, when? >> >> If not, why? 
>> >> You already did the right thing by including long-deserved committers. >> Please keep doing the right thing for the community. >> >> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <[email protected]> wrote: >> >>> +1 on all counts (consensus, time bound, define roles) >>> >>> I can update the doc in the next few days and share back. Then maybe we >>> can just officially vote on this. As Tim suggested, we might not get it >>> 100% right the first time and would need to re-iterate. But that's fine. >>> >>> >>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <[email protected]> >>> wrote: >>> >>>> Hi Cody, >>>> thank you for bringing up this topic, I agree it is very important to >>>> keep a cohesive community around some common, fluid goals. Here are a few >>>> comments about the current document: >>>> >>>> 1. name: it should not overlap with an existing one such as SIP. Can >>>> you imagine someone trying to discuss a scala spore proposal for spark? >>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP >>>> sounds great. >>>> >>>> 2. roles: at a high level, SPIPs are meant to reach consensus for >>>> technical decisions with a lasting impact. As such, the template should >>>> emphasize the role of the various parties during this process: >>>> >>>> - the SPIP author is responsible for building consensus. She is the >>>> champion driving the process forward and is responsible for ensuring that >>>> the SPIP follows the general guidelines. The author should be identified in >>>> the SPIP. The authorship of a SPIP can be transferred if the current author >>>> is not interested and someone else wants to move the SPIP forward. There >>>> should probably be 2-3 authors at most for each SPIP. 
>>>> >>>> - someone with voting power should probably shepherd the SPIP (and be >>>> recorded as such): ensuring that the final decision over the SPIP is >>>> recorded (rejected, accepted, etc.), and advising about the technical >>>> quality of the SPIP: this person need not be a champion for the SPIP or >>>> contribute to it, but rather makes sure it stands a chance of being >>>> approved when the vote happens. Also, if the author cannot find anyone who >>>> would want to take this role, this proposal is likely to be rejected >>>> anyway. >>>> >>>> - users, committers, contributors have the roles already outlined in >>>> the document >>>> >>>> 3. timeline: ideally, once a SPIP has been offered for voting, it >>>> should move swiftly into either being accepted or rejected, so that we do >>>> not end up with a distracting long tail of half-hearted proposals. >>>> >>>> These rules are meant to be flexible, but the current document should >>>> be clear about who is in charge of a SPIP, and the state it is currently >>>> in. >>>> >>>> We have had long discussions over some very important questions such as >>>> approval. I do not have an opinion on these, but why not make a pick and >>>> reevaluate this decision later? This is not a binding process at this >>>> point. >>>> >>>> Tim >>>> >>>> >>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <[email protected]> >>>> wrote: >>>> >>>>> I don't have a concern about voting vs consensus. >>>>> >>>>> I have a concern that whatever the decision making process is, it is >>>>> explicitly announced on the ticket for the given proposal, with an >>>>> explicit >>>>> deadline, and an explicit outcome. >>>>> >>>>> >>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <[email protected]> >>>>> wrote: >>>>> >>>>>> I'm also in favor of this. Thanks for your persistence Cody. >>>>>> >>>>>> My take on the specific issues Joseph mentioned: >>>>>> >>>>>> 1) voting vs. 
consensus -- I agree with the argument Ryan Blue made >>>>>> earlier for consensus: >>>>>> >>>>>> > Majority vs consensus: My rationale is that I don't think we want >>>>>> to consider a proposal approved if it had objections serious enough that >>>>>> committers down-voted (or PMC depending on who gets a vote). If these >>>>>> proposals are like PEPs, then they represent a significant amount of >>>>>> community effort and I wouldn't want to move forward if up to half of the >>>>>> community thinks it's an untenable idea. >>>>>> >>>>>> 2) Design doc template -- agree this would be useful, but also seems >>>>>> totally orthogonal to moving forward on the SIP proposal. >>>>>> >>>>>> 3) agree w/ Joseph's proposal for updating the template. >>>>>> >>>>>> One small addition: >>>>>> >>>>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating >>>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP". At >>>>>> least, >>>>>> no one has objected. (I don't care enough that I'd object to anything >>>>>> else, >>>>>> though.) >>>>>> >>>>>> >>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <[email protected] >>>>>> > wrote: >>>>>> >>>>>>> Hi Cody, >>>>>>> >>>>>>> Thanks for being persistent about this. I too would like to see >>>>>>> this happen. Reviewing the thread, it sounds like the main things >>>>>>> remaining are: >>>>>>> * Decide about a few issues >>>>>>> * Finalize the doc(s) >>>>>>> * Vote on this proposal >>>>>>> >>>>>>> Issues & TODOs: >>>>>>> >>>>>>> (1) The main issue I see above is voting vs. consensus. I have >>>>>>> little preference here. It sounds like something which could be >>>>>>> tailored >>>>>>> based on whether we see too many or too few SIPs being approved. >>>>>>> >>>>>>> (2) Design doc template (This would be great to have for Spark >>>>>>> regardless of this SIP discussion.) >>>>>>> * Reynold, are you still putting this together? >>>>>>> >>>>>>> (3) Template cleanups. 
Listing some items mentioned above + a new >>>>>>> one w.r.t. Reynold's draft >>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#> >>>>>>> : >>>>>>> * Reinstate the "Where" section with links to current and past SIPs >>>>>>> * Add field for stating explicit deadlines for approval >>>>>>> * Add field for stating Author & Committer shepherd >>>>>>> >>>>>>> Thanks all! >>>>>>> Joseph >>>>>>> >>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> I'm bumping this one more time for the new year, and then I'm >>>>>>>> giving up. >>>>>>>> >>>>>>>> Please, fix your process, even if it isn't exactly the way I >>>>>>>> suggested. >>>>>>>> >>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <[email protected]> >>>>>>>> wrote: >>>>>>>> > On lazy consensus as opposed to voting: >>>>>>>> > >>>>>>>> > First, why lazy consensus? The proposal was for consensus, which >>>>>>>> is at least >>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it >>>>>>>> requires >>>>>>>> > getting to a point where there is agreement. Isn't that agreement >>>>>>>> what we >>>>>>>> > want to achieve with these proposals? >>>>>>>> > >>>>>>>> > Second, lazy consensus only removes the requirement for three +1 >>>>>>>> votes. Why >>>>>>>> > would we not want at least three committers to think something is >>>>>>>> a good >>>>>>>> > idea before adopting the proposal? >>>>>>>> > >>>>>>>> > rb >>>>>>>> > >>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger < >>>>>>>> [email protected]> wrote: >>>>>>>> >> >>>>>>>> >> So there are some minor things (the Where section heading >>>>>>>> appears to >>>>>>>> >> be dropped; wherever this document is posted it needs to >>>>>>>> actually link >>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't >>>>>>>> look like >>>>>>>> >> I can comment on the google doc. 
>>>>>>>> >> >>>>>>>> >> The major substantive issue that I have is that this version is >>>>>>>> >> significantly less clear as to the outcome of an SIP. >>>>>>>> >> >>>>>>>> >> The apache example of lazy consensus at >>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an >>>>>>>> >> explicit announcement of an explicit deadline, both of which I think are >>>>>>>> >> necessary for clarity. >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <[email protected]> >>>>>>>> wrote: >>>>>>>> >> > It turned out suggested edits (trackable) don't show up for non-owners, >>>>>>>> >> > so >>>>>>>> >> > I've just merged all the edits in place. It should be visible >>>>>>>> now. >>>>>>>> >> > >>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin < >>>>>>>> [email protected]> >>>>>>>> >> > wrote: >>>>>>>> >> >> >>>>>>>> >> >> Oops. Let me try to figure that out. >>>>>>>> >> >> >>>>>>>> >> >> >>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger < >>>>>>>> [email protected]> wrote: >>>>>>>> >> >>> >>>>>>>> >> >>> Thanks for picking up on this. >>>>>>>> >> >>> >>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the document >>>>>>>> >> >>> you linked. >>>>>>>> >> >>> >>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of an issue >>>>>>>> >> >>> with that, sure. As long as it is clearly announced, lasts at least >>>>>>>> >> >>> 72 hours, and has a clear outcome. >>>>>>>> >> >>> >>>>>>>> >> >>> The other points are hard to comment on without being able to see the >>>>>>>> >> >>> text in question. >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin < >>>>>>>> [email protected]> >>>>>>>> >> >>> wrote: >>>>>>>> >> >>> > I just looked through the entire thread again tonight - there are a >>>>>>>> >> >>> > lot >>>>>>>> >> >>> > of >>>>>>>> >> >>> > great ideas being discussed. 
Thanks Cody for taking the >>>>>>>> first crack >>>>>>>> >> >>> > at >>>>>>>> >> >>> > the >>>>>>>> >> >>> > proposal. >>>>>>>> >> >>> > >>>>>>>> >> >>> > I want to first comment on the context. Spark is one of the most >>>>>>>> >> >>> > innovative >>>>>>>> >> >>> > and important projects in (big) data -- overall, the technical decisions >>>>>>>> >> >>> > made in >>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large and active >>>>>>>> >> >>> > as >>>>>>>> >> >>> > Spark always has room for improvement, and we as a community should >>>>>>>> >> >>> > strive >>>>>>>> >> >>> > to take it to the next level. >>>>>>>> >> >>> > >>>>>>>> >> >>> > To that end, the two biggest areas for improvement in my opinion >>>>>>>> >> >>> > are: >>>>>>>> >> >>> > >>>>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult to >>>>>>>> >> >>> > know >>>>>>>> >> >>> > what >>>>>>>> >> >>> > really is going on. For people that don't follow closely, it is >>>>>>>> >> >>> > difficult to >>>>>>>> >> >>> > know what the important initiatives are. Even for people that do >>>>>>>> >> >>> > follow, it >>>>>>>> >> >>> > is difficult to know what specific things require their attention, >>>>>>>> >> >>> > since the >>>>>>>> >> >>> > number of pull requests and JIRA tickets is high and it's difficult >>>>>>>> >> >>> > to >>>>>>>> >> >>> > extract signal from noise. >>>>>>>> >> >>> > >>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers themselves) >>>>>>>> >> >>> > input >>>>>>>> >> >>> > more proactively: At the end of the day the project provides value >>>>>>>> >> >>> > because >>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build, but it is >>>>>>>> >> >>> > important >>>>>>>> >> >>> > to get their input. 
>>>>>>>> >> >>> > >>>>>>>> >> >>> > >>>>>>>> >> >>> > I've taken Cody's doc and edited it: >>>>>>>> >> >>> > >>>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b >>>>>>>> >> >>> > (I've made all my modifications trackable) >>>>>>>> >> >>> > >>>>>>>> >> >>> > There are a couple of high-level changes I made: >>>>>>>> >> >>> > >>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy consensus >>>>>>>> >> >>> > as >>>>>>>> >> >>> > opposed to voting. The reason being that in voting there can easily be a >>>>>>>> >> >>> > "loser" >>>>>>>> >> >>> > that gets outvoted. >>>>>>>> >> >>> > >>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional >>>>>>>> >> >>> > design >>>>>>>> >> >>> > sketch". Echoing one of the earlier emails: "IMHO so far aside from >>>>>>>> >> >>> > tagging >>>>>>>> >> >>> > things and linking them elsewhere simply having design docs and >>>>>>>> >> >>> > prototype >>>>>>>> >> >>> > implementations in PRs is not something that has not worked so far". >>>>>>>> >> >>> > >>>>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility. For >>>>>>>> >> >>> > example, >>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than just >>>>>>>> >> >>> > "involve". SIPs should also have at least two emails that go to >>>>>>>> >> >>> > dev@. >>>>>>>> >> >>> > >>>>>>>> >> >>> > >>>>>>>> >> >>> > While I was editing this, I thought we really needed a suggested >>>>>>>> >> >>> > template >>>>>>>> >> >>> > for design docs too. I will get to that too ... 
>>>>>>>> >> >>> > >>>>>>>> >> >>> > >>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin < >>>>>>>> [email protected]> >>>>>>>> >> >>> > wrote: >>>>>>>> >> >>> >> >>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a >>>>>>>> >> >>> >> closer >>>>>>>> >> >>> >> look >>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1. >>>>>>>> >> >>> >> >>>>>>>> >> >>> >> >>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin >>>>>>>> >> >>> >> <[email protected]> >>>>>>>> >> >>> >> wrote: >>>>>>>> >> >>> >>> >>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not >>>>>>>> >> >>> >>> explicitly >>>>>>>> >> >>> >>> called out, that voting would happen by e-mail? A template for the >>>>>>>> >> >>> >>> proposal document (instead of just a bullet list) would also be >>>>>>>> >> >>> >>> nice, >>>>>>>> >> >>> >>> but that can be done at any time. >>>>>>>> >> >>> >>> >>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a >>>>>>>> >> >>> >>> candidate >>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document attached even >>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to try >>>>>>>> >> >>> >>> out >>>>>>>> >> >>> >>> the process... >>>>>>>> >> >>> >>> >>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger >>>>>>>> >> >>> >>> <[email protected]> >>>>>>>> >> >>> >>> wrote: >>>>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers >>>>>>>> >> >>> >>> > interested >>>>>>>> >> >>> >>> > in >>>>>>>> >> >>> >>> > moving forward with this? 
>>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine? >>>>>>>> >> >>> >>> > >>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda >>>>>>>> >> >>> >>> > <[email protected]> wrote: >>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough. >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other >>>>>>>> >> >>> >>> >> framework. >>>>>>>> >> >>> >>> >> The >>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things: >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> - how - in an easy way - we can change it and show that Spark is >>>>>>>> >> >>> >>> >> still on >>>>>>>> >> >>> >>> >> the >>>>>>>> >> >>> >>> >> top >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think >>>>>>>> >> >>> >>> >> they're the >>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page there >>>>>>>> >> >>> >>> >> is >>>>>>>> >> >>> >>> >> still the >>>>>>>> >> >>> >>> >> chart >>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that the framework is >>>>>>>> >> >>> >>> >> not >>>>>>>> >> >>> >>> >> the >>>>>>>> >> >>> >>> >> same >>>>>>>> >> >>> >>> >> Spark with another API, but much faster and more optimized, comparable >>>>>>>> >> >>> >>> >> to or >>>>>>>> >> >>> >>> >> even >>>>>>>> >> >>> >>> >> faster than other frameworks. 
>>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> About real-time streaming, I think it would just be good to see >>>>>>>> >> >>> >>> >> it >>>>>>>> >> >>> >>> >> in >>>>>>>> >> >>> >>> >> Spark. >>>>>>>> >> >>> >>> >> I really like the current Spark model, but there are many voices saying "we >>>>>>>> >> >>> >>> >> need >>>>>>>> >> >>> >>> >> more" - >>>>>>>> >> >>> >>> >> the community should also listen to them and try to help them. With >>>>>>>> >> >>> >>> >> SIPs >>>>>>>> >> >>> >>> >> it >>>>>>>> >> >>> >>> >> would >>>>>>>> >> >>> >>> >> be easier; I've just posted this example as a "thing that may be >>>>>>>> >> >>> >>> >> changed >>>>>>>> >> >>> >>> >> with a >>>>>>>> >> >>> >>> >> SIP". >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> I really like the unification via Datasets, but there are a lot of >>>>>>>> >> >>> >>> >> algorithms >>>>>>>> >> >>> >>> >> inside - let's make an easy API, but with a strong >>>>>>>> background >>>>>>>> >> >>> >>> >> (articles, >>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc.) that shows that Spark is still a >>>>>>>> >> >>> >>> >> modern >>>>>>>> >> >>> >>> >> framework. 
>>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said, >>>>>>>> >> >>> >>> >> organizational >>>>>>>> >> >>> >>> >> ideas >>>>>>>> >> >>> >>> >> were already mentioned and I agree with them; my mail was just >>>>>>>> >> >>> >>> >> to >>>>>>>> >> >>> >>> >> show >>>>>>>> >> >>> >>> >> some >>>>>>>> >> >>> >>> >> aspects from my side, so from the side of a developer and a person >>>>>>>> >> >>> >>> >> who >>>>>>>> >> >>> >>> >> is >>>>>>>> >> >>> >>> >> trying >>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other ways) >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards, >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> Tomasz >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> ________________________________ >>>>>>>> >> >>> >>> >> From: Cody Koeninger <[email protected]> >>>>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46 >>>>>>>> >> >>> >>> >> To: Debasish Das >>>>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; [email protected] >>>>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my >>>>>>>> >> >>> >>> >> point. >>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> My point is evolve or die. Spark's governance and organization >>>>>>>> >> >>> >>> >> are >>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and it needs >>>>>>>> >> >>> >>> >> to >>>>>>>> >> >>> >>> >> change. 
>>>>>>>> >> >>> >>> >> >>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das >>>>>>>> >> >>> >>> >> <[email protected]> >>>>>>>> >> >>> >>> >> wrote: >>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark >>>>>>>> >> >>> >>> >>> in >>>>>>>> >> >>> >>> >>> 2014 >>>>>>>> >> >>> >>> >>> as >>>>>>>> >> >>> >>> >>> soon as I looked into it since, compared to writing Java >>>>>>>> >> >>> >>> >>> map-reduce >>>>>>>> >> >>> >>> >>> and >>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code fun...But >>>>>>>> >> >>> >>> >>> now >>>>>>>> >> >>> >>> >>> as >>>>>>>> >> >>> >>> >>> we >>>>>>>> >> >>> >>> >>> went >>>>>>>> >> >>> >>> >>> deeper with Spark and the real-time streaming use-case gets more >>>>>>>> >> >>> >>> >>> prominent, I >>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in conjunction >>>>>>>> >> >>> >>> >>> with >>>>>>>> >> >>> >>> >>> the >>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams' >>>>>>>> >> >>> >>> >>> close >>>>>>>> >> >>> >>> >>> integration with Spark micro-batching APIs looks like a great >>>>>>>> >> >>> >>> >>> direction to >>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated >>>>>>>> >> >>> >>> >>> streaming >>>>>>>> >> >>> >>> >>> with >>>>>>>> >> >>> >>> >>> batch with the assumption that micro-batching is sufficient >>>>>>>> >> >>> >>> >>> to >>>>>>>> >> >>> >>> >>> run >>>>>>>> >> >>> >>> >>> SQL >>>>>>>> >> >>> >>> >>> commands on a stream, but do we really have time to do SQL >>>>>>>> >> >>> >>> >>> processing on >>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds? 
>>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the Flink >>>>>>>> >> >>> >>> >>> documentation >>>>>>>> >> >>> >>> >>> and if you compare it with the Spark documentation, I think we >>>>>>>> >> >>> >>> >>> have >>>>>>>> >> >>> >>> >>> major >>>>>>>> >> >>> >>> >>> work >>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more people from the >>>>>>>> >> >>> >>> >>> community >>>>>>>> >> >>> >>> >>> start >>>>>>>> >> >>> >>> >>> to take an active role in addressing the issues so that Spark >>>>>>>> >> >>> >>> >>> stays >>>>>>>> >> >>> >>> >>> strong >>>>>>>> >> >>> >>> >>> compared to Flink. >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals >>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works just for micro-batch and >>>>>>>> >> >>> >>> >>> batch...We >>>>>>>> >> >>> >>> >>> (and >>>>>>>> >> >>> >>> >>> I am sure many others) are pushing Spark as an engine for >>>>>>>> >> >>> >>> >>> stream >>>>>>>> >> >>> >>> >>> and >>>>>>>> >> >>> >>> >>> query >>>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art engine >>>>>>>> >> >>> >>> >>> for >>>>>>>> >> >>> >>> >>> high >>>>>>>> >> >>> >>> >>> speed >>>>>>>> >> >>> >>> >>> streaming data and user queries as well! 
>>>>>>>> >> >>> >>> >>> >>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda >>>>>>>> >> >>> >>> >>> <[email protected]> >>>>>>>> >> >>> >>> >>> wrote: >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Hi everyone, >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may >>>>>>>> >> >>> >>> >>>> help a >>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational topics were >>>>>>>> >> >>> >>> >>>> mentioned, >>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about Spark and >>>>>>>> >> >>> >>> >>>> about >>>>>>>> >> >>> >>> >>>> "haters" >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community >>>>>>>> >> >>> >>> >>>> - >>>>>>>> >> >>> >>> >>>> it's >>>>>>>> >> >>> >>> >>>> all here. But every project has to "fight" on the >>>>>>>> >> >>> >>> >>>> "framework >>>>>>>> >> >>> >>> >>>> market" >>>>>>>> >> >>> >>> >>>> to stay no. 1. I'm following many Spark and Big Data >>>>>>>> >> >>> >>> >>>> communities; >>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :) >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time >>>>>>>> >> >>> >>> >>>> to >>>>>>>> >> >>> >>> >>>> join in >>>>>>>> >> >>> >>> >>>> contributing to Spark) have done an excellent job. So why are >>>>>>>> >> >>> >>> >>>> some >>>>>>>> >> >>> >>> >>>> people >>>>>>>> >> >>> >>> >>>> saying that Flink (or another framework) is better, as was >>>>>>>> >> >>> >>> >>>> posted >>>>>>>> >> >>> >>> >>>> in >>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is better >>>>>>>> >> >>> >>> >>>> in >>>>>>>> >> >>> >>> >>>> all >>>>>>>> >> >>> >>> >>>> cases. 
In my opinion, many of these discussions were >>>>>>>> >> >>> >>> >>>> started >>>>>>>> >> >>> >>> >>>> after >>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at the StackOverflow >>>>>>>> >> >>> >>> >>>> "Flink >>>>>>>> >> >>> >>> >>>> vs >>>>>>>> >> >>> >>> >>>> ...." >>>>>>>> >> >>> >>> >>>> posts; almost every post is "won" by Flink. Answers are >>>>>>>> >> >>> >>> >>>> sometimes >>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks; Flink's users (often >>>>>>>> >> >>> >>> >>>> PMCs) >>>>>>>> >> >>> >>> >>>> are >>>>>>>> >> >>> >>> >>>> just posting the same information about real-time streaming, >>>>>>>> >> >>> >>> >>>> about >>>>>>>> >> >>> >>> >>>> delta >>>>>>>> >> >>> >>> >>>> iterations, etc. It looks smart and very often it is marked as >>>>>>>> >> >>> >>> >>>> the >>>>>>>> >> >>> >>> >>>> answer, >>>>>>>> >> >>> >>> >>>> even if - in my opinion - the whole truth wasn't told. >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to >>>>>>>> >> >>> >>> >>>> perform a >>>>>>>> >> >>> >>> >>>> huge >>>>>>>> >> >>> >>> >>>> performance test. Maybe some company that supports Spark >>>>>>>> >> >>> >>> >>>> (Databricks, >>>>>>>> >> >>> >>> >>>> Cloudera? 
- just saying you're the most visible in the >>>>>>>> community :) ) >>>>>>>> >> >>> >>> >>>> could >>>>>>>> >> >>> >>> >>>> perform performance tests of: >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will lose because of the >>>>>>>> >> >>> >>> >>>> mini-batch >>>>>>>> >> >>> >>> >>>> model, however currently the difference should be much lower >>>>>>>> >> >>> >>> >>>> than in >>>>>>>> >> >>> >>> >>>> previous versions >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - Machine Learning models >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - batch jobs >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - Graph jobs >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> - SQL queries >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a modern >>>>>>>> >> >>> >>> >>>> framework, >>>>>>>> >> >>> >>> >>>> because after reading the posts mentioned above people may think >>>>>>>> >> >>> >>> >>>> "it >>>>>>>> >> >>> >>> >>>> is >>>>>>>> >> >>> >>> >>>> outdated, the future is in framework X". >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark >>>>>>>> >> >>> >>> >>>> Structured >>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of ease-of-use >>>>>>>> >> >>> >>> >>>> and >>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various environments >>>>>>>> >> >>> >>> >>>> (for >>>>>>>> >> >>> >>> >>>> example: a laptop, a small 2-node cluster, a 10-node cluster, a >>>>>>>> >> >>> >>> >>>> 20-node >>>>>>>> >> >>> >>> >>>> cluster), could also be very good marketing material to say >>>>>>>> >> >>> >>> >>>> "hey, >>>>>>>> >> >>> >>> >>>> you're >>>>>>>> >> >>> >>> >>>> telling us that you're better, but Spark is still faster and is >>>>>>>> >> >>> >>> >>>> still >>>>>>>> >> >>> >>> >>>> getting even faster!". 
This would be based on >>>>>>>> facts (just >>>>>>>> >> >>> >>> >>>> numbers), >>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for marketing >>>>>>>> >> >>> >>> >>>> purposes, >>>>>>>> >> >>> >>> >>>> and >>>>>>>> >> >>> >>> >>>> for every Spark developer >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago about >>>>>>>> >> >>> >>> >>>> real-time >>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work >>>>>>>> >> >>> >>> >>>> should be >>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's possible. >>>>>>>> >> >>> >>> >>>> Maybe >>>>>>>> >> >>> >>> >>>> Spark could look at Gearpump, which is also built on top of >>>>>>>> >> >>> >>> >>>> Akka? >>>>>>>> >> >>> >>> >>>> I >>>>>>>> >> >>> >>> >>>> don't >>>>>>>> >> >>> >>> >>>> know yet; it is a good topic for a SIP. However, I think that >>>>>>>> >> >>> >>> >>>> Spark >>>>>>>> >> >>> >>> >>>> should >>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see many >>>>>>>> >> >>> >>> >>>> posts/comments >>>>>>>> >> >>> >>> >>>> saying that "Spark has too big latency". Spark Streaming is doing >>>>>>>> >> >>> >>> >>>> a very >>>>>>>> >> >>> >>> >>>> good >>>>>>>> >> >>> >>> >>>> job with micro-batches; however, I think it is possible to >>>>>>>> >> >>> >>> >>>> also add >>>>>>>> >> >>> >>> >>>> more >>>>>>>> >> >>> >>> >>>> real-time processing. >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Other people said much more, and I agree with the SIP proposal. >>>>>>>> >> >>> >>> >>>> I'm >>>>>>>> >> >>> >>> >>>> also >>>>>>>> >> >>> >>> >>>> happy that the PMCs are not saying that they will not listen to >>>>>>>> >> >>> >>> >>>> users, >>>>>>>> >> >>> >>> >>>> but >>>>>>>> >> >>> >>> >>>> that they really want to make Spark better for every user. 
>>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm especially >>>>>>>> >> >>> >>> >>>> looking >>>>>>>> >> >>> >>> >>>> at >>>>>>>> >> >>> >>> >>>> Cody >>>>>>>> >> >>> >>> >>>> (who started this topic) and the PMCs :) >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards, >>>>>>>> >> >>> >>> >>>> >>>>>>>> >> >>> >>> >>>> Tomasz >>>>>>>> >> --------------------------------------------------------------------- >>>>>>>> >> To unsubscribe e-mail: [email protected] >>>>>>>> > -- >>>>>>>> > Ryan Blue >>>>>>>> > Software Engineer >>>>>>>> > Netflix >>>>>>> -- >>>>>>> Joseph Bradley >>>>>>> Software Engineer - Machine Learning >>>>>>> Databricks, Inc. >>>>>>> <http://databricks.com/>
