+1 on all counts (consensus, time bound, define roles)

I can update the doc in the next few days and share it back. Then maybe we
can just hold an official vote on it. As Tim suggested, we might not get it
100% right the first time and would need to iterate. But that's fine.


On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <timhun...@databricks.com> wrote:

> Hi Cody,
> thank you for bringing up this topic, I agree it is very important to keep
> a cohesive community around some common, fluid goals. Here are a few
> comments about the current document:
>
> 1. name: it should not overlap with an existing one such as SIP. Can you
> imagine someone trying to discuss a Scala spore proposal for Spark?
> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
> sounds great.
>
> 2. roles: at a high level, SPIPs are meant to reach consensus for
> technical decisions with a lasting impact. As such, the template should
> emphasize the role of the various parties during this process:
>
>  - the SPIP author is responsible for building consensus. She is the
> champion driving the process forward and is responsible for ensuring that
> the SPIP follows the general guidelines. The author should be identified in
> the SPIP. The authorship of a SPIP can be transferred if the current author
> is not interested and someone else wants to move the SPIP forward. There
> should probably be 2-3 authors at most for each SPIP.
>
>  - someone with voting power should probably shepherd the SPIP (and be
> recorded as such), ensuring that the final decision over the SPIP is
> recorded (rejected, accepted, etc.) and advising on the technical quality
> of the SPIP. This person need not be a champion for the SPIP or
> contribute to it, but rather makes sure it stands a chance of being
> approved when the vote happens. Also, if the author cannot find anyone
> willing to take this role, the proposal is likely to be rejected anyway.
>
>  - users, committers, contributors have the roles already outlined in the
> document
>
> 3. timeline: ideally, once a SPIP has been offered for voting, it should
> move swiftly into either being accepted or rejected, so that we do not end
> up with a distracting long tail of half-hearted proposals.
>
> These rules are meant to be flexible, but the current document should be
> clear about who is in charge of a SPIP and what state it is currently in.
>
> We have had long discussions over some very important questions such as
> approval. I do not have an opinion on these, but why not pick one option
> and reevaluate the decision later? This is not a binding process at this
> point.
>
> Tim
>
>
> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> I don't have a concern about voting vs. consensus.
>>
>> My concern is that whatever the decision-making process is, it should be
>> explicitly announced on the ticket for the given proposal, with an
>> explicit deadline and an explicit outcome.
>>
>>
>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> I'm also in favor of this.  Thanks for your persistence, Cody.
>>>
>>> My take on the specific issues Joseph mentioned:
>>>
>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>> earlier for consensus:
>>>
>>> > Majority vs consensus: My rationale is that I don't think we want to
>>> consider a proposal approved if it had objections serious enough that
>>> committers down-voted (or PMC depending on who gets a vote). If these
>>> proposals are like PEPs, then they represent a significant amount of
>>> community effort and I wouldn't want to move forward if up to half of the
>>> community thinks it's an untenable idea.
>>>
>>> 2) Design doc template -- agree this would be useful, but also seems
>>> totally orthogonal to moving forward on the SIP proposal.
>>>
>>> 3) agree w/ Joseph's proposal for updating the template.
>>>
>>> One small addition:
>>>
>>> 4) Deciding on a name -- minor, but I think it's worth disambiguating
>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".   At least,
>>> no one has objected.  (I don't care enough that I'd object to anything
>>> else, though.)
>>>
>>>
>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jos...@databricks.com>
>>> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Thanks for being persistent about this.  I too would like to see this
>>>> happen.  Reviewing the thread, it sounds like the main things remaining 
>>>> are:
>>>> * Decide about a few issues
>>>> * Finalize the doc(s)
>>>> * Vote on this proposal
>>>>
>>>> Issues & TODOs:
>>>>
>>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>>> preference here.  It sounds like something which could be tailored based on
>>>> whether we see too many or too few SIPs being approved.
>>>>
>>>> (2) Design doc template  (This would be great to have for Spark
>>>> regardless of this SIP discussion.)
>>>> * Reynold, are you still putting this together?
>>>>
>>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>>> w.r.t. Reynold's draft
>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>> :
>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>> * Add field for stating explicit deadlines for approval
>>>> * Add field for stating Author & Committer shepherd
>>>>
>>>> Thanks all!
>>>> Joseph
>>>>
>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> I'm bumping this one more time for the new year, and then I'm giving
>>>>> up.
>>>>>
>>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>>
>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>> > On lazy consensus as opposed to voting:
>>>>> >
>>>>> > First, why lazy consensus? The proposal was for consensus, which is
>>>>> > at least three +1 votes and no vetoes. Consensus has no losing side;
>>>>> > it requires getting to a point where there is agreement. Isn't that
>>>>> > agreement what we want to achieve with these proposals?
>>>>> >
>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>> > votes. Why would we not want at least three committers to think
>>>>> > something is a good idea before adopting the proposal?
>>>>> >
>>>>> > rb
>>>>> >
>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org>
>>>>> > wrote:
>>>>> >>
>>>>> >> So there are some minor things (the Where section heading appears to
>>>>> >> be dropped; wherever this document is posted it needs to actually
>>>>> >> link to a JIRA filter showing current / past SIPs), but it doesn't
>>>>> >> look like I can comment on the Google doc.
>>>>> >>
>>>>> >> The major substantive issue that I have is that this version is
>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>> >>
>>>>> >> The Apache example of lazy consensus at
>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>>> >> explicit announcement and an explicit deadline, both of which I
>>>>> >> think are necessary for clarity.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com>
>>>>> >> wrote:
>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>> non-owners,
>>>>> >> > so
>>>>> >> > I've just merged all the edits in place. It should be visible now.
>>>>> >> >
>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin
>>>>> >> > <r...@databricks.com> wrote:
>>>>> >> >>
>>>>> >> >> Oops. Let me try to figure that out.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org>
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> Thanks for picking up on this.
>>>>> >> >>>
>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>>>>> document
>>>>> >> >>> you linked.
>>>>> >> >>>
>>>>> >> >>> Regarding lazy consensus, if the board in general has less of
>>>>> an issue
>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts at
>>>>> least
>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>> >> >>>
>>>>> >> >>> The other points are hard to comment on without being able to
>>>>> see the
>>>>> >> >>> text in question.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin
>>>>> >> >>> <r...@databricks.com> wrote:
>>>>> >> >>> > I just looked through the entire thread again tonight - there
>>>>> >> >>> > are a lot of great ideas being discussed. Thanks Cody for
>>>>> >> >>> > taking the first crack at the proposal.
>>>>> >> >>> >
>>>>> >> >>> > I want to first comment on the context. Spark is one of the
>>>>> >> >>> > most innovative and important projects in (big) data -- overall
>>>>> >> >>> > the technical decisions made in Apache Spark are sound. But of
>>>>> >> >>> > course, a project as large and active as Spark always has room
>>>>> >> >>> > for improvement, and we as a community should strive to take it
>>>>> >> >>> > to the next level.
>>>>> >> >>> >
>>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>>> opinion
>>>>> >> >>> > are:
>>>>> >> >>> >
>>>>> >> >>> > 1. Visibility: There is so much happening that it is difficult
>>>>> >> >>> > to know what really is going on. For people who don't follow
>>>>> >> >>> > closely, it is difficult to know what the important initiatives
>>>>> >> >>> > are. Even for people who do follow, it is difficult to know
>>>>> >> >>> > what specific things require their attention, since the number
>>>>> >> >>> > of pull requests and JIRA tickets is high and it's difficult to
>>>>> >> >>> > extract signal from noise.
>>>>> >> >>> >
>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>> >> >>> > themselves) input more proactively: At the end of the day the
>>>>> >> >>> > project provides value because users use it. Users can't tell
>>>>> >> >>> > us exactly what to build, but it is important to get their
>>>>> >> >>> > input.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>> >> >>> > (I've made all my modifications trackable)
>>>>> >> >>> >
>>>>> >> >>> > There are a couple of high-level changes I made:
>>>>> >> >>> >
>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>> >> >>> > consensus as opposed to voting. The reason being that in voting
>>>>> >> >>> > there can easily be a "loser" that gets outvoted.
>>>>> >> >>> >
>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>> >> >>> > "optional design sketch". Echoing one of the earlier emails:
>>>>> >> >>> > "IMHO so far aside from tagging things and linking them
>>>>> >> >>> > elsewhere simply having design docs and prototype
>>>>> >> >>> > implementations in PRs is not something that has not worked so
>>>>> >> >>> > far".
>>>>> >> >>> >
>>>>> >> >>> > 3. I made some language tweaks to focus more on visibility.
>>>>> >> >>> > For example, "The purpose of an SIP is to inform and involve",
>>>>> >> >>> > rather than just "involve". SIPs should also have at least two
>>>>> >> >>> > emails that go to dev@.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>> >> >>> > suggested template for the design doc too. I will get to that
>>>>> >> >>> > as well ...
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin
>>>>> >> >>> > <r...@databricks.com> wrote:
>>>>> >> >>> >>
>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take
>>>>> >> >>> >> a closer look after Nov 1st when we cut the release branch
>>>>> >> >>> >> for 2.1.
>>>>> >> >>> >>
>>>>> >> >>> >>
>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>> >> >>> >> <van...@cloudera.com>
>>>>> >> >>> >> wrote:
>>>>> >> >>> >>>
>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>>>> >> >>> >>> explicitly called out, that voting would happen by e-mail? A
>>>>> >> >>> >>> template for the proposal document (instead of just a bullet
>>>>> >> >>> >>> list) would also be nice, but that can be done at any time.
>>>>> >> >>> >>>
>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a
>>>>> >> >>> >>> candidate for a SIP, given the scope of the work. The
>>>>> >> >>> >>> attached document even somewhat matches the proposed format.
>>>>> >> >>> >>> So if anyone wants to try out the process...
>>>>> >> >>> >>>
>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>> >> >>> >>> <c...@koeninger.org>
>>>>> >> >>> >>> wrote:
>>>>> >> >>> >>> > Now that Spark Summit Europe is over, are any committers
>>>>> >> >>> >>> > interested in moving forward with this?
>>>>> >> >>> >>> >
>>>>> >> >>> >>> >
>>>>> >> >>> >>> >
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>> >> >>> >>> > <tomasz.gaw...@outlook.com> wrote:
>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I didn't want to write "let's focus on Flink" or any other
>>>>> >> >>> >>> >> framework. The idea with benchmarks was to show two things:
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> - how, in an easy way, we can change that and show that
>>>>> >> >>> >>> >> Spark is still on top
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
>>>>> >> >>> >>> >> think they're the most important thing in Spark :) On the
>>>>> >> >>> >>> >> Spark main page there is still the "Spark vs Hadoop"
>>>>> >> >>> >>> >> chart. It is important to show that the framework is not
>>>>> >> >>> >>> >> the same old Spark with another API, but much faster and
>>>>> >> >>> >>> >> more optimized, comparable to or even faster than other
>>>>> >> >>> >>> >> frameworks.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> About real-time streaming, I think it would just be good
>>>>> >> >>> >>> >> to see it in Spark. I really like the current Spark model,
>>>>> >> >>> >>> >> but many voices say "we need more" - the community should
>>>>> >> >>> >>> >> also listen to them and try to help them. With SIPs it
>>>>> >> >>> >>> >> would be easier; I've just posted this example as a "thing
>>>>> >> >>> >>> >> that may be changed with a SIP".
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I really like the unification via Datasets, but there are
>>>>> >> >>> >>> >> a lot of algorithms inside - let's make an easy API, but
>>>>> >> >>> >>> >> with a strong background (articles, benchmarks,
>>>>> >> >>> >>> >> descriptions, etc.) that shows that Spark is still a
>>>>> >> >>> >>> >> modern framework.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said,
>>>>> >> >>> >>> >> organizational ideas were already mentioned and I agree
>>>>> >> >>> >>> >> with them; my mail was just to show some aspects from my
>>>>> >> >>> >>> >> side, i.e. from the side of a developer and a person who
>>>>> >> >>> >>> >> is trying to help others with Spark (via StackOverflow or
>>>>> >> >>> >>> >> other ways).
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Best regards,
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Tomasz
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> ________________________________
>>>>> >> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>>>> >> >>> >>> >> Sent: 17 October 2016 16:46
>>>>> >> >>> >>> >> To: Debasish Das
>>>>> >> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>>>> >> >>> >>> >> Subject: Re: Spark Improvement Proposals
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>> >> >>> >>> >> missing my point.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>> >> >>> >>> >> organization are hampering its ability to evolve
>>>>> >> >>> >>> >> technologically, and they need to change.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>> >> >>> >>> >> <debasish.da...@gmail.com>
>>>>> >> >>> >>> >> wrote:
>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point... I picked up
>>>>> >> >>> >>> >>> Spark in 2014 as soon as I looked into it, since compared
>>>>> >> >>> >>> >>> to writing Java map-reduce and Cascading code, Spark made
>>>>> >> >>> >>> >>> writing distributed code fun... But now, as we have gone
>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use cases get
>>>>> >> >>> >>> >>> more prominent, I think it is time to bring a messaging
>>>>> >> >>> >>> >>> model in conjunction with the batch/micro-batch API that
>>>>> >> >>> >>> >>> Spark is good at... Close integration of akka-streams
>>>>> >> >>> >>> >>> with Spark's micro-batching APIs looks like a great
>>>>> >> >>> >>> >>> direction to stay in the game with Apache Flink... Spark
>>>>> >> >>> >>> >>> 2.0 integrated streaming with batch on the assumption
>>>>> >> >>> >>> >>> that micro-batching is sufficient to run SQL commands on
>>>>> >> >>> >>> >>> a stream, but do we really have time to do SQL processing
>>>>> >> >>> >>> >>> on streaming data within 1-2 seconds?
>>>>> >> >>> >>> >>>
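
[To make the latency question above concrete: a minimal sketch, assuming
Spark 2.0-era Structured Streaming APIs and a hypothetical socket source on
localhost:9999. The 1-second ProcessingTime trigger is exactly the
micro-batch interval being debated, shown for illustration rather than as a
recommended setting.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.ProcessingTime

    val spark = SparkSession.builder
      .appName("MicroBatchSqlSketch") // hypothetical app name
      .getOrCreate()
    import spark.implicits._

    // Read a stream of lines; the socket source is a toy source for demos.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // A SQL-style aggregation expressed directly over the stream.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Each micro-batch recomputes the aggregate; the trigger interval is
    // the floor on end-to-end latency in this execution model.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(ProcessingTime("1 second"))
      .start()

    query.awaitTermination()

Whether 1-2 seconds is acceptable is the workload question being raised;
the micro-batch model itself caps how far the trigger can usefully shrink.]
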
>>>>> >> >>> >>> >>> After reading the email chain, I started to look into the
>>>>> >> >>> >>> >>> Flink documentation, and if you compare it with the Spark
>>>>> >> >>> >>> >>> documentation, I think we have major work to do detailing
>>>>> >> >>> >>> >>> out Spark internals so that more people from the
>>>>> >> >>> >>> >>> community start to take an active role in improving the
>>>>> >> >>> >>> >>> issues, so that Spark stays strong compared to Flink.
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> Spark is no longer an engine that works only for
>>>>> >> >>> >>> >>> micro-batch and batch... We (and I am sure many others)
>>>>> >> >>> >>> >>> are pushing Spark as an engine for stream and query
>>>>> >> >>> >>> >>> processing... We need to make it a state-of-the-art
>>>>> >> >>> >>> >>> engine for high-speed streaming data and user queries as
>>>>> >> >>> >>> >>> well!
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>> >> >>> >>> >>> <tomasz.gaw...@outlook.com>
>>>>> >> >>> >>> >>> wrote:
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Hi everyone,
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>> >> >>> >>> >>>> suggestions may help a little bit. :) Many technical and
>>>>> >> >>> >>> >>>> organizational topics were mentioned, but I want to
>>>>> >> >>> >>> >>>> focus on the negative posts about Spark and about
>>>>> >> >>> >>> >>>> "haters".
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good
>>>>> >> >>> >>> >>>> community - it's all here. But every project has to
>>>>> >> >>> >>> >>>> "fight" on the "framework market" to stay number 1. I'm
>>>>> >> >>> >>> >>>> following many Spark and Big Data communities; maybe my
>>>>> >> >>> >>> >>>> mail will inspire someone :)
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> You (every Spark developer; so far I haven't had enough
>>>>> >> >>> >>> >>>> time to join in contributing to Spark) have done an
>>>>> >> >>> >>> >>>> excellent job. So why are some people saying that Flink
>>>>> >> >>> >>> >>>> (or another framework) is better, as was posted on this
>>>>> >> >>> >>> >>>> mailing list? Not because that framework is better in
>>>>> >> >>> >>> >>>> all cases. In my opinion, many of these discussions were
>>>>> >> >>> >>> >>>> started after Flink marketing-like posts. Please look at
>>>>> >> >>> >>> >>>> the StackOverflow "Flink vs ...." posts; almost every
>>>>> >> >>> >>> >>>> one is "won" by Flink. Answers sometimes say nothing
>>>>> >> >>> >>> >>>> about other frameworks; Flink's users (often PMC
>>>>> >> >>> >>> >>>> members) just post the same information about real-time
>>>>> >> >>> >>> >>>> streaming, about delta iterations, etc. It looks smart
>>>>> >> >>> >>> >>>> and very often it is marked as the answer, even if - in
>>>>> >> >>> >>> >>>> my opinion - the whole truth wasn't told.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledge
>>>>> >> >>> >>> >>>> to perform a huge performance test. Maybe some company
>>>>> >> >>> >>> >>>> that supports Spark (Databricks, Cloudera? - just
>>>>> >> >>> >>> >>>> saying, you're the most visible in the community :) )
>>>>> >> >>> >>> >>>> could perform performance tests of:
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - the streaming engine - Spark will probably lose
>>>>> >> >>> >>> >>>> because of the micro-batch model; however, currently the
>>>>> >> >>> >>> >>>> difference should be much lower than in previous
>>>>> >> >>> >>> >>>> versions
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - batch jobs
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - Graph jobs
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - SQL queries
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> People will see that Spark is evolving and is also a
>>>>> >> >>> >>> >>>> modern framework, because after reading the posts
>>>>> >> >>> >>> >>>> mentioned above people may think "it is outdated, the
>>>>> >> >>> >>> >>>> future is in framework X".
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how
>>>>> >> >>> >>> >>>> Spark Structured Streaming beats every other framework
>>>>> >> >>> >>> >>>> in terms of ease-of-use and reliability. Performance
>>>>> >> >>> >>> >>>> tests, done in various environments (for example: a
>>>>> >> >>> >>> >>>> laptop, a small 2-node cluster, a 10-node cluster, a
>>>>> >> >>> >>> >>>> 20-node cluster), could also be very good marketing
>>>>> >> >>> >>> >>>> material to say "hey, you claim you're better, but Spark
>>>>> >> >>> >>> >>>> is still faster and is still getting even faster!". This
>>>>> >> >>> >>> >>>> would be based on facts (just numbers), not opinions. It
>>>>> >> >>> >>> >>>> would be good for companies, for marketing purposes, and
>>>>> >> >>> >>> >>>> for every Spark developer.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Second: real-time streaming. I wrote some time ago
>>>>> >> >>> >>> >>>> about real-time streaming support in Spark Structured
>>>>> >> >>> >>> >>>> Streaming. Some work should be done to make SSS more
>>>>> >> >>> >>> >>>> low-latency, but I think it's possible. Maybe Spark
>>>>> >> >>> >>> >>>> could look at Gearpump, which is also built on top of
>>>>> >> >>> >>> >>>> Akka? I don't know yet; it is a good topic for a SIP.
>>>>> >> >>> >>> >>>> However, I think Spark should have real-time streaming
>>>>> >> >>> >>> >>>> support. Currently I see many posts/comments saying that
>>>>> >> >>> >>> >>>> "Spark's latency is too high". Spark Streaming does a
>>>>> >> >>> >>> >>>> very good job with micro-batches; however, I think it is
>>>>> >> >>> >>> >>>> possible to also add more real-time processing.
>>>>> >> >>> >>> >>>>
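
[For reference, the latency knob being pointed at here is the micro-batch
trigger. A small sketch, assuming the same Spark 2.0-era APIs and the
`counts` streaming DataFrame from the earlier sketch; ProcessingTime(0) is
also the default when no trigger is configured.

    import org.apache.spark.sql.streaming.ProcessingTime

    // With an interval of 0 ms, each micro-batch starts as soon as the
    // previous one finishes, so end-to-end latency is bounded by per-batch
    // planning and execution overhead rather than a fixed interval.
    val lowLatencyQuery = counts.writeStream
      .outputMode("complete")
      .format("console")
      .trigger(ProcessingTime(0L))
      .start()

Going below that floor would require a different execution mode than
micro-batching, which is the gap described in this email.]
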
>>>>> >> >>> >>> >>>> Other people have said much more, and I agree with the
>>>>> >> >>> >>> >>>> SIP proposal. I'm also happy that the PMC members are
>>>>> >> >>> >>> >>>> not saying they will not listen to users; rather, they
>>>>> >> >>> >>> >>>> really want to make Spark better for every user.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> What do you think about these two topics? I'm especially
>>>>> >> >>> >>> >>>> looking at Cody (who started this topic) and the PMC :)
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Best regards,
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Tomasz
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>>
>>>>> >> >>> >>
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Ryan Blue
>>>>> > Software Engineer
>>>>> > Netflix
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>> http://databricks.com/
>>>>
>>>
>>>
>>
>
