On lazy consensus as opposed to voting:

First, why lazy consensus? The proposal was for consensus, which is at least three +1 votes and no vetoes. Consensus has no losing side; it requires getting to a point where there is agreement. Isn't that agreement what we want to achieve with these proposals?
Second, lazy consensus only removes the requirement for three +1 votes. Why would we not want at least three committers to think something is a good idea before adopting the proposal?

rb

On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <c...@koeninger.org> wrote:
> So there are some minor things (the Where section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current / past SIPs), but it doesn't look like I can comment on the Google doc.
>
> The major substantive issue I have is that this version is significantly less clear as to the outcome of an SIP.
>
> The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement and an explicit deadline, which I think are necessary for clarity.
>
> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
> > It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
> >
> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
> >> Oops. Let me try to figure that out.
> >>
> >> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
> >>> Thanks for picking up on this.
> >>>
> >>> Maybe I fail at Google Docs, but I can't see any edits on the document you linked.
> >>>
> >>> Regarding lazy consensus: if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
> >>>
> >>> The other points are hard to comment on without being able to see the text in question.
> >>>
> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
> >>> > I just looked through the entire thread again tonight; there are a lot of great ideas being discussed.
> >>> > Thanks Cody for taking the first crack at the proposal.
> >>> >
> >>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
> >>> >
> >>> > To that end, the two biggest areas for improvement, in my opinion, are:
> >>> >
> >>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people who don't follow closely, it is difficult to know what the important initiatives are. Even for people who do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
> >>> >
> >>> > 2. Soliciting user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
> >>> >
> >>> > I've taken Cody's doc and edited it:
> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> >>> > (I've made all my modifications trackable.)
> >>> >
> >>> > There are a couple of high-level changes I made:
> >>> >
> >>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting, the reason being that in voting there can easily be a "loser" that gets outvoted.
> >>> >
> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch".
> >>> > Echoing one of the earlier emails: "IMHO so far, aside from tagging things and linking them elsewhere, simply having design docs and prototype implementations in PRs is not something that has worked so far".
> >>> >
> >>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
> >>> >
> >>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...
> >>> >
> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
> >>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
> >>> >>
> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
> >>> >>>
> >>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for an SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
> >>> >>>
> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
> >>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
> >>> >>> >
> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >>> >>> >
> >>> >>> > Or are we going to let this discussion die on the vine?
> >>> >>> >
> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> >>> >>> >> Maybe my mail was not clear enough.
> >>> >>> >>
> >>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
> >>> >>> >>
> >>> >>> >> - why some people are doing bad PR for Spark
> >>> >>> >>
> >>> >>> >> - how, in an easy way, we can change that and show that Spark is still on top
> >>> >>> >>
> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
> >>> >>> >>
> >>> >>> >> About real-time streaming: I think it would just be good to see it in Spark. I very much like the current Spark model, but there are many voices saying "we need more"; the community should also listen to them and try to help them. With SIPs it would be easier. I've just posted this example as a "thing that may be changed with an SIP".
> >>> >>> >>
> >>> >>> >> I very much like the unification via Datasets, but there are a lot of algorithms inside. Let's make an easy API, but with strong background material (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
> >>> >>> >>
> >>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, so from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
> >>> >>> >>
> >>> >>> >> Pozdrawiam / Best regards,
> >>> >>> >>
> >>> >>> >> Tomasz
> >>> >>> >>
> >>> >>> >> ________________________________
> >>> >>> >> From: Cody Koeninger <c...@koeninger.org>
> >>> >>> >> Sent: October 17, 2016, 16:46
> >>> >>> >> To: Debasish Das
> >>> >>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
> >>> >>> >> Subject: Re: Spark Improvement Proposals
> >>> >>> >>
> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
> >>> >>> >>
> >>> >>> >> My point is: evolve or die. Spark's governance and organization are hampering its ability to evolve technologically, and that needs to change.
> >>> >>> >>
> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> >>> >>> >>> Thanks Cody for bringing up a valid point. I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun. But now, as we go deeper with Spark and the real-time streaming use case gets more prominent, I think it is time to bring in a messaging model in conjunction with the batch/micro-batch API that Spark is good at. A close integration of akka-streams with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink. Spark 2.0 integrated streaming with batch on the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
> >>> >>> >>>
> >>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
> >>> >>> >>>
> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>> >>> >>>
> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>> >>> >>>
> >>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch. We (and I am sure many others) are pushing Spark as an engine for stream and query processing; we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
> >>> >>> >>>
> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
> >>> >>> >>>>
> >>> >>> >>>> Hi everyone,
> >>> >>> >>>>
> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on the negative posts about Spark and about "haters".
> >>> >>> >>>>
> >>> >>> >>>> I really like Spark. Ease of use, speed, a very good community: it's all here. But every project has to "fight" on the "framework market" to remain number 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
> >>> >>> >>>>
> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases.
> >>> >>> >>>> In my opinion, many of these discussions were started after Flink's marketing-like posts. Please look at StackOverflow "Flink vs ...." posts: almost every one is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMCs) just post the same information about real-time streaming, delta iterations, etc. It looks smart, and very often it is marked as the answer, even if, in my opinion, not all of the truth was told.
> >>> >>> >>>>
> >>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? Just saying, you're the most visible in the community :) ) could perform a performance test of:
> >>> >>> >>>>
> >>> >>> >>>> - the streaming engine (probably Spark will lose because of the mini-batch model; however, currently the difference should be much lower than in previous versions)
> >>> >>> >>>>
> >>> >>> >>>> - Machine Learning models
> >>> >>> >>>>
> >>> >>> >>>> - batch jobs
> >>> >>> >>>>
> >>> >>> >>>> - graph jobs
> >>> >>> >>>>
> >>> >>> >>>> - SQL queries
> >>> >>> >>>>
> >>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above, people may think "it is outdated, the future is in framework X".
> >>> >>> >>>>
> >>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability. Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
> >>> >>> >>>>
> >>> >>> >>>> Second: real-time streaming. I wrote some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for an SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too high a latency". Spark Streaming is doing a very good job with micro-batches, but I think it is possible to also add more real-time processing.
> >>> >>> >>>>
> >>> >>> >>>> Other people have said much more, and I agree with the SIP proposal.
> >>> >>> >>>> I'm also happy that the PMCs are not saying that they will not listen to users; they really want to make Spark better for every user.
> >>> >>> >>>>
> >>> >>> >>>> What do you think about these two topics? I'm especially looking at Cody (who started this topic) and the PMCs :)
> >>> >>> >>>>
> >>> >>> >>>> Pozdrawiam / Best regards,
> >>> >>> >>>>
> >>> >>> >>>> Tomasz

--
Ryan Blue
Software Engineer
Netflix