So there are some minor things (the "Where" section heading appears to be dropped; wherever this document is posted, it needs to actually link to a JIRA filter showing current / past SIPs), but it doesn't look like I can comment on the Google doc.
The major substantive issue that I have is that this version is significantly less clear as to the outcome of an SIP. The Apache example of lazy consensus at http://apache.org/foundation/voting.html#LazyConsensus involves an explicit announcement and an explicit deadline, which I think are necessary for clarity.

On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <r...@databricks.com> wrote:
> It turned out suggested edits (trackable) don't show up for non-owners, so I've just merged all the edits in place. It should be visible now.
>
> On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Oops. Let me try to figure that out.
>>
>> On Monday, November 7, 2016, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>> Thanks for picking up on this.
>>>
>>> Maybe I fail at Google docs, but I can't see any edits on the document you linked.
>>>
>>> Regarding lazy consensus, if the board in general has less of an issue with that, sure. As long as it is clearly announced, lasts at least 72 hours, and has a clear outcome.
>>>
>>> The other points are hard to comment on without being able to see the text in question.
>>>
>>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
>>> > I just looked through the entire thread again tonight - there are a lot of great ideas being discussed. Thanks Cody for taking the first crack at the proposal.
>>> >
>>> > I want to first comment on the context. Spark is one of the most innovative and important projects in (big) data -- overall, the technical decisions made in Apache Spark are sound. But of course, a project as large and active as Spark always has room for improvement, and we as a community should strive to take it to the next level.
>>> >
>>> > To that end, the two biggest areas for improvement in my opinion are:
>>> >
>>> > 1. Visibility: There is so much happening that it is difficult to know what really is going on. For people that don't follow closely, it is difficult to know what the important initiatives are. Even for people that do follow, it is difficult to know what specific things require their attention, since the number of pull requests and JIRA tickets is high and it's difficult to extract signal from noise.
>>> >
>>> > 2. Solicit user (broadly defined, including developers themselves) input more proactively: At the end of the day the project provides value because users use it. Users can't tell us exactly what to build, but it is important to get their input.
>>> >
>>> > I've taken Cody's doc and edited it:
>>> >
>>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> > (I've made all my modifications trackable)
>>> >
>>> > There are a couple of high-level changes I made:
>>> >
>>> > 1. I've consulted a board member and he recommended lazy consensus as opposed to voting. The reason being that in voting there can easily be a "loser" that gets outvoted.
>>> >
>>> > 2. I made it lighter weight, and renamed "strategy" to "optional design sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging things and linking them elsewhere simply having design docs and prototype implementations in PRs is not something that has not worked so far".
>>> >
>>> > 3. I made some language tweaks to focus more on visibility. For example, "The purpose of an SIP is to inform and involve", rather than just "involve". SIPs should also have at least two emails that go to dev@.
>>> >
>>> > While I was editing this, I thought we really needed a suggested template for the design doc too. I will get to that too ...
>>> >
>>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>> >>
>>> >> Most things looked OK to me too, although I do plan to take a closer look after Nov 1st when we cut the release branch for 2.1.
>>> >>
>>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
>>> >>>
>>> >>> The proposal looks OK to me. I assume, even though it's not explicitly called out, that voting would happen by e-mail? A template for the proposal document (instead of just a bullet list) would also be nice, but that can be done at any time.
>>> >>>
>>> >>> BTW, shameless plug: I filed SPARK-18085, which I consider a candidate for a SIP, given the scope of the work. The document attached even somewhat matches the proposed format. So if anyone wants to try out the process...
>>> >>>
>>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>> >>> > Now that Spark Summit Europe is over, are any committers interested in moving forward with this?
>>> >>> >
>>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >
>>> >>> > Or are we going to let this discussion die on the vine?
>>> >>> >
>>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>> >>> >> Maybe my mail was not clear enough.
>>> >>> >>
>>> >>> >> I didn't want to write "let's focus on Flink" or any other framework. The idea with the benchmarks was to show two things:
>>> >>> >>
>>> >>> >> - why some people are doing bad PR for Spark
>>> >>> >>
>>> >>> >> - how, in an easy way, we can change that and show that Spark is still on top
>>> >>> >>
>>> >>> >> No more, no less.
>>> >>> >> Benchmarks will be helpful, but I don't think they're the most important thing in Spark :) On the Spark main page there is still the "Spark vs Hadoop" chart. It is important to show that the framework is not just the same Spark with another API, but much faster and more optimized, comparable to or even faster than other frameworks.
>>> >>> >>
>>> >>> >> About real-time streaming, I think it would just be good to see it in Spark. I really like the current Spark model, but there are many voices saying "we need more" - the community should also listen to them and try to help them. With SIPs it would be easier; I've just posted this example as a "thing that may be changed with a SIP".
>>> >>> >>
>>> >>> >> I really like the unification via Datasets, but there are a lot of algorithms inside - let's make an easy API, but with a strong background (articles, benchmarks, descriptions, etc.) that shows that Spark is still a modern framework.
>>> >>> >>
>>> >>> >> Maybe now my intention will be clearer :) As I said, organizational ideas were already mentioned and I agree with them; my mail was just to show some aspects from my side, that is, from the side of a developer and a person who is trying to help others with Spark (via StackOverflow or other ways).
>>> >>> >>
>>> >>> >> Pozdrawiam / Best regards,
>>> >>> >>
>>> >>> >> Tomasz
>>> >>> >>
>>> >>> >> ________________________________
>>> >>> >> From: Cody Koeninger <c...@koeninger.org>
>>> >>> >> Sent: 17 October 2016 16:46
>>> >>> >> To: Debasish Das
>>> >>> >> CC: Tomasz Gawęda; dev@spark.apache.org
>>> >>> >> Subject: Re: Spark Improvement Proposals
>>> >>> >>
>>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>> >>> >>
>>> >>> >> My point is evolve or die. Spark's governance and organization are hampering its ability to evolve technologically, and that needs to change.
>>> >>> >>
>>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun...But now, as we go deeper with Spark and the real-time streaming use-case gets more prominent, I think it is time to bring a messaging model in conjunction with the batch/micro-batch API that Spark is good at....akka-streams' close integration with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink...Spark 2.0 integrated streaming with batch under the assumption that micro-batching is sufficient to run SQL commands on streams, but do we really have time to do SQL processing on streaming data within 1-2 seconds?
>>> >>> >>>
>>> >>> >>> After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do detailing out Spark internals, so that more people from the community start to take an active role in improving the issues and Spark stays strong compared to Flink.
>>> >>> >>>
>>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>> >>> >>>
>>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>> >>> >>>
>>> >>> >>> Spark is no longer an engine that works only for micro-batch and batch...We (and I am sure many others) are pushing Spark as an engine for stream and query processing.....we need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!
>>> >>> >>>
>>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
>>> >>> >>>>
>>> >>> >>>> Hi everyone,
>>> >>> >>>>
>>> >>> >>>> I'm quite late with my answer, but I think my suggestions may help a little bit. :) Many technical and organizational topics were mentioned, but I want to focus on these negative posts about Spark and about "haters".
>>> >>> >>>>
>>> >>> >>>> I really like Spark. Ease of use, speed, a very good community - it's all here. But every project has to "fight" on the "framework market" to stay no. 1. I'm following many Spark and Big Data communities; maybe my mail will inspire someone :)
>>> >>> >>>>
>>> >>> >>>> You (every Spark developer; so far I didn't have enough time to start contributing to Spark) have done an excellent job. So why are some people saying that Flink (or another framework) is better, as was posted on this mailing list? No, not because that framework is better in all cases.. In my opinion, many of these discussions were started after Flink marketing-like posts.
>>> >>> >>>> Please look at the StackOverflow "Flink vs ...." posts; almost every such post is "won" by Flink. The answers sometimes say nothing about other frameworks; Flink's users (often PMC members) just post the same information about real-time streaming, about delta iterations, etc. It looks smart and very often is marked as the answer, even if - in my opinion - the whole truth wasn't told.
>>> >>> >>>>
>>> >>> >>>> My suggestion: I don't have enough money and knowledge to perform a huge performance test. Maybe some company that supports Spark (Databricks, Cloudera? - just saying, you're the most visible in the community :) ) could perform a performance test of:
>>> >>> >>>>
>>> >>> >>>> - the streaming engine - probably Spark will lose because of the micro-batch model; however, currently the difference should be much lower than in previous versions
>>> >>> >>>>
>>> >>> >>>> - Machine Learning models
>>> >>> >>>>
>>> >>> >>>> - batch jobs
>>> >>> >>>>
>>> >>> >>>> - graph jobs
>>> >>> >>>>
>>> >>> >>>> - SQL queries
>>> >>> >>>>
>>> >>> >>>> People will see that Spark is evolving and is also a modern framework, because after reading the posts mentioned above people may think "it is outdated, the future is in framework X".
>>> >>> >>>>
>>> >>> >>>> Matei Zaharia posted an excellent blog post about how Spark Structured Streaming beats every other framework in terms of ease of use and reliability.
>>> >>> >>>> Performance tests, done in various environments (for example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node cluster), could also be very good marketing material to say "hey, you're telling us that you're better, but Spark is still faster and is still getting even faster!". This would be based on facts (just numbers), not opinions. It would be good for companies, for marketing purposes, and for every Spark developer.
>>> >>> >>>>
>>> >>> >>>> Second: real-time streaming. I've written some time ago about real-time streaming support in Spark Structured Streaming. Some work should be done to make SSS more low-latency, but I think it's possible. Maybe Spark could look at Gearpump, which is also built on top of Akka? I don't know yet; it is a good topic for a SIP. However, I think that Spark should have real-time streaming support. Currently I see many posts/comments saying that "Spark has too big a latency". Spark Streaming is doing a very good job with micro-batches; however, I think it is possible to also add more real-time processing.
>>> >>> >>>>
>>> >>> >>>> Other people have said much more, and I agree with the SIP proposal. I'm also happy that the PMC members are not saying that they will not listen to users, but that they really want to make Spark better for every user.
>>> >>> >>>>
>>> >>> >>>> What do you think about these two topics?
>>> >>> >>>> I'm especially looking at Cody (who started this topic) and the PMC :)
>>> >>> >>>>
>>> >>> >>>> Pozdrawiam / Best regards,
>>> >>> >>>>
>>> >>> >>>> Tomasz