Thanks for picking up on this.

Maybe I'm failing at Google Docs, but I can't see any edits on the
document you linked.

Regarding lazy consensus, if the board in general has less of an issue
with that, sure.  As long as it is clearly announced, lasts at least
72 hours, and has a clear outcome.

The other points are hard to comment on without being able to see the
text in question.


On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <r...@databricks.com> wrote:
> I just looked through the entire thread again tonight - there are a lot of
> great ideas being discussed. Thanks Cody for taking the first crack at the
> proposal.
>
> I want to first comment on the context. Spark is one of the most innovative
> and important projects in (big) data -- overall technical decisions made in
> Apache Spark are sound. But of course, a project as large and active as
> Spark always has room for improvement, and we as a community should strive
> to take it to the next level.
>
> To that end, the two biggest areas for improvements in my opinion are:
>
> 1. Visibility: There is so much happening that it is difficult to know what
> is really going on. For people who don't follow closely, it is difficult to
> know what the important initiatives are. Even for people who do follow, it
> is difficult to know what specific things require their attention, since the
> number of pull requests and JIRA tickets is high and it's difficult to
> extract signal from noise.
>
> 2. Solicit user (broadly defined, including developers themselves) input
> more proactively: At the end of the day the project provides value because
> users use it. Users can't tell us exactly what to build, but it is important
> to get their input.
>
>
> I've taken Cody's doc and edited it:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> (I've made all my modifications trackable)
>
> There are a couple of high-level changes I made:
>
> 1. I've consulted a board member and he recommended lazy consensus as
> opposed to voting, the reason being that in voting there can easily be a
> "loser" that gets outvoted.
>
> 2. I made it lighter weight, and renamed "strategy" to "optional design
> sketch". Echoing one of the earlier emails: "IMHO so far aside from tagging
> things and linking them elsewhere simply having design docs and prototypes
> implementations in PRs is not something that has not worked so far".
>
> 3. I made some language tweaks to focus more on visibility. For example,
> "The purpose of an SIP is to inform and involve", rather than just
> "involve". SIPs should also have at least two emails that go to dev@.
>
>
> While I was editing this, I thought we really needed a suggested template
> for design docs too. I will get to that ...
>
>
> On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>> Most things looked OK to me too, although I do plan to take a closer look
>> after Nov 1st when we cut the release branch for 2.1.
>>
>>
>> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>>
>>> The proposal looks OK to me. I assume, even though it's not explicitly
>>> called out, that voting would happen by e-mail? A template for the
>>> proposal document (instead of just a bullet list) would also be nice,
>>> but that can be done at any time.
>>>
>>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>> for a SIP, given the scope of the work. The document attached even
>>> somewhat matches the proposed format. So if anyone wants to try out
>>> the process...
>>>
>>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>> > Now that Spark Summit Europe is over, are any committers interested in
>>> > moving forward with this?
>>> >
>>> >
>>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >
>>> > Or are we going to let this discussion die on the vine?
>>> >
>>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>> > <tomasz.gaw...@outlook.com> wrote:
>>> >> Maybe my mail was not clear enough.
>>> >>
>>> >>
>>> >> I didn't want to write "let's focus on Flink" or any other framework.
>>> >> The idea with the benchmarks was to show two things:
>>> >>
>>> >> - why some people are doing bad PR for Spark
>>> >>
>>> >> - how, in an easy way, we can change that and show that Spark is still
>>> >> on top
>>> >>
>>> >>
>>> >> No more, no less. Benchmarks will be helpful, but I don't think they're
>>> >> the most important thing in Spark :) On the Spark main page there is
>>> >> still the "Spark vs Hadoop" chart. It is important to show that the
>>> >> framework is not just the same Spark with another API, but much faster
>>> >> and more optimized, comparable to or even faster than other frameworks.
>>> >>
>>> >>
>>> >> About real-time streaming, I think it would simply be good to see it in
>>> >> Spark. I really like the current Spark model, but there are many voices
>>> >> saying "we need more", and the community should also listen to them and
>>> >> try to help them. With SIPs this would be easier; I just posted this
>>> >> example as a "thing that may be changed with a SIP".
>>> >>
>>> >>
>>> >> I really like the unification via Datasets, but there are a lot of
>>> >> algorithms inside - let's keep the API easy, but with strong supporting
>>> >> material (articles, benchmarks, descriptions, etc.) that shows that
>>> >> Spark is still a modern framework.
>>> >>
>>> >>
>>> >> Maybe now my intention is clearer :) As I said, organizational ideas
>>> >> were already mentioned and I agree with them; my mail was just to show
>>> >> some aspects from my side, that is, from the side of a developer and a
>>> >> person who is trying to help others with Spark (via StackOverflow or
>>> >> other channels)
>>> >>
>>> >>
>>> >> Pozdrawiam / Best regards,
>>> >>
>>> >> Tomasz
>>> >>
>>> >>
>>> >> ________________________________
>>> >> From: Cody Koeninger <c...@koeninger.org>
>>> >> Sent: 17 October 2016 16:46
>>> >> To: Debasish Das
>>> >> Cc: Tomasz Gawęda; dev@spark.apache.org
>>> >> Subject: Re: Spark Improvement Proposals
>>> >>
>>> >> I think narrowly focusing on Flink or benchmarks is missing my point.
>>> >>
>>> >> My point is evolve or die.  Spark's governance and organization are
>>> >> hampering its ability to evolve technologically, and that needs to
>>> >> change.
>>> >>
>>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>> >> <debasish.da...@gmail.com>
>>> >> wrote:
>>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014
>>> >>> as soon as I looked into it, since compared to writing Java map-reduce
>>> >>> and Cascading code, Spark made writing distributed code fun...But now,
>>> >>> as we have gone deeper with Spark and real-time streaming use cases
>>> >>> get more prominent, I think it is time to bring a messaging model in
>>> >>> conjunction with the batch/micro-batch API that Spark is good
>>> >>> at....close integration of akka-streams with Spark's micro-batching
>>> >>> APIs looks like a great direction to stay in the game with Apache
>>> >>> Flink...Spark 2.0 integrated streaming with batch under the assumption
>>> >>> that micro-batching is sufficient to run SQL commands on streams, but
>>> >>> do we really have time to do SQL processing on streaming data within
>>> >>> 1-2 seconds?
>>> >>>
>>> >>> After reading the email chain, I started to look into the Flink
>>> >>> documentation, and if you compare it with the Spark documentation, I
>>> >>> think we have major work to do detailing Spark internals so that more
>>> >>> people from the community start to take an active role in improving
>>> >>> things, so that Spark stays strong compared to Flink.
>>> >>>
>>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>> >>>
>>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>> >>>
>>> >>> Spark is no longer an engine that works only for micro-batch and
>>> >>> batch...We (and I am sure many others) are pushing Spark as an engine
>>> >>> for stream and query processing.....we need to make it a
>>> >>> state-of-the-art engine for high-speed streaming data and user
>>> >>> queries as well!
>>> >>>
>>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>> >>> <tomasz.gaw...@outlook.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi everyone,
>>> >>>>
>>> >>>> I'm quite late with my answer, but I think my suggestions may help a
>>> >>>> little bit :) Many technical and organizational topics were
>>> >>>> mentioned, but I want to focus on the negative posts about Spark and
>>> >>>> about "haters".
>>> >>>>
>>> >>>> I really like Spark. Ease of use, speed, a very good community - it's
>>> >>>> all here. But every project has to fight on the "framework market" to
>>> >>>> stay number 1. I'm following many Spark and Big Data communities;
>>> >>>> maybe my mail will inspire someone :)
>>> >>>>
>>> >>>> You (every Spark developer; so far I didn't have enough time to start
>>> >>>> contributing to Spark) have done an excellent job. So why are some
>>> >>>> people saying that Flink (or another framework) is better, like it
>>> >>>> was posted on this mailing list? No, not because that framework is
>>> >>>> better in all cases. In my opinion, many of these discussions were
>>> >>>> started after Flink marketing-like posts. Please look at the
>>> >>>> StackOverflow "Flink vs ..." posts; almost every one is "won" by
>>> >>>> Flink. The answers sometimes say nothing about other frameworks;
>>> >>>> Flink's users (often PMCs) just post the same information about
>>> >>>> real-time streaming, delta iterations, etc. It looks smart and is
>>> >>>> very often marked as the accepted answer, even if - in my opinion -
>>> >>>> it doesn't tell the whole truth.
>>> >>>>
>>> >>>>
>>> >>>> My suggestion: I don't have enough money or knowledge to perform a
>>> >>>> huge performance test. Maybe some company that supports Spark
>>> >>>> (Databricks, Cloudera? - just saying, you're the most visible in the
>>> >>>> community :) ) could perform a performance test of:
>>> >>>>
>>> >>>> - the streaming engine - Spark will probably lose because of the
>>> >>>> mini-batch model, but currently the difference should be much lower
>>> >>>> than in previous versions
>>> >>>>
>>> >>>> - Machine Learning models
>>> >>>>
>>> >>>> - batch jobs
>>> >>>>
>>> >>>> - Graph jobs
>>> >>>>
>>> >>>> - SQL queries
>>> >>>>
>>> >>>> People will see that Spark is evolving and is still a modern
>>> >>>> framework, because after reading the posts mentioned above, people
>>> >>>> may think "it is outdated, the future is in framework X".
>>> >>>>
>>> >>>> Matei Zaharia posted an excellent blog post about how Spark
>>> >>>> Structured Streaming beats every other framework in terms of ease of
>>> >>>> use and reliability. Performance tests, done in various environments
>>> >>>> (for example: a laptop, a small 2-node cluster, a 10-node cluster, a
>>> >>>> 20-node cluster), could also be very good marketing material to say
>>> >>>> "hey, you're telling us that you're better, but Spark is still faster
>>> >>>> and is still getting even faster!". This would be based on facts
>>> >>>> (just numbers), not opinions. It would be good for companies, for
>>> >>>> marketing purposes, and for every Spark developer.
>>> >>>>
>>> >>>>
>>> >>>> Second: real-time streaming. I wrote some time ago about real-time
>>> >>>> streaming support in Spark Structured Streaming. Some work should be
>>> >>>> done to make SSS lower-latency, but I think it's possible. Maybe
>>> >>>> Spark could look at Gearpump, which is also built on top of Akka? I
>>> >>>> don't know yet; it is a good topic for a SIP. However, I think that
>>> >>>> Spark should have real-time streaming support. Currently I see many
>>> >>>> posts/comments saying "Spark's latency is too high". Spark Streaming
>>> >>>> is doing a very good job with micro-batches, but I think it is
>>> >>>> possible to also add more real-time processing.
>>> >>>>
>>> >>>> Other people have said much more, and I agree with the SIP proposal.
>>> >>>> I'm also happy that the PMCs are not saying that they won't listen to
>>> >>>> users; they really want to make Spark better for every user.
>>> >>>>
>>> >>>>
>>> >>>> What do you think about these two topics? I'm especially looking at
>>> >>>> Cody (who started this topic) and the PMCs :)
>>> >>>>
>>> >>>> Pozdrawiam / Best regards,
>>> >>>>
>>> >>>> Tomasz
>>> >>>>
>>> >>>>
>>>
>>
>
>
