Most things looked OK to me too, although I do plan to take a closer look
after Nov 1st when we cut the release branch for 2.1.


On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> The proposal looks OK to me. I assume, even though it's not explicitly
> called out, that voting would happen by e-mail? A template for the
> proposal document (instead of just a bullet list) would also be nice,
> but that can be done at any time.
>
> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
> for a SIP, given the scope of the work. The document attached even
> somewhat matches the proposed format. So if anyone wants to try out
> the process...
>
> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
> > Now that spark summit europe is over, are any committers interested in
> > moving forward with this?
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >
> > Or are we going to let this discussion die on the vine?
> >
> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> > <tomasz.gaw...@outlook.com> wrote:
> >> Maybe my mail was not clear enough.
> >>
> >>
> >> I didn't want to write "let's focus on Flink" or on any other
> >> framework. The idea with benchmarks was to show two things:
> >>
> >> - why some people are doing bad PR for Spark
> >>
> >> - how, in an easy way, we can change it and show that Spark is still on
> >> top
> >>
> >>
> >> No more, no less. Benchmarks will be helpful, but I don't think they're
> >> the most important thing in Spark :) On the Spark main page there is
> >> still the "Spark vs Hadoop" chart. It is important to show that the
> >> framework is not just the same Spark with another API, but much faster
> >> and more optimized, comparable to or even faster than other frameworks.
> >>
> >>
> >> About real-time streaming, I think it would simply be good to see it in
> >> Spark. I really like the current Spark model, but many voices are
> >> saying "we need more" - the community should also listen to them and
> >> try to help them. With SIPs it would be easier; I just posted this
> >> example as a thing that might be changed with a SIP.
> >>
> >>
> >> I really like the unification via Datasets, but there are a lot of
> >> algorithms inside - let's make an easy API, but with a strong
> >> background (articles, benchmarks, descriptions, etc.) that shows that
> >> Spark is still a modern framework.
> >>
> >>
> >> Maybe now my intention is clearer :) As I said, organizational ideas
> >> were already mentioned and I agree with them; my mail was just to show
> >> some aspects from my side, the side of a developer and a person who is
> >> trying to help others with Spark (via StackOverflow or other ways).
> >>
> >>
> >> Pozdrawiam / Best regards,
> >>
> >> Tomasz
> >>
> >>
> >> ________________________________
> >> From: Cody Koeninger <c...@koeninger.org>
> >> Sent: October 17, 2016 16:46
> >> To: Debasish Das
> >> CC: Tomasz Gawęda; dev@spark.apache.org
> >> Subject: Re: Spark Improvement Proposals
> >>
> >> I think narrowly focusing on Flink or benchmarks is missing my point.
> >>
> >> My point is evolve or die.  Spark's governance and organization is
> >> hampering its ability to evolve technologically, and it needs to
> >> change.
> >>
> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.da...@gmail.com>
> >> wrote:
> >>> Thanks Cody for bringing up a valid point...I picked up Spark in 2014
> >>> as soon as I looked into it since, compared to writing Java map-reduce
> >>> and Cascading code, Spark made writing distributed code fun...But now,
> >>> as we have gone deeper with Spark and the real-time streaming use case
> >>> gets more prominent, I think it is time to bring a messaging model in
> >>> conjunction with the batch/micro-batch API that Spark is good
> >>> at....Close integration of akka-streams with Spark's micro-batching
> >>> APIs looks like a great direction to stay in the game with Apache
> >>> Flink...Spark 2.0 integrated streaming with batch under the assumption
> >>> that micro-batching is sufficient to run SQL commands on a stream, but
> >>> do we really have time to do SQL processing on streaming data within
> >>> 1-2 seconds?
> >>>
> >>> After reading the email chain, I started to look into the Flink
> >>> documentation, and if you compare it with the Spark documentation, I
> >>> think we have major work to do detailing Spark internals, so that more
> >>> people from the community start to take an active role in improving
> >>> the issues and Spark stays strong compared to Flink.
> >>>
> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>>
> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>>
> >>> Spark is no longer an engine that works only for micro-batch and
> >>> batch...We (and I am sure many others) are pushing Spark as an engine
> >>> for stream and query processing...We need to make it a
> >>> state-of-the-art engine for high-speed streaming data and user queries
> >>> as well!
> >>>
> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com>
> >>> wrote:
> >>>>
> >>>> Hi everyone,
> >>>>
> >>>> I'm quite late with my answer, but I think my suggestions may help a
> >>>> little bit. :) Many technical and organizational topics were
> >>>> mentioned, but I want to focus on the negative posts about Spark and
> >>>> about "haters".
> >>>>
> >>>> I really like Spark. Ease of use, speed, a very good community - it's
> >>>> all here. But every project has to "fight" on the "framework market"
> >>>> to stay No. 1. I'm following many Spark and Big Data communities;
> >>>> maybe my mail will inspire someone :)
> >>>>
> >>>> You (every Spark developer; so far I haven't had enough time to
> >>>> start contributing to Spark) have done an excellent job. So why are
> >>>> some people saying that Flink (or another framework) is better, as
> >>>> was posted on this mailing list? No, not because that framework is
> >>>> better in all cases. In my opinion, many of these discussions were
> >>>> started after Flink's marketing-like posts. Please look at the
> >>>> StackOverflow "Flink vs ..." posts: almost every one is "won" by
> >>>> Flink. The answers sometimes say nothing about other frameworks;
> >>>> Flink's users (often PMC members) just post the same information
> >>>> about real-time streaming, about delta iterations, etc. It looks
> >>>> smart, and very often it is marked as the answer, even if - in my
> >>>> opinion - the whole truth wasn't told.
> >>>>
> >>>>
> >>>> My suggestion: I don't have enough money and knowledge to perform a
> >>>> huge performance test. Maybe some company that supports Spark
> >>>> (Databricks, Cloudera? - just saying, you're the most visible in the
> >>>> community :) ) could perform a performance test of:
> >>>>
> >>>> - the streaming engine - probably Spark will lose because of the
> >>>> micro-batch model, however currently the difference should be much
> >>>> lower than in previous versions
> >>>>
> >>>> - Machine Learning models
> >>>>
> >>>> - batch jobs
> >>>>
> >>>> - Graph jobs
> >>>>
> >>>> - SQL queries
> >>>>
> >>>> People will see that Spark is evolving and is also a modern
> >>>> framework, because after reading the posts mentioned above people may
> >>>> think "it is outdated, the future is in framework X".
> >>>>
> >>>> Matei Zaharia posted an excellent blog post about how Spark
> >>>> Structured Streaming beats every other framework in terms of ease of
> >>>> use and reliability. Performance tests, done in various environments
> >>>> (for example: a laptop, a small 2-node cluster, a 10-node cluster, a
> >>>> 20-node cluster), could also be very good marketing material to say
> >>>> "hey, you're telling us you're better, but Spark is still faster and
> >>>> is still getting even faster!". This would be based on facts (just
> >>>> numbers), not opinions. It would be good for companies, for marketing
> >>>> purposes, and for every Spark developer.
> >>>>
> >>>>
> >>>> Second: real-time streaming. I wrote some time ago about real-time
> >>>> streaming support in Spark Structured Streaming. Some work should be
> >>>> done to make SSS lower-latency, but I think it's possible. Maybe
> >>>> Spark could look at Gearpump, which is also built on top of Akka? I
> >>>> don't know yet; it is a good topic for a SIP. However, I think that
> >>>> Spark should have real-time streaming support. Currently I see many
> >>>> posts/comments saying "Spark has too high latency". Spark Streaming
> >>>> does a very good job with micro-batches, however I think it is
> >>>> possible to also add more real-time processing.
> >>>>
> >>>> Other people have said much more, and I agree with the SIP proposal.
> >>>> I'm also happy that the PMC members are not saying that they will not
> >>>> listen to users, but that they really want to make Spark better for
> >>>> every user.
> >>>>
> >>>>
> >>>> What do you think about these two topics? I'm especially looking at
> >>>> Cody (who started this thread) and the PMC members :)
> >>>>
> >>>> Pozdrawiam / Best regards,
> >>>>
> >>>> Tomasz
> >>>>
> >>>>
> >>>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
> >>>> > I love Spark.  3 or 4 years ago it was the first distributed
> >>>> > computing environment that felt usable, and the community was
> >>>> > welcoming.
> >>>> >
> >>>> > But I just got back from the Reactive Summit, and this is what I
> >>>> > observed:
> >>>> >
> >>>> > - Industry leaders on stage making fun of Spark's streaming model
> >>>> > - Open source project leaders saying they looked at Spark's
> >>>> > governance as a model to avoid
> >>>> > - Users saying they chose Flink because it was technically superior
> >>>> > and they couldn't get any answers on the Spark mailing lists
> >>>> >
> >>>> > Whether you agree with the substance of any of this, when this stuff
> >>>> > gets repeated enough people will believe it.
> >>>> >
> >>>> > Right now Spark is suffering from its own success, and I think
> >>>> > something needs to change.
> >>>> >
> >>>> > - We need a clear process for planning significant changes to the
> >>>> > codebase.
> >>>> > I'm not saying you need to adopt Kafka Improvement Proposals
> >>>> > exactly, but you need a documented process with a clear outcome
> >>>> > (e.g. a vote).
> >>>> > Passing around google docs after an implementation has largely been
> >>>> > decided on doesn't cut it.
> >>>> >
> >>>> > - All technical communication needs to be public.
> >>>> > Things getting decided in private chat, or when 1/3 of the
> >>>> > committers work for the same company and can just talk to each
> >>>> > other...
> >>>> > Yes, it's convenient, but it's ultimately detrimental to the
> >>>> > health of the project.
> >>>> > The way structured streaming has played out has shown that there are
> >>>> > significant technical blind spots (myself included).
> >>>> > One way to address that is to get the people who have domain
> >>>> > knowledge involved, and listen to them.
> >>>> >
> >>>> > - We need more committers, and more committer diversity.
> >>>> > Per committer there are, what, more than 20 contributors and 10 new
> >>>> > jira tickets a month?  It's too much.
> >>>> > There are people (I am _not_ referring to myself) who have been
> >>>> > around for years, contributed thousands of lines of code, helped
> >>>> > educate the public around Spark... and yet are never going to be
> >>>> > voted in.
> >>>> >
> >>>> > - We need a clear process for managing volunteer work.
> >>>> > Too many tickets sit around unowned, unclosed, uncertain.
> >>>> > If someone proposed something and it isn't up to snuff, tell them
> >>>> > and close it.  It may be blunt, but it's clearer than a "silent no".
> >>>> > If someone wants to work on something, let them own the ticket and
> >>>> > set a deadline. If they don't meet it, close it or reassign it.
> >>>> >
> >>>> > This is not me putting on an Apache Bureaucracy hat.  This is me
> >>>> > saying, as a fellow hacker and loyal dissenter, something is wrong
> >>>> > with the culture and process.
> >>>> >
> >>>> > Please, let's change it.
> >>>> >
> >>>> > ---------------------------------------------------------------------
> >>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>> >
> >>>
> >>>
> >>
> >>
> >
> >
>
>
>
> --
> Marcelo
>
>
>
