Re: Spark Improvement Proposals

Nicholas Chammas Fri, 07 Oct 2016 10:33:43 -0700

There are several important discussions happening simultaneously. Should we
perhaps split them up into separate threads? Otherwise it’s really
difficult to follow.


It seems like the discussion about having a more formal “Spark Improvement
Proposal” process should take priority here.

Other discussions that could be fleshed out in separate threads are:

   - Better managing “organic” community contributions (e.g. PRs, JIRA
   issues, etc.).
   - Adjusting Spark’s governance model / adding more committers.
   - Discussing / addressing competition to Spark coming out of the Python
   community.

Nick

On Fri, Oct 7, 2016 at 1:04 PM Matei Zaharia <[email protected]> wrote:

> I think people misunderstood my comment about trolls a bit -- I'm not
> saying to just dismiss what people say, but to focus on what improves the
> project instead of being upset that people criticize stuff. This stuff
> happens all the time to any project in a "hot" area, as Sean said. I don't
> think there's anyone that wants to stop adding features to streaming for
> example, or stop listening to users, etc, or who thinks the project is
> already perfect (I certainly spend much of my time looking at how to
> improve it).
>
> Just to comment on a few things:
>
> On Oct 7, 2016, at 9:16 AM, Holden Karau <[email protected]> wrote:
>
> First off, thanks Cody for taking the time to put together these proposals
> - I think it has kicked off some wonderful discussion.
>
> I think dismissing people's complaints with Spark as largely trolls does
> us a disservice. It’s important for us to recognize our own shortcomings -
> otherwise we are blind to the weak spots where we need to improve and
> instead focus on new features. Parts of the Python community seem to be
> actively looking for alternatives, and I’d obviously like Spark to continue
> to be the place where we come together and collaborate from different
> languages.
>
> I’d be more than happy to do a review of the outstanding Python PRs (I’ve
> been keeping on top of the new ones but largely haven’t looked at the older
> ones) and if there is a committer (maybe Davies or Sean?) who would be able
> to help out with merging them once they are ready that would be awesome.
> I’m at PyData DC this weekend but I’ll also start going through some of the
> older Python JIRAs and seeing if they are still relevant, already fixed, or
> something we are unlikely to be interested in bringing into Spark.
>
>
> It would be great to also hear why people are looking for other stuff at a
> high level -- are there just many small issues in Python, or are there some
> bigger things missing? For example, one thing I'd like to see is easy
> installation of PySpark using pip install pyspark. Another idea would be
> making startup time and initialization easy enough that people use Spark
> regularly on a single machine, as a replacement for multiprocessing.
>
> - Design.
> Yes, design by committee doesn't work.  The best designs are when a
> person who understands the problem builds something that works for
> them, shares with others, and most importantly iterates when it
> doesn't work for others.  This iteration only works if you're willing
> to change interfaces, but committer and user goals are not aligned
> here.  Users want something that is clearly documented and helps them
> get their job done.  Committers (not all) want to minimize interface
> change, even at the expense of users being able to do their jobs.  In
> this situation, it is critical that you understand early what users
> need to be able to do.  This is what the improvement proposal process
> should focus on: Goals, non-goals, possible solutions, rejected
> solutions.  Not class-level design.  Most importantly, it needs a
> clear, unambiguous outcome that is visible to the public.
>
>
> Love the idea of a more visible "Spark Improvement Proposal" process that
> solicits user input on new APIs. For what it's worth, I don't think
> committers are trying to minimize their own work -- every committer cares
> about making the software useful for users. However, it is always hard to
> get user input and so it helps to have this kind of process. I've certainly
> looked at the *IPs a lot in other software I use just to see the biggest
> things on the roadmap.
>
> When you're talking about "changing interfaces", are you talking about
> public or internal APIs? I do think many people hate changing public APIs
> and I actually think that's for the best of the project. That's a technical
> debate, but basically, the worst thing when you're using a piece of
> software is that the developers constantly ask you to rewrite your app to
> update to a new version (and thus benefit from bug fixes, etc). Cue anyone
> who's used Protobuf, or Guava. The "let's get everyone to change their code
> this release" model works well within a single large company, but doesn't
> work well for a community, which is why nearly all *very* widely used
> programming interfaces (I'm talking things like Java standard library,
> Windows API, etc) almost *never* break backwards compatibility. All this is
> done within reason though, e.g. we do change things in major releases (2.x,
> 3.x, etc).
>
> - Trolling
> It's not just trolling.  Event time and Kafka are technically
> important and should not be ignored.  I've been banging this drum for
> years.  These concerns haven't been fully heard and understood by
> committers.  This is one example of why diversity of enfranchised users
> is important and governance concerns shouldn't be ignored.
>
>
> I agree about empowering people interested here to contribute, but I'm
> wondering, do you think there are technical things that people don't want
> to work on, or is it a matter of what there's been time to do? Everyone I
> know does want great Kafka support, event time, etc, it's just a question
> of working out the details and of course of getting the coding done. This
> is also an area where I'd love to see more contributions -- in the past,
> people have done similar-scale contributions in other areas (e.g. better
> integration with Hive, on-the-wire encryption, etc).
>
> FWIW, I think there are three things going on with streaming.
>
> 1) Structured Streaming, which is meant to provide a much higher-level new
> API. This was meant from the beginning to include event time, various
> complex forms of windows, and great data source and sink support in a
> unified framework. It's also, IMHO, much simpler than most existing APIs
> for this stuff (i.e. look at the number of concepts you have to learn for
> those versus for this). However, this project is still very early on --
> only the bare minimum API came out in 2.0. It's marked as alpha and it's
> precisely the type of system where I'd expect the API to improve in
> response to feedback. As with other APIs, such as Spark SQL's SchemaRDD and
> DataFrame, I think it's good to get it in front of *users* quickly and
> receive feedback -- even developers discussing among themselves can't
> anticipate all user needs.
>
> 2) Adding things in Spark Streaming. I haven't personally worked much on
> this lately, but it is a very reasonable thing that I'd love to see the
> project do to help current users. For example, consider adding an
> aggregate-by-event-time operator to Spark Streaming (it can be done using
> mapWithState), or a sessionization operator, etc.
>
> 3) Another thing that I think is possible is just lowering the latency of
> both Spark Streaming and Structured Streaming by 10x -- a few folks at
> Berkeley have been working on this (
> https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/).
> Happy to fork off a thread about how to do it. Their current system
> requires some new concepts in the Spark scheduler, but from measuring stuff
> it also seems that you can get somewhere with less intensive changes (most
> of the overhead is in RPCs, not in the scheduling logic or task execution).
>
> - Jira
> Concretely, automate closing stale JIRAs after X amount of time.  It's
> really surprising to me how much reluctance a community of programmers
> has shown towards automating their own processes around stuff like
> this (not to mention automatic code formatting of modified files).  I
> understand the arguments against, but the current alternative doesn't
> work.
> Concretely, clearly reject and close JIRAs.  I have a backlog of 50+
> Kafka JIRAs, many of which are irrelevant at this point, but I do not
> feel that I have the political power to close them.
> Concretely, make it clear who is working on something.  This can be as
> simple as just "I'm working on this", assign it to me, if I don't
> follow up in X amount of time, close it or reassign.  That doesn't
> mean there can't be competing work, but it does mean those people
> should talk to each other.  Conversely, if committers currently don't
> have time to work on something that is important, make that clear in
> the ticket.
>
>
> Definitely agree with marking who's working on something early on, and
> timing it out if inactive. For closing JIRAs, I think the best way I've
> seen is for people to go through them once in a while. Automated closing is
> too impersonal IMO -- if I opened a JIRA on a project and nobody looked at
> it and that happened to me, I'd actively feel ignored. If you do that,
> you'll see people on stage saying "I reported a bug for Spark and some bot
> just closed it after 3 months", which is not ideal.
>
> Matei
>
>
>
>
> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <[email protected]> wrote:
> > Suggested actions way at the bottom.
> >
> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <[email protected]>
> > wrote:
> >>
> >> since March. But it's true that other things such as the Kafka source for
> >> it didn't have as much design on JIRA. Nonetheless, this component is still
> >> early on and there's still a lot of time to change it, which is happening.
> >
> >
> > It's hard to drive design discussions in OSS. Even when diligently
> > publishing design docs, the doc happens after brainstorming, and that
> > happens inside someone's head or in chats.
> >
> > The lazy consensus model that works for small changes doesn't work well
> > here. If a committer wants a change, that change will basically be made
> > modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> > nothing done.) However this model means it's hard to significantly change a
> > design after draft 1.
> >
> > I've heard this complaint a few times, and it has never been down to bad
> > faith. We should err further towards over-including early and often. I've
> > seen some great discussions start more with a problem statement and an RFC,
> > not a design doc. Keeping regular contributors enfranchised is essential, so
> > that they're willing and able to participate when design time comes. (See
> > below.)
> >
> >
> >>
> >> 2) About what people say at Reactive Summit -- there will always be
> >> trolls, but just ignore them and build a great project. Those of us involved
> >> in the project for a while have long seen similar stuff, e.g. a
> >
> >
> > The hype cycle may be turning against Spark, as is normal for this stage of
> > maturity. People idealize technologies they don't really use as greener
> > grass; it's the things they use and need to work that they love to hate.
> >
> > I would not dismiss this as just trolling. Customer anecdotes I see suggest
> > that Spark underperforms their (inflated) expectations, and generally does
> > not Just Work. It takes expertise, tuning, patience, workarounds. And then
> > it gets great things done. I do see a gap between how the group here talks
> > about the technology, and how the users I see talk about it. The gap
> > manifests in attention given to making yet more things, and attention given
> > to fixing and project mechanics.
> >
> > I would also not dismiss criticism of governance. We can recognize some big
> > problems that were resolved over even the past 3 months. Usually I hear,
> > well, we do better than most projects, right? and that is true. But, Spark
> > is bigger and busier than most any other project. Exceptional projects need
> > exceptional governance and we have merely "good". See next.
> >
> >
> >> 3) About number and diversity of committers -- the PMC is always working
> >> to expand these, and you should email people on the PMC (or even the whole
> >> list) if you have people you'd like to propose. In
> >
> >
> > If you're suggesting that it's mostly a matter of asking, then this doesn't
> > match my experience. I have seen a few people consistently soft-reject most
> > proposals. The reasons given usually sound like "concerns about quality",
> > which is probably the right answer to a somewhat wrong question.
> >
> > We should probably be asking primarily who will net-net add efficiency to
> > some part of the project's mechanics. Per above, it wouldn't hurt to ask who
> > would expand coverage and add diversity of perspective too.
> >
> > I disagree that committers are being added at a sufficient rate. The overall
> > committer-attention hours is dropping as the project grows -- am I the only
> > one that perceives many regular committers aren't working nearly as much as
> > before on the project?
> >
> > I call it a problem because we have IMHO people who 'qualify', and not
> > giving them some stake is going to cost the project down the road. Always Be
> > Recruiting. This is what I would worry about, since the governance and
> > enfranchisement issues above kind of stem from this.
> >
> >
> >>
> >> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
> >> would be great and I think we just need a concrete proposal for how to do
> >> it. It would be best to point to an existing process that someone else has
> >> used here BTW so that we can see it in action.
> >
> >
> > I don't think we're wanting for proposals. I went on and on about it last
> > year, and don't think anyone disagreed about actions. I wouldn't suggest
> > that clearing out dead issues is more complex than just putting in time to
> > do it. It's just grunt work and understandably not appealing. (Thank you
> > Xiao for your recent run at SQL JIRAs.)
> >
> > It requires saying 'no', which is hard, because it requires some conviction.
> > I have encountered reluctance to do this in Spark and think that culture
> > should change. Is it weird to say that a broader group of gatekeepers can
> > actually with more confidence and efficiency tackle the triage issue? that
> > pushing back on 'bad' contribution actually increases the rate of 'good'?
> >
> > FWIW I also find the project unpleasant to deal with day to day, mostly
> > because of the scale of the triage, and think we could use all the qualified
> > help we can get. I am looking to do less with the project over time, which
> > is no big deal in itself, but is a big deal if these several factors are
> > adding up to discourage fresh blood from joining the fray. Cody makes me
> > think there are, at least, 2 of us.
> >
> > Concrete steps?
> >
> > Go to spark-prs.com. Look at "Users". Look at your open PRs. Are any stale?
> > can you close them or advance them?
> >
> > Look at the Stale PRs tab and sort by last updated. Do any look dead? can
> > you ask the author to update or close? does the parent JIRA look like it's
> > not otherwise relevant?
> >
> > Go download JIRA Client at http://almworks.com/jiraclient/download.html
> > Go look at all open JIRAs sorted by last update. Are any pretty obviously
> > obsolete?
> >
> > If you don't feel comfortable acting, feel free to at least propose a list
> > to dev@ for a look.
>
