Re: Spark Improvement Proposals

Matei Zaharia Fri, 07 Oct 2016 10:04:26 -0700

I think people misunderstood my comment about trolls a bit -- I'm not saying to 
just dismiss what people say, but to focus on what improves the project instead 
of being upset that people criticize stuff. This stuff happens all the time to 
any project in a "hot" area, as Sean said. I don't think there's anyone that 
wants to stop adding features to streaming for example, or stop listening to 
users, etc, or who thinks the project is already perfect (I certainly spend 
much of my time looking at how to improve it).

Just to comment on a few things:

> On Oct 7, 2016, at 9:16 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> First off, thanks Cody for taking the time to put together these proposals - 
> I think it has kicked off some wonderful discussion.
> 
> I think dismissing people's complaints with Spark as largely trolls does us a 
> disservice, it’s important for us to recognize our own shortcomings - 
> otherwise we are blind to the weak spots where we need to improve and instead 
> focus on new features. Parts of the Python community seem to be actively 
> looking for alternatives, and I’d obviously like Spark to continue to be the 
> place where we come together and collaborate from different languages.
> 
> I’d be more than happy to do a review of the outstanding Python PRs (I’ve 
> been keeping on top of the new ones but largely haven’t looked at the older 
> ones) and if there is a committer (maybe Davies or Sean?) who would be able 
> to help out with merging them once they are ready that would be awesome. I’m 
> at PyData DC this weekend but I’ll also start going through some of the older 
> Python JIRAs and seeing if they are still relevant, already fixed, or 
> something we are unlikely to be interested in bringing into Spark.

It would be great to also hear why people are looking for other stuff at a high 
level -- are there just many small issues in Python, or are there some bigger 
things missing? For example, one thing I'd like to see is easy installation of 
PySpark using pip install pyspark. Another idea would be making startup and 
initialization fast and easy enough that people use Spark regularly on a single 
machine, as a replacement for multiprocessing.

> - Design.
> Yes, design by committee doesn't work.  The best designs are when a
> person who understands the problem builds something that works for
> them, shares with others, and most importantly iterates when it
> doesn't work for others.  This iteration only works if you're willing
> to change interfaces, but committer and user goals are not aligned
> here.  Users want something that is clearly documented and helps them
> get their job done.  Committers (not all) want to minimize interface
> change, even at the expense of users being able to do their jobs.  In
> this situation, it is critical that you understand early what users
> need to be able to do.  This is what the improvement proposal process
> should focus on: Goals, non-goals, possible solutions, rejected
> solutions.  Not class-level design.  Most importantly, it needs a
> clear, unambiguous outcome that is visible to the public.

Love the idea of a more visible "Spark Improvement Proposal" process that 
solicits user input on new APIs. For what it's worth, I don't think committers 
are trying to minimize their own work -- every committer cares about making the 
software useful for users. However, it is always hard to get user input and so 
it helps to have this kind of process. I've certainly looked at the *IPs a lot 
in other software I use just to see the biggest things on the roadmap.

When you're talking about "changing interfaces", are you talking about public 
or internal APIs? I do think many people hate changing public APIs and I 
actually think that's for the best of the project. That's a technical debate, 
but basically, the worst thing when you're using a piece of software is that 
the developers constantly ask you to rewrite your app to update to a new 
version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, 
or Guava. The "let's get everyone to change their code this release" model 
works well within a single large company, but doesn't work well for a 
community, which is why nearly all *very* widely used programming interfaces 
(I'm talking things like Java standard library, Windows API, etc) almost 
*never* break backwards compatibility. All this is done within reason though, 
e.g. we do change things in major releases (2.x, 3.x, etc).

> - Trolling
> It's not just trolling.  Event time and kafka are technically
> important and should not be ignored.  I've been banging this drum for
> years.  These concerns haven't been fully heard and understood by
> committers.  This is one example of why diversity of enfranchised users
> is important and governance concerns shouldn't be ignored.

I agree about empowering people interested here to contribute, but I'm 
wondering, do you think there are technical things that people don't want to 
work on, or is it a matter of what there's been time to do? Everyone I know 
does want great Kafka support, event time, etc, it's just a question of working 
out the details and of course of getting the coding done. This is also an area 
where I'd love to see more contributions -- in the past, people have done 
similar-scale contributions in other areas (e.g. better integration with Hive, 
on-the-wire encryption, etc).

FWIW, I think there are three things going on with streaming.

1) Structured Streaming, which is meant to provide a much higher-level new API. 
This was meant from the beginning to include event time, various complex forms 
of windows, and great data source and sink support in a unified framework. It's 
also, IMHO, much simpler than most existing APIs for this stuff (i.e. look at 
the number of concepts you have to learn for those versus for this). However, 
this project is still very early on -- only the bare minimum API came out in 
2.0. It's marked as alpha and it's precisely the type of system where I'd 
expect the API to improve in response to feedback. As with other APIs, such as 
Spark SQL's SchemaRDD and DataFrame, I think it's good to get it in front of 
*users* quickly and receive feedback -- even developers discussing among 
themselves can't anticipate all user needs.

2) Adding things in Spark Streaming. I haven't personally worked much on this 
lately, but it is a very reasonable thing that I'd love to see the project do 
to help current users. For example, consider adding an aggregate-by-event-time 
operator to Spark Streaming (it can be done using mapWithState), or a 
sessionization operator, etc.

3) Another thing that I think is possible is just lowering the latency of both 
Spark Streaming and Structured Streaming by 10x -- a few folks at Berkeley have 
been working on this 
(https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/). 
Happy to fork off a thread about how to do it. Their current system requires 
some new concepts in the Spark scheduler, but from measuring stuff it also 
seems that you can get somewhere with less intensive changes (most of the 
overhead is in RPCs, not in the scheduling logic or task execution).
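
To make the operators mentioned in point 2 concrete, here is a rough sketch of aggregate-by-event-time and sessionization in plain Python. It does not use the Spark API; it only illustrates the keyed-state idea that something like mapWithState would carry between batches. The function names, the record layout (key, event_time, value), and the integer timestamps are all assumptions made up for illustration.

```python
from collections import defaultdict


def window_start(event_time, window_size):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % window_size)


def aggregate_by_event_time(state, batch, window_size):
    """Fold one micro-batch of (key, event_time, value) records into state.

    state maps (key, window_start) -> running sum and is carried across
    batches, so a late or out-of-order record still lands in the window
    its event time belongs to, not the batch it happened to arrive in.
    """
    for key, event_time, value in batch:
        state[(key, window_start(event_time, window_size))] += value
    return state


def sessionize(event_times, gap):
    """Group event times into sessions; a new session starts whenever the
    distance to the previous event exceeds gap."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


state = defaultdict(int)
# The second batch contains a late record (event time 3) for the first window.
aggregate_by_event_time(state, [("a", 1, 10), ("a", 12, 5)], window_size=10)
aggregate_by_event_time(state, [("a", 14, 1), ("a", 3, 7)], window_size=10)
totals = dict(state)

sessions = sessionize([1, 2, 30, 3, 31], gap=5)
```

A real operator would also need some rule for when a window's state can be dropped (otherwise state grows forever); that is left out here on purpose.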

> - Jira
> Concretely, automate closing stale jiras after X amount of time.  It's
> really surprising to me how much reluctance a community of programmers
> have shown towards automating their own processes around stuff like
> this (not to mention automatic code formatting of modified files).  I
> understand the arguments against, but the current alternative doesn't
> work.
> Concretely, clearly reject and close jiras.  I have a backlog of 50+
> kafka jiras, many of which are irrelevant at this point, but I do not
> feel that I have the political power to close them.
> Concretely, make it clear who is working on something.  This can be as
> simple as just "I'm working on this", assign it to me, if I don't
> follow up in X amount of time, close it or reassign.  That doesn't
> mean there can't be competing work, but it does mean those people
> should talk to each other.  Conversely, if committers currently don't
> have time to work on something that is important, make that clear in
> the ticket.

Definitely agree with marking who's working on something early on, and timing 
it out if inactive. For closing JIRAs, I think the best way I've seen is for 
people to go through them once in a while. Automated closing is too impersonal 
IMO -- if I opened a JIRA on a project, nobody looked at it, and it was then 
closed automatically, I'd actively feel ignored. If you do that, you'll see people on 
stage saying "I reported a bug for Spark and some bot just closed it after 3 
months", which is not ideal.
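
As a rough illustration of "timing it out if inactive" without auto-closing anything, a script could just produce a list for a person to go through. This is only a sketch in plain Python: the Issue record, the EXAMPLE-n keys, and the 90-day threshold are made up for illustration, and it does not call any real JIRA API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Issue:
    key: str                  # hypothetical ticket id, not a real JIRA key
    assignee: Optional[str]   # None means nobody has claimed the ticket
    last_updated: datetime


def timed_out_assignments(issues: List[Issue], now: datetime,
                          max_idle: timedelta) -> List[Issue]:
    """Return claimed issues with no activity for longer than max_idle.

    Nothing is closed or reassigned here; the result is meant to be
    reviewed by a person who then pings the assignee or frees the ticket.
    """
    return [i for i in issues
            if i.assignee is not None and now - i.last_updated > max_idle]


now = datetime(2016, 10, 7)
issues = [
    Issue("EXAMPLE-1", "alice", datetime(2016, 10, 1)),  # recently active
    Issue("EXAMPLE-2", "bob", datetime(2016, 6, 1)),     # claimed, then idle
    Issue("EXAMPLE-3", None, datetime(2016, 1, 1)),      # never claimed
]
for_review = timed_out_assignments(issues, now, max_idle=timedelta(days=90))
review_keys = [i.key for i in for_review]
```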

Matei

> 
> 
> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <so...@cloudera.com> wrote:
> > Suggested actions way at the bottom.
> >
> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <matei.zaha...@gmail.com>
> > wrote:
> >>
> >> since March. But it's true that other things such as the Kafka source for
> >> it didn't have as much design on JIRA. Nonetheless, this component is still
> >> early on and there's still a lot of time to change it, which is happening.
> >
> >
> > It's hard to drive design discussions in OSS. Even when diligently
> > publishing design docs, the doc happens after brainstorming, and that
> > happens inside someone's head or in chats.
> >
> > The lazy consensus model that works for small changes doesn't work well
> > here. If a committer wants a change, that change will basically be made
> > modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get
> > nothing done.) However this model means it's hard to significantly change a
> > design after draft 1.
> >
> > I've heard this complaint a few times, and it has never been down to bad
> > faith. We should err further towards over-including early and often. I've
> > seen some great discussions start more with a problem statement and an RFC,
> > not a design doc. Keeping regular contributors enfranchised is essential, so
> > that they're willing and able to participate when design time comes. (See
> > below.)
> >
> >
> >>
> >> 2) About what people say at Reactive Summit -- there will always be
> >> trolls, but just ignore them and build a great project. Those of us involved
> >> in the project for a while have long seen similar stuff, e.g. a
> >
> >
> > The hype cycle may be turning against Spark, as is normal for this stage of
> > maturity. People idealize technologies they don't really use as greener
> > grass; it's the things they use and need to work that they love to hate.
> >
> > I would not dismiss this as just trolling. Customer anecdotes I see suggest
> > that Spark underperforms their (inflated) expectations, and generally does
> > not Just Work. It takes expertise, tuning, patience, workarounds. And then
> > it gets great things done. I do see a gap between how the group here talks
> > about the technology, and how the users I see talk about it. The gap
> > manifests in attention given to making yet more things, and attention given
> > to fixing and project mechanics.
> >
> > I would also not dismiss criticism of governance. We can recognize some big
> > problems that were resolved over even the past 3 months. Usually I hear,
> > well, we do better than most projects, right? and that is true. But, Spark
> > is bigger and busier than most any other project. Exceptional projects need
> > exceptional governance and we have merely "good". See next.
> >
> >
> >> 3) About number and diversity of committers -- the PMC is always working
> >> to expand these, and you should email people on the PMC (or even the whole
> >> list) if you have people you'd like to propose. In
> >
> >
> > If you're suggesting that it's mostly a matter of asking, then this doesn't
> > match my experience. I have seen a few people consistently soft-reject most
> > proposals. The reasons given usually sound like "concerns about quality",
> > which is probably the right answer to a somewhat wrong question.
> >
> > We should probably be asking primarily who will net-net add efficiency to
> > some part of the project's mechanics. Per above, it wouldn't hurt to ask who
> > would expand coverage and add diversity of perspective too.
> >
> > I disagree that committers are being added at a sufficient rate. The overall
> > committer-attention hours is dropping as the project grows -- am I the only
> > one that perceives many regular committers aren't working nearly as much as
> > before on the project?
> >
> > I call it a problem because we have IMHO people who 'qualify', and not
> > giving them some stake is going to cost the project down the road. Always Be
> > Recruiting. This is what I would worry about, since the governance and
> > enfranchisement issues above kind of stem from this.
> >
> >
> >>
> >> 4) Finally, about better organizing JIRA, marking dead issues, etc, this
> >> would be great and I think we just need a concrete proposal for how to do
> >> it. It would be best to point to an existing process that someone else has
> >> used here BTW so that we can see it in action.
> >
> >
> > I don't think we're wanting for proposals. I went on and on about it last
> > year, and don't think anyone disagreed about actions. I wouldn't suggest
> > that clearing out dead issues is more complex than just putting in time to
> > do it. It's just grunt work and understandably not appealing. (Thank you
> > Xiao for your recent run at SQL JIRAs.)
> >
> > It requires saying 'no', which is hard, because it requires some conviction.
> > I have encountered reluctance to do this in Spark and think that culture
> > should change. Is it weird to say that a broader group of gatekeepers can
> > actually with more confidence and efficiency tackle the triage issue? that
> > pushing back on 'bad' contribution actually increases the rate of 'good'?
> >
> > FWIW I also find the project unpleasant to deal with day to day, mostly
> > because of the scale of the triage, and think we could use all the qualified
> > help we can get. I am looking to do less with the project over time, which
> > is no big deal in itself, but is a big deal if these several factors are
> > adding up to discourage fresh blood from joining the fray. Cody makes me
> > think there are, at least, 2 of us.
> >
> > Concrete steps?
> >
> > Go to http://spark-prs.com/ and look at "Users". Look at your open PRs.
> > Are any stale? can you close them or advance them?
> >
> > Look at the Stale PRs tab and sort by last updated. Do any look dead? can
> > you ask the author to update or close? does the parent JIRA look like it's
> > not otherwise relevant?
> >
> > Go download JIRA Client at http://almworks.com/jiraclient/download.html
> > Go
> > look at all open JIRAs sorted by last update. Are any pretty obviously
> > obsolete?
> >
> > If you don't feel comfortable acting, feel free to at least propose a list
> > to dev@ for a look.
> 
