There are several important discussions happening simultaneously. Should we perhaps split them up into separate threads? Otherwise it’s really difficult to follow.
It seems like the discussion about having a more formal “Spark Improvement Proposal” process should take priority here.

Other discussions that could be fleshed out in separate threads are:

- Better managing “organic” community contributions (i.e. PRs, JIRA issues, etc).
- Adjusting Spark’s governance model / adding more committers.
- Discussing / addressing competition to Spark coming out of the Python community.

Nick

On Fri, Oct 7, 2016 at 1:04 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:

> I think people misunderstood my comment about trolls a bit -- I'm not saying to just dismiss what people say, but to focus on what improves the project instead of being upset that people criticize stuff. This stuff happens all the time to any project in a "hot" area, as Sean said. I don't think there's anyone that wants to stop adding features to streaming for example, or stop listening to users, etc, or who thinks the project is already perfect (I certainly spend much of my time looking at how to improve it).
>
> Just to comment on a few things:
>
> On Oct 7, 2016, at 9:16 AM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> First off, thanks Cody for taking the time to put together these proposals - I think it has kicked off some wonderful discussion.
>
> I think dismissing people's complaints with Spark as largely trolls does us a disservice, it’s important for us to recognize our own shortcomings - otherwise we are blind to the weak spots where we need to improve and instead focus on new features. Parts of the Python community seem to be actively looking for alternatives, and I’d obviously like Spark to continue to be the place where we come together and collaborate from different languages.
>
> I’d be more than happy to do a review of the outstanding Python PRs (I’ve been keeping on top of the new ones but largely haven’t looked at the older ones) and if there is a committer (maybe Davies or Sean?)
> who would be able to help out with merging them once they are ready that would be awesome. I’m at PyData DC this weekend but I’ll also start going through some of the older Python JIRAs and seeing if they are still relevant, already fixed, or something we are unlikely to be interested in bringing into Spark.
>
> It would be great to also hear why people are looking for other stuff at a high level -- are there just many small issues in Python, or are there some bigger things missing? For example, one thing I'd like to see is easy installation of PySpark using pip install pyspark. Another idea would be making startup time and initialization easy enough that people use Spark regularly on a single machine, as a replacement for multiprocessing.
>
> - Design.
> Yes, design by committee doesn't work. The best designs are when a person who understands the problem builds something that works for them, shares with others, and most importantly iterates when it doesn't work for others. This iteration only works if you're willing to change interfaces, but committer and user goals are not aligned here. Users want something that is clearly documented and helps them get their job done. Committers (not all) want to minimize interface change, even at the expense of users being able to do their jobs. In this situation, it is critical that you understand early what users need to be able to do. This is what the improvement proposal process should focus on: Goals, non-goals, possible solutions, rejected solutions. Not class-level design. Most importantly, it needs a clear, unambiguous outcome that is visible to the public.
>
> Love the idea of a more visible "Spark Improvement Proposal" process that solicits user input on new APIs. For what it's worth, I don't think committers are trying to minimize their own work -- every committer cares about making the software useful for users.
> However, it is always hard to get user input and so it helps to have this kind of process. I've certainly looked at the *IPs a lot in other software I use just to see the biggest things on the roadmap.
>
> When you're talking about "changing interfaces", are you talking about public or internal APIs? I do think many people hate changing public APIs and I actually think that's for the best of the project. That's a technical debate, but basically, the worst thing when you're using a piece of software is that the developers constantly ask you to rewrite your app to update to a new version (and thus benefit from bug fixes, etc). Cue anyone who's used Protobuf, or Guava. The "let's get everyone to change their code this release" model works well within a single large company, but doesn't work well for a community, which is why nearly all *very* widely used programming interfaces (I'm talking things like Java standard library, Windows API, etc) almost *never* break backwards compatibility. All this is done within reason though, e.g. we do change things in major releases (2.x, 3.x, etc).
>
> - Trolling
> It's not just trolling. Event time and kafka are technically important and should not be ignored. I've been banging this drum for years. These concerns haven't been fully heard and understood by committers. This is one example of why diversity of enfranchised users is important and governance concerns shouldn't be ignored.
>
> I agree about empowering people interested here to contribute, but I'm wondering, do you think there are technical things that people don't want to work on, or is it a matter of what there's been time to do? Everyone I know does want great Kafka support, event time, etc, it's just a question of working out the details and of course of getting the coding done. This is also an area where I'd love to see more contributions -- in the past, people have done similar-scale contributions in other areas (e.g.
> better integration with Hive, on-the-wire encryption, etc).
>
> FWIW, I think there are three things going on with streaming.
>
> 1) Structured Streaming, which is meant to provide a much higher-level new API. This was meant from the beginning to include event time, various complex forms of windows, and great data source and sink support in a unified framework. It's also, IMHO, much simpler than most existing APIs for this stuff (i.e. look at the number of concepts you have to learn for those versus for this). However, this project is still very early on -- only the bare minimum API came out in 2.0. It's marked as alpha and it's precisely the type of system where I'd expect the API to improve in response to feedback. As with other APIs, such as Spark SQL's SchemaRDD and DataFrame, I think it's good to get it in front of *users* quickly and receive feedback -- even developers discussing among themselves can't anticipate all user needs.
>
> 2) Adding things in Spark Streaming. I haven't personally worked much on this lately, but it is a very reasonable thing that I'd love to see the project do to help current users. For example, consider adding an aggregate-by-event-time operator to Spark Streaming (it can be done using mapWithState), or a sessionization operator, etc.
>
> 3) Another thing that I think is possible is just lowering the latency of both Spark Streaming and Structured Streaming by 10x -- a few folks at Berkeley have been working on this (https://spark-summit.org/2016/events/low-latency-execution-for-apache-spark/). Happy to fork off a thread about how to do it. Their current system requires some new concepts in the Spark scheduler, but from measuring stuff it also seems that you can get somewhere with less intensive changes (most of the overhead is in RPCs, not in the scheduling logic or task execution).
>
> - Jira
> Concretely, automate closing stale jiras after X amount of time.
> It's really surprising to me how much reluctance a community of programmers have shown towards automating their own processes around stuff like this (not to mention automatic code formatting of modified files). I understand the arguments against, but the current alternative doesn't work.
> Concretely, clearly reject and close jiras. I have a backlog of 50+ kafka jiras, many of which are irrelevant at this point, but I do not feel that I have the political power to close them.
> Concretely, make it clear who is working on something. This can be as simple as just "I'm working on this", assign it to me, if I don't follow up in X amount of time, close it or reassign. That doesn't mean there can't be competing work, but it does mean those people should talk to each other. Conversely, if committers currently don't have time to work on something that is important, make that clear in the ticket.
>
> Definitely agree with marking who's working on something early on, and timing it out if inactive. For closing JIRAs, I think the best way I've seen is for people to go through them once in a while. Automated closing is too impersonal IMO -- if I opened a JIRA on a project and nobody looked at it and that happened to me, I'd actively feel ignored. If you do that, you'll see people on stage saying "I reported a bug for Spark and some bot just closed it after 3 months", which is not ideal.
>
> Matei
>
> On Fri, Oct 7, 2016 at 5:34 AM, Sean Owen <so...@cloudera.com> wrote:
>
> > Suggested actions way at the bottom.
> >
> > On Fri, Oct 7, 2016 at 5:14 AM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> >>
> >> since March. But it's true that other things such as the Kafka source for it didn't have as much design on JIRA. Nonetheless, this component is still early on and there's still a lot of time to change it, which is happening.
> >
> > It's hard to drive design discussions in OSS.
> > Even when diligently publishing design docs, the doc happens after brainstorming, and that happens inside someone's head or in chats.
> >
> > The lazy consensus model that works for small changes doesn't work well here. If a committer wants a change, that change will basically be made modulo small edits; vetoes are for dire disagreement. (Otherwise we'd get nothing done.) However this model means it's hard to significantly change a design after draft 1.
> >
> > I've heard this complaint a few times, and it has never been down to bad faith. We should err further towards over-including early and often. I've seen some great discussions start more with a problem statement and an RFC, not a design doc. Keeping regular contributors enfranchised is essential, so that they're willing and able to participate when design time comes. (See below.)
> >
> >>
> >> 2) About what people say at Reactive Summit -- there will always be trolls, but just ignore them and build a great project. Those of us involved in the project for a while have long seen similar stuff, e.g. a
> >
> > The hype cycle may be turning against Spark, as is normal for this stage of maturity. People idealize technologies they don't really use as greener grass; it's the things they use and need to work that they love to hate.
> >
> > I would not dismiss this as just trolling. Customer anecdotes I see suggest that Spark underperforms their (inflated) expectations, and generally does not Just Work. It takes expertise, tuning, patience, workarounds. And then it gets great things done. I do see a gap between how the group here talks about the technology, and how the users I see talk about it. The gap manifests in attention given to making yet more things, and attention given to fixing and project mechanics.
> >
> > I would also not dismiss criticism of governance.
> > We can recognize some big problems that were resolved over even the past 3 months. Usually I hear, well, we do better than most projects, right? and that is true. But, Spark is bigger and busier than most any other project. Exceptional projects need exceptional governance and we have merely "good". See next.
> >
> >> 3) About number and diversity of committers -- the PMC is always working to expand these, and you should email people on the PMC (or even the whole list) if you have people you'd like to propose. In
> >
> > If you're suggesting that it's mostly a matter of asking, then this doesn't match my experience. I have seen a few people consistently soft-reject most proposals. The reasons given usually sound like "concerns about quality", which is probably the right answer to a somewhat wrong question.
> >
> > We should probably be asking primarily who will net-net add efficiency to some part of the project's mechanics. Per above, it wouldn't hurt to ask who would expand coverage and add diversity of perspective too.
> >
> > I disagree that committers are being added at a sufficient rate. The overall committer-attention hours is dropping as the project grows -- am I the only one that perceives many regular committers aren't working nearly as much as before on the project?
> >
> > I call it a problem because we have IMHO people who 'qualify', and not giving them some stake is going to cost the project down the road. Always Be Recruiting. This is what I would worry about, since the governance and enfranchisement issues above kind of stem from this.
> >
> >>
> >> 4) Finally, about better organizing JIRA, marking dead issues, etc, this would be great and I think we just need a concrete proposal for how to do it. It would be best to point to an existing process that someone else has used here BTW so that we can see it in action.
> >
> > I don't think we're wanting for proposals. I went on and on about it last year, and don't think anyone disagreed about actions. I wouldn't suggest that clearing out dead issues is more complex than just putting in time to do it. It's just grunt work and understandably not appealing. (Thank you Xiao for your recent run at SQL JIRAs.)
> >
> > It requires saying 'no', which is hard, because it requires some conviction. I have encountered reluctance to do this in Spark and think that culture should change. Is it weird to say that a broader group of gatekeepers can actually with more confidence and efficiency tackle the triage issue? that pushing back on 'bad' contribution actually increases the rate of 'good'?
> >
> > FWIW I also find the project unpleasant to deal with day to day, mostly because of the scale of the triage, and think we could use all the qualified help we can get. I am looking to do less with the project over time, which is no big deal in itself, but is a big deal if these several factors are adding up to discourage fresh blood from joining the fray. Cody makes me think there are, at least, 2 of us.
> >
> > Concrete steps?
> >
> > Go to spark-prs.com. Look at "Users". Look at your open PRs. Are any stale? can you close them or advance them?
> >
> > Look at the Stale PRs tab and sort by last updated. Do any look dead? can you ask the author to update or close? does the parent JIRA look like it's not otherwise relevant?
> >
> > Go download JIRA Client at http://almworks.com/jiraclient/download.html Go look at all open JIRAs sorted by last update. Are any pretty obviously obsolete?
> >
> > If you don't feel comfortable acting, feel free to at least propose a list to dev@ for a look.