On Oct 22, 2011, at 2:19 PM, Sean Owen wrote:

> Bringing this to dev@, mid-thread, per Grant's suggestion. There was a
> brief and fruitful thread on private@ to discuss project governance,
> but the topic has shifted such that it's useful to just talk on dev@.
>
> If I may paraphrase: I expressed concern about the sprawl of code and
> algorithms, aging of JIRA issues, and said I thought we probably had
> too few "zookeepers" (to borrow Benson's metaphor) for this big a zoo,
> and too many "breeders" adding new bags of code here and there. People
> seem to have interest in new non-Hadoop platforms in particular, but
> are not giving much attention to what's there now. I expressed concern
> this would lead to an increasingly difficult mess of code and project
> identity, even before 1.0.
>
>
> I proposed narrowing the scope of the project -- while not rejecting
> all new code, strongly weighting towards contributions that enhance
> and fix existing code rather than new green-field code through the 0.6
> and 1.0 release. I also proposed a concerted effort to clean up JIRA
> in the short term.
>
> Most discussion centered around the final proposal: in exchange for
> putting off a lot of new, different stuff, I proposed thinking of
> "Mahout2" as a place for those ideas -- perhaps even as not
> Hadoop-based. It could be a mostly green-field rewrite. Now, keep in
> mind this would be quite a ways away -- just a point of discussion
> now. But it might be a good conceptual rationalization for nearly
> freezing scope now and polishing -- because we'd have a future bucket
> to put these in and time to talk it over.

As background, I think many of us are realizing that Hadoop isn't great for the actual learning process in all cases. Instead, you use it up front for ETL and dimensionality reduction (SVD, random projection, etc.), and then you want a fast, potentially distributed, likely iterative approach that keeps most things in memory if possible.

>
> So: comments are certainly welcome on the above!
>
>
> And now I'd like to rejoin the thread by replying to Grant below:
>
> My broader concern is that an unhealthy project is the bigger danger
> to the community. And, I see some patterns that feel harmful, to me.

Of course. I guess we just don't agree on the harm factor. The fact of the matter is, this is still an all-volunteer project.

>
> I do see large patches sit in JIRA for 6 months and get cancelled.
> That's not good -- either it was a good patch and should have been
> picked up, or it wasn't suitable, and it should have been clearer it
> wasn't suitable before the contributor went to the trouble.
>
> I see JIRAs tagged for version X, and then untouched and slipped to
> version X+1, X+2. This means that the community doesn't have credible
> information about when an issue is going to be addressed. In fact they
> don't have info about *if* the issue is even good to address, and so
> worth working in (see point above).
>
> Finally I see a lot of "Someone should do X at some point" JIRAs, and
> Someone rarely does them. While these feel like work and progress, I
> think they're harmful: it shows the community that to-dos aren't
> always done. It condones a culture of post-it-notes for future work.
>
>
> I assert that even in open source (perhaps all the more?) we do need
> enough project coordination such that these problems don't crop up. We
> should be able to meet basic expectations about scope, process, and
> roadmap -- not nearly as much as a corporate software project, but
> some semblance of it.
> This is a separate thing from providing space
> for ideas, to-dos, thoughts, bits of code, etc.
>
> And maybe we're just disagreeing about how to implement those two
> things. I think JIRA is for project coordination, and I think the
> mailing list or wiki, or Github if you like are for ideas, open-ended
> brainstorms, to-dos. Taking something into JIRA and letting it sit
> there, to me, is therefore hurting and not helping (see above).
>
> If you see JIRA as a place for ideas and loose ends -- then of course
> this doesn't look like any problem! But then I'd ask, where's the
> project plan? Because we need that.

I simply don't see how anything but a very loose project plan (below is my understanding of our current plan, which I think is good) can happen in reality, and I don't see most of what you describe in the last few paragraphs as harmful. The fundamental strength and weakness of open source at the ASF simply boil down to: "You never know where the next good idea is coming from" and its corollary: "You never know where the next major bug is coming from". Thus, my take on planning is:

1. We aim for releases every 6 months or so.
2. We make a best guess up front about what bug fixes will be in that release, but we also will, obviously, bring in other fixes as they are reported and dealt with.
3. As for features, we are all committers and we all have our itches that we wish to scratch. I can't predict 6 months out what that will be. Nor can you. And if we have some project plan that prevents me from adding my feature b/c it isn't in this release cycle and I have to wait 3 months to do so, then all you've (not you, personally, but the "royal" you) done is discourage me from contributing it at all! So, then, I go over to Github and do it there. Next thing you know, my fork on Github is truly a fork. Now multiply that by many others and we have no project at all.

My experience, and it could be wrong, is that the coordination we need often comes about naturally by people simply picking up the yoke of the things they want done and plowing forward. If others like it, they will pitch in and help. If they don't like it, they will either pitch in and help or be quiet, because ultimately, those who do the work have the say.

In other words, I think we already do most of this. We just don't have enough people doing the actual work to accomplish all that we aim to accomplish. A project plan isn't going to change that.

Besides, I guess I would assert we already have a plan. It goes something like this: build scalable machine learning libs, primarily focused on the 3 C's. To that end, we all know we need to:

1. Coalesce the clustering/classification stuff
2. Improve our ETL layer, feature selection libraries, examples and output
3. Document this stuff and clean up the wiki
4. Write more/better tests
5. Get automated patch checking in
6. Test at scale

At least, that's what I've heard us all say on repeated occasions for some time now. We've just acted on it in fits and starts. So, now the question is, who is ready to act on it?

----

As for JIRA as post-it notes, I think it's the only way we get reliable contributions from the community that are easily surfaced, findable and actionable. Everything else you propose above (Github, mailing list, etc.) does not work for that, and some actually make the problem harder (Github in particular, if the author accepts pull requests from others). It is also the only way we can reliably know that patches, in whatever state, are clearly marked as being donated to the ASF should we choose to incorporate them.

As for the community having credible information, that is life. Communities around proprietary projects have the exact same problem; it's just that here you get to see the bug database. If someone wants an issue addressed, they need to push on it and be persistent.
In the end, they can simply ask what the status is.

Lucene and most every other project at the ASF work exactly as you describe above, AFAICT. Even the projects with huge corporate backers have the exact same issues. Sometimes, unfortunately, it seems to be chaos, but in the end, it works out, b/c of the meritocracy. In my mind, I simply try to focus on managing the things I can while keeping in mind the bigger goals of the project.

>
> I don't think that it's wrong to close an issue that hasn't been
> touched for 9 months as WontFix. I'm not being anti-community.

I don't think you are being anti-community. For the most part, I think you are just creating work for yourself. Secondly, I think you are delivering news that discourages contributions at a later date. That being said, I'm not going to stop you from doing it. I will simply reopen the ones I still want to keep, even if they only ever get done in my optimal world. At a minimum, please tell me you have a script and cron job that simply goes in and marks them every 9 months. :-)

Practically speaking, I guess my proposal would be that we simply move to a mark-and-sweep model with every release that:

1. Leaves open everything by default with no version number
2. Moves all unfixed items for a specific version to the next one unless otherwise marked elsewhere. This way we have a list that we can refer to.
3. Marks any open ones that we deem to be DOA as "Won't Fix"

Also, perhaps, as a compromise here, when you do the bulk mark and sweep, you can add a comment on the issue that says something to the effect of: "This issue has been closed due to lack of any visible progress for the past X months. If you wish to work on it, please reopen the issue and start iterating on it."

FWIW, this JIRA issue might be a good discussion question for community@a.o.

>
>
> My answer of course is simple: rein in scope to match effort
> available. It's simplistic but sure works.
>

I definitely agree w/ reining in scope.
Ironically, doing a cull of code actually increases scope in the short term, since now we need to do that as well as clean up issues, add features, etc. I'm for such a cull, but I suspect it will be less culling and more reorganization. I'd simply suggest that instead of trying to do some massive coordination of such an effort, you (again, the "royal" you) simply open up JIRA issues, attach a patch and, when you are satisfied w/ the patch, indicate on the issue that you will commit within 3 days unless you hear objections. Three days later, do the commit.

My first suggestion would be Watchmaker!

-Grant
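
As a rough illustration of the mark-and-sweep proposal above, here is a minimal sketch in Python. It works on an in-memory list and does not talk to JIRA at all. The `Issue` class, the `mark_and_sweep` and `months_idle` functions, the "EXAMPLE-n" keys and the "0.7" version string are all hypothetical names made up for the example. Treating "deemed DOA" as "untouched for 9 months or more" is also an assumption, borrowed from the 9-month figure in the thread; in practice that call would presumably be made by a person.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Comment wording taken from the proposal; {months} fills in the "X".
CLOSE_COMMENT = (
    "This issue has been closed due to lack of any visible progress for the "
    "past {months} months. If you wish to work on it, please reopen the issue "
    "and start iterating on it."
)


@dataclass
class Issue:
    """Hypothetical in-memory stand-in for a JIRA issue (not the JIRA API)."""
    key: str
    last_updated: date
    status: str = "Open"
    fix_version: Optional[str] = None
    comments: List[str] = field(default_factory=list)


def months_idle(issue: Issue, today: date) -> int:
    """Calendar-month difference since the last update (ignores day of month)."""
    return (today.year - issue.last_updated.year) * 12 + (
        today.month - issue.last_updated.month
    )


def mark_and_sweep(issues, released, upcoming, today, idle_months=9):
    """Apply the three rules once, at release time, modifying issues in place."""
    for issue in issues:
        if issue.status != "Open":
            continue  # already resolved one way or another
        if months_idle(issue, today) >= idle_months:
            # Rule 3 (assumed DOA test): close with an explanatory comment.
            issue.status = "Won't Fix"
            issue.comments.append(CLOSE_COMMENT.format(months=idle_months))
        elif issue.fix_version == released:
            # Rule 2: unfixed items slip to the next version.
            issue.fix_version = upcoming
        # Rule 1: anything else, including unversioned issues, stays open.
    return issues


if __name__ == "__main__":
    issues = [
        Issue("EXAMPLE-1", date(2011, 1, 5), fix_version="0.6"),
        Issue("EXAMPLE-2", date(2011, 9, 30), fix_version="0.6"),
        Issue("EXAMPLE-3", date(2011, 10, 1)),
    ]
    mark_and_sweep(issues, released="0.6", upcoming="0.7", today=date(2011, 10, 22))
    for issue in issues:
        print(issue.key, issue.status, issue.fix_version)
```

Checking the idle rule before the version rule means a long-dead issue gets closed rather than slipped yet again, which seems closest to the intent of the compromise, but the ordering is a design choice and could just as well go the other way.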