On Oct 22, 2011, at 2:19 PM, Sean Owen wrote:

> Bringing this to dev@, mid-thread, per Grant's suggestion. There was a
> brief and fruitful thread on private@ to discuss project governance,
> but the topic has shifted such that it's useful to just talk on dev@.
> 
> If I may paraphrase: I expressed concern about the sprawl of code and
> algorithms, aging of JIRA issues, and said I thought we probably had
> too few "zookeepers" (to borrow Benson's metaphor) for this big a zoo,
> and too many "breeders" adding new bags of code here and there. People
> seem to have interest in new non-Hadoop platforms in particular, but
> are not giving much attention to what's there now. I expressed concern
> this would lead to an increasingly difficult mess of code and project
> identity, even before 1.0.
> 
> 
> I proposed narrowing the scope of the project -- while not rejecting
> all new code, strongly weighting towards contributions that enhance
> and fix existing code rather than new green-field code through the 0.6
> and 1.0 release. I also proposed a concerted effort to clean up JIRA
> in the short term.
> 
> Most discussion centered around the final proposal: in exchange for
> putting off a lot of new, different stuff, I proposed thinking of
> "Mahout2" as a place for those ideas -- perhaps even as not
> Hadoop-based. It could be a mostly green-field rewrite. Now, keep in
> mind this would be quite a ways away -- just a point of discussion
> now. But it might be a good conceptual rationalization for nearly
> freezing scope now and polishing -- because we'd have a future bucket
> to put these in and time to talk it over.

As background, I think many of us are realizing that Hadoop isn't great for the 
actual learning process in all cases, but instead, you use it up front to do 
ETL, dimensionality reduction (SVD, random projection, etc.) and then you want 
a fast, potentially distributed, likely iterative approach and one that likely 
keeps most things in memory if possible.

> 
> So: comments are certainly welcome on the above!
> 
> 
> And now I'd like to rejoin the thread by replying to Grant below:
> 
> My broader concern is that an unhealthy project is the bigger danger
> to the community. And, I see some patterns that feel harmful, to me.

Of course.  I guess we just don't agree on the harm factor.  The fact of the 
matter is, this is still an all-volunteer project.  

> 
> I do see large patches sit in JIRA for 6 months and get cancelled.
> That's not good -- either it was a good patch and should have been
> picked up, or it wasn't suitable, and it should have been clearer it
> wasn't suitable before the contributor went to the trouble.
> 
> I see JIRAs tagged for version X, and then untouched and slipped to
> version X+1, X+2. This means that the community doesn't have credible
> information about when an issue is going to be addressed. In fact they
> don't have info about *if* the issue is even good to address, and so
> worth working in (see point above).
> 
> Finally I see a lot of "Someone should do X at some point" JIRAs, and
> Someone rarely does them. While these feel like work and progress, I
> think they're harmful: it shows the community that to-dos aren't
> always done. It condones a culture of post-it-notes for future work.
> 
> 
> I assert that even in open source (perhaps all the more?) we do need
> enough project coordination such that these problems don't crop up. We
> should be able to meet basic expectations about scope, process, and
> roadmap -- not nearly as much as a corporate software project, but
> some semblance of it. This is a separate thing from providing space
> for ideas, to-dos, thoughts, bits of code, etc.
> 
> And maybe we're just disagreeing about how to implement those two
> things. I think JIRA is for project coordination, and I think the
> mailing list or wiki, or Github if you like are for ideas, open-ended
> brainstorms, to-dos. Taking something into JIRA and letting it sit
> there, to me, is therefore hurting and not helping (see above).
> 
> If you see JIRA as a place for ideas and loose ends -- then of course
> this doesn't look like any problem! But then I'd ask, where's the
> project plan? Because we need that.

I simply don't see how anything but a very loose project plan (below is my 
understanding of our current plan, which I think is good) can happen in reality 
and I don't see most of the last few paragraphs you describe as harmful.  The 
fundamental strength and weakness of open source at the ASF simply boils down 
to: "You never know where the next good idea is coming from" and its 
corollary: "You never know where the next major bug is coming from".  Thus, my 
take on planning is:

1. We aim for releases every 6 months or so
2. We make a best guess up front about what bug fixes will be in that release, 
but we also will, obviously, bring in other fixes as they are reported and 
dealt with
3. As for features, we are all committers and we all have our itches that we 
wish to scratch.  I can't predict 6 months out what that will be.  Nor can you. 
 And if we have some project plan that prevents me from adding my feature b/c 
it isn't in this release cycle and I have to wait 3 months to do so, then all 
you've (not you, personally, but the "royal" you) done is discourage me from 
contributing it at all!  So, then, I go over to Github and do it there.  Next 
thing you know, my Fork on Github is truly a fork.  Now multiply that by many 
others and we have no project at all.  My experience, and it could be wrong, 
is that the coordination we need often comes about naturally by people simply 
picking up the yoke of the things they want done and plowing forward.  If 
others like it, they will pitch in and help.  If they don't like it, they will 
either pitch in and help or be quiet, because ultimately, those who do the work 
have the say.

In other words, I think we already do most of this.  We just don't have enough 
people doing the actual work to accomplish all that we aim to accomplish.  A 
project plan isn't going to change that.  

Besides, I guess I would assert we already have a plan.  It goes something like 
this:  Build scalable machine learning libs, primarily focused on the 3 C's.  
To that end, we all know we need to:
1. Coalesce the clustering/classification stuff
2. Improve our ETL layer, feature selection libraries, examples and output
3. Document this stuff and clean up the wiki
4. Write more/better tests
5. Get automated patch checking in.
6. Test at scale.

At least, that's what I've heard us all say on repeated occasions for some time 
now.  We've just acted on it in fits and starts.  So, now the question is, who 
is ready to act on it?

----

As for JIRA as post it notes, I think it's the only way we get reliable 
contributions from the community that are easily surfaced, findable and 
actionable.  Everything else you propose above (Github, mailing list, etc.) 
does not work for that and some actually make the problem harder (Github in 
particular, if the author accepts pull requests from others).  It is also the 
only way we can reliably know that patches, in whatever state, are clearly 
marked as being donated to the ASF should we choose to incorporate them.  

As for the community having credible information, that is life.  Communities 
around proprietary projects have the exact same problem; it's just that here you get 
to see the bug database.  If someone wants an issue addressed, they need to 
push on it and be persistent.  In the end, they can simply ask what the status 
is.

Lucene and most every other project at the ASF works exactly as you describe 
above, AFAICT.  Even the projects with huge corporate backers have the exact 
same issues.  Sometimes, unfortunately, it seems to be chaos, but in the end, 
it works out, b/c of the meritocracy.  In my mind, I simply try to focus on 
managing the things I can while keeping in mind the bigger goals of the 
project.  

> 
> I don't think that it's wrong to close an issue that hasn't been
> touched for 9 months as WontFix. I'm not being anti-community.

I don't think you are being anti-community.  For the most part, I think you are 
just creating work for yourself.  Secondly, I think you are delivering news 
that discourages contributions at a later date.  That being said, I'm not going 
to stop you from doing it.  I will simply reopen the ones I still want to keep, 
even if they only ever get done in my optimal world.  

At a minimum, please tell me you have a script and cron job that simply goes in 
and marks them every 9 months.  :-)

Practically speaking, I guess my proposal would be that we simply move to a 
mark and sweep model with every release that:
1. Leaves open everything by default with no version number
2. Moves all unfixed items for a specific version to the next one unless 
otherwise marked elsewhere.  This way we have a list that we can refer to.
3. Marks any open ones that we deem to be DOA as "Won't Fix"

Also, perhaps, as a compromise here, when you do the bulk mark and sweep, you 
can add a comment on the issue that says something to the effect of:  "This 
issue has been closed due to lack of any visible progress for the past X 
months.  If you wish to work on it, please reopen the issue and start iterating 
on it."
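
For what it's worth, the mark and sweep described above could be sketched 
roughly as follows.  This is only an illustration under stated assumptions: it 
works on in-memory records and does not talk to JIRA at all, and the Issue 
class, its fields (including the "doa" flag and "months_idle"), the issue keys 
and the "0.7" version string are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Comment text taken from the proposal above; {months} stands in for "X".
CLOSE_COMMENT = (
    "This issue has been closed due to lack of any visible progress for the "
    "past {months} months.  If you wish to work on it, please reopen the "
    "issue and start iterating on it."
)


@dataclass
class Issue:
    """Hypothetical stand-in for a tracker issue (not a real JIRA API type)."""
    key: str
    fix_version: Optional[str] = None   # None means "no version number"
    status: str = "Open"
    doa: bool = False                   # set by a human who deems it dead
    months_idle: int = 0
    comments: List[str] = field(default_factory=list)


def mark_and_sweep(issues: List[Issue], released: str, next_version: str) -> None:
    """Apply the three rules in place at release time."""
    for issue in issues:
        if issue.status != "Open":
            continue  # already fixed or closed: nothing to sweep
        if issue.doa:
            # Rule 3: deemed DOA, so close as "Won't Fix" and say why.
            issue.status = "Won't Fix"
            issue.comments.append(CLOSE_COMMENT.format(months=issue.months_idle))
        elif issue.fix_version == released:
            # Rule 2: unfixed in this release, so carry it to the next one.
            issue.fix_version = next_version
        # Rule 1: anything else (e.g. no version number) simply stays open.


issues = [
    Issue("EXAMPLE-1"),                                # unversioned idea
    Issue("EXAMPLE-2", fix_version="0.6"),             # slipped this release
    Issue("EXAMPLE-3", fix_version="0.6", doa=True, months_idle=9),
    Issue("EXAMPLE-4", fix_version="0.6", status="Fixed"),
]
mark_and_sweep(issues, released="0.6", next_version="0.7")
for issue in issues:
    print(issue.key, issue.status, issue.fix_version)
```

The only judgment call is the "doa" flag, which still has to be set by a 
person; everything else is mechanical, which is the point.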

FWIW, this JIRA issue might be a good discussion question for community@a.o.

> 
> 
> My answer of course is simple: rein in scope to match effort
> available. It's simplistic but sure works.
> 
> 

I definitely agree w/ reining in scope.  Ironically, doing a cull of code is 
actually increasing scope in the short term since now we need to do that as 
well as clean up issues, add features, etc.  I'm for such a cull, but I suspect 
it will be less culling and more reorganization.  I'd simply suggest that 
instead of trying to do some massive coordination of such an effort, you 
(again, the "royal" you) simply open up JIRA issues, attach a patch and, when 
you are satisfied w/ the patch, you indicate on the issue that you will commit 
within 3 days unless you hear objections.  Three days later, do the commit.   
My first suggestion would be Watchmaker!

-Grant
