First off, thanks for bringing this up!

On Sep 4, 2009, at 9:13 AM, Sean Owen wrote:

Guys, quick and broad question -- what's the roadmap for Mahout look
like? Even just for the next two releases?

I asked a little while back about this. I think we can put 0.2 out after Robin and Deneche get their pieces in (Random Forests, classification refactoring), which hopefully should be soon since they are now committers.

We've cleaned up a lot and made a number of improvements in the code since 0.1, and it would be good to get them out to a broader audience.

After that, I don't think we particularly have to go through all the 0.X (0.3, 0.4, ...) integers on our way to 1.0. The primary goal before 1.0 is to make sure we are happy with the APIs before (to some extent) "locking them down" for 1.0, but I'm not sure we need to be that worried about locking down, since most of our code isn't public API anyway and we need not necessarily worry about back compatibility. I think the other primary thing we need is to get some larger scale testing in place. I believe Amazon still has its committers program in place, such that committers can get access to EC2 credits for testing. Let me know if anyone needs an account.



Now, much of the project is mostly a space for tinkering, tossing
around bits of code for now, and that's OK for 0.1 or 0.2. I just
wonder what the path to a proper finished product is like. It'll take
some agreement on who exactly the audience is, what they need and
don't need, what interface it presents to those users. It takes work
to design for that, bring the project into line around that design,
document and test, etc. And -- it takes people with responsibility and
authority to make it happen.

I think what we have now goes beyond tinkering, but yes, we are exploring what works and what doesn't. We've got several active committers and some active contributors, which are all good signs, and we actually have a pretty healthy base of mailing list subscribers lurking. We also have users coming in and kicking the tires; we need to capture their needs and keep them interested by responding quickly and in a helpful way. We also need to find a way to pull the lurkers out to help by providing an ever more compelling story.

Open source is always incremental and it takes time to build. It really is never done, and I find open source is often much more fluid than products.




I'm not clear we quite have those things yet. Until we do this will be
an 0.x project that nobody can really get into using for production.
It doesn't have to happen tomorrow, but, what's our path like from
here to there? Spare time from even 10 people won't get the docs
written, tidy the code, refactor / redesign / unify the lot of
copy/paste that's going on, etc. People definitely have ideas about
what the project should do -- I see lots of little bits of
functionality being thrown into the pot. But is it adding up to
something consistent and coherent? should we talk seriously about it?
"Machine learning" is too broad a remit.

I think we are getting there. Some of the answer is above in the first part where I talk about releases. I do think the bits are adding up to real machine learning functionality. We've got utilities in place for getting data into formats that are consumable, and we've got implementations that consume those formats and produce outputs. More examples and, of course, documentation will always help.

It took Lucene 6+ years to reach what I would call a really capable system. The early stages were promising and worked for many, but it was not until 2004-05 that it really started taking off. Not saying that Mahout will take that long, especially given how widely adopted/accepted Open Source is now as compared to the early days of Lucene, but it does take time. That being said, we certainly need to get more people looking at more parts of the code and proposing and implementing improvements.


It's not ruining my day or anything but I'm sitting on a piece of the
project that I put effort into making clearly do a few things, do them
well, and not try to do other things, designed for practical use
cases, and documented and polished and tested it. So I'll be a little
concerned if it's attached to an early-0.x tinkering project this time
next year. That's not cool for an Apache project anyway.

Agreed. Let's get what is marked for 0.2 done and look to release soon thereafter (mid-October?). From there, I'd guess we could do a 0.3 (or even 0.9) in the January-March time frame and then look to make a 1.0 in early summer. People contributing and pushing can obviously move this up. Our job as the committers is to make sure, to some extent, that their efforts don't go to waste.


It may be presumptuous but I volunteer to try to lead answers to these
questions. It's going to lead to some tough answers and more work in
some cases, no matter who drives it. Hoping to do it sooner than
later.

Not at all presumptuous. This is in fact how it works at Apache. Right or wrong, those who do get to make the decisions. That's how the meritocracy works. I personally am committed and I know several others are (obviously, including you) as well.

-Grant
