On Sep 7, 2009, at 4:49 AM, Sean Owen wrote:

I am sure the project needs to refactor and unify the Hadoop-related
code. There's a lot of copy and paste at this stage. That would go
some way towards abstracting away Hadoop -- would tend to centralize
the dependency.

I think there's a lot more to it -- abstracting away contacting a
cluster? running a job? storing and reading data? Then you're also
learning how to configure Mahout's layer, as well as your underlying
infrastructure. My gut says it's hard, compared to the value it could
add. Given that Hadoop is the de facto standard and big clouds like
Amazon directly support it, it seems unlikely someone would not be
able to use Hadoop. It's all just my guess given my impressions...

My meta-concern is that we don't really have a polished, finished
approach to using even Hadoop (which is again to be expected given
it's early, and given Hadoop is evolving fast too) -- so would rather
focus on tying up loose ends, or documenting and testing, before
reaching too much farther.


The hard thing about all of this is, in open source, you never know where the next good idea is coming from, especially in community-driven projects (as opposed to the "benevolent dictator" models where one or two people drive the whole thing). You can plan all you want, but when someone comes along with a really nice idea that doesn't fit into your plans, it's pretty hard to turn them away when it fits the general goals of the project. For instance, there is a PLSI implementation sitting in JIRA that just so happens to be implemented in Pig. I can't say Pig was in my original plans, but I have no objection to it and plan on committing it once I review it. Even LDA and the frequent pattern mining weren't in the "original" plans, yet I think they are welcome additions.

That's not to say we shouldn't clean things up and do some planning, but as with everything, it's all going to be driven by how people contribute and who takes on the work. The plus side of cleaning up is that it should make it easier for people to contribute.

-Grant
