On Sep 7, 2009, at 4:49 AM, Sean Owen wrote:
I am sure the project needs to refactor and unify the Hadoop-related code. There's a lot of copy and paste at this stage. That would go some way towards abstracting away Hadoop -- would tend to centralize the dependency. I think there's a lot more to it -- abstracting away contacting a cluster? running a job? storing and reading data? Then you're also learning how to configure Mahout's layer, as well as your underlying infrastructure. My gut says it's hard, compared to the value it could add. Given that Hadoop is the de facto standard and big clouds like Amazon directly support it, it seems unlikely someone would not be able to use Hadoop. It's all just my guess given my impressions... My meta-concern is that we don't really have a polished, finished approach to using even Hadoop (which is again to be expected given it's early, and given Hadoop is evolving fast too) -- so would rather focus on tying up loose ends, or documenting and testing, before reaching too much farther.
The hard thing about all of this is, in open source, you never know where the next good idea is coming from, especially in community-driven projects (as opposed to the "benevolent dictator" models where one or two people drive the whole thing). You can plan all you want, but when someone comes along with some really nice idea that doesn't fit into your plans, it's pretty hard to turn them away when it meets with the general goals of the project. For instance, there is a PLSI implementation sitting in JIRA that just so happens to be implemented in Pig. I can't say Pig was in my original plans, but I have no objection to it and plan on committing it once I review it. Even LDA and the frequent pattern mining weren't in the "original" plans, yet I think they are welcome additions.
That's not to say we shouldn't clean things up and do some planning, but as with everything, it's all going to be driven by how people contribute and who takes on the work. The plus side to cleaning up, etc., is that it should make it easier for people to contribute.
-Grant