On Apr 13, 2014 9:45 AM, "Sebastian Schelter" <s...@apache.org> wrote:
>
> Hi,
>
> I took some days to let the latest discussion about the state and future of Mahout go through my head. I think the most important thing to address right now is the MapReduce "legacy" codebase. A lot of the MR algorithms are currently unmaintained, documentation is outdated and the original authors have abandoned Mahout. For some algorithms it is hard to get even questions answered on the mailinglist (e.g. RandomForest). I agree with Sean's comments that letting the code linger around is no option and will continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to delete this codebase, but there were serious objections from committers and users that convinced me that there is still usage of and interest in that codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without harming our userbase. On the other hand, I don't see anyone willing to rework the codebase. Further, the code cannot linger around anymore as it is doing now, especially when we fail to answer questions or don't provide documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope that they find consent. If there are objections, please give alternatives, *keeping everything as-is is not an option*:
>
> * reject any future MR algorithm contributions, prominently state this on the website and in talks

+1, but more importantly, reject any new author who doesn't agree to explicitly pledge multi-year support.

> * make all existing algorithm code compatible with Hadoop 2, if there is no one willing to make an existing algorithm compatible, remove the algorithm

OK, although my gut feeling is that this would take some time.

> * deprecate the existing MR algorithms, yet still take bug fix contributions

I foresee a somewhat smoother MR transition. Deprecation means we lose them in a release, that is, by the fall release. It seems to me it would take longer for us to provide a full replacement and convince ourselves of its production worthiness. Also, deprecation implies we can point a user to something else with "use instead". So I wouldn't deprecate methods just now for which we cannot add this phrase. As somebody mentioned, a long tail for deprecation is a good policy here, IMO.

> * remove Random Forest as we cannot even answer questions to the implementation on the mailinglist

Do we know a direct email for the FPM and Random Forest authors? I'd suggest pinging them one last time. They just may not be tuned in to the list. Both algorithms are kind of in a bread-and-butter category; it would be a huge hit in coverage to just lose them without any resuscitation attempt whatsoever.

>
> There are two more actions that I would like to see, but'd be willing to give up if there are objections:
>
> * move the MR algorithms into a separate maven module

You mean, move them out of mahout-core? So the core is for single-machine stuff only? Plus utils? It seems we probably need to refactor core so that there's no core at all. Our core, realistically, is utils, mahout-math and math-scala (aka scalabindings), and the engine-agnostic logical layer of mahout-spark. But for obvious reasons we probably don't want to put all that in a single module. Maybe at some point later, when these things become more mainstream.

> * remove Frequent Pattern Mining again (we already aimed for that in 0.9 but had one user who shouted but never returned to us)
>
> Let me know what you think.
>
> --sebastian