Re: Tackling the "legacy dilemma"

Dmitriy Lyubimov Sun, 13 Apr 2014 10:18:12 -0700

On Apr 13, 2014 9:45 AM, "Sebastian Schelter" <s...@apache.org> wrote:
>
> Hi,
>
> I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to get even
questions answered on the mailinglist (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is no option and will
continue to harm Mahout.
>
> In the previous discussion, I suggested to make a radical move and aim to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interest in that
codebase.
>
> That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot linger around anymore as it
is doing now, especially when we fail to answer questions or don't provide
documentation.
>
> *We have to make a move*!
>
> I suggest the following actions with regard to the MR codebase. I hope
that they find consent. If there are objections, please give alternatives,
*keeping everything as-is is not an option*:
>
>  * reject any future MR algorithm contributions, prominently state this
on the website and in talks
+1, but more importantly, reject any new author who doesn't agree to
explicitly pledge multi-year support.
>  * make all existing algorithm code compatible with Hadoop 2, if there is
no one willing to make an existing algorithm compatible, remove the
algorithm
Ok, although my gut feeling is that this would take some time.


>  * deprecate the existing MR algorithms, yet still take bug fix
contributions
I foresee a somewhat smoother MR transition. Deprecation means we lose them
in a release. That is, by the fall release. It seems to me it would take
longer for us to provide a full replacement and convince ourselves of its
production worthiness.
Also, deprecation implies we can point a user to something else with "use
instead". So I wouldn't deprecate methods just now for which we cannot add
this phrase. As somebody mentioned, a long tail for deprecation is a good
policy here, IMO.

>  * remove Random Forest as we cannot even answer questions to the
implementation on the mailinglist

Do we know a direct email for the FPM and Random Forest authors? I'd suggest
pinging them one last time. They just may not be tuned in to the list. Both
algorithms are kind of in a bread-and-butter category; it would be a huge
hit in coverage to just lose them without any resuscitation attempt
whatsoever.

>
> There are two more actions that I would like to see, but I'd be willing to
give up if there are objections:
>
>  * move the MR algorithms into a separate maven module
You mean, move them out of mahout-core? So the core is for single-machine
stuff only? Plus utils? It seems we probably need to refactor core so there's
no core at all. Our core, realistically, is utils, mahout-math &
math-scala (aka scalabindings), and the engine-agnostic logical layer of
mahout-spark. But for obvious reasons we probably don't want to put all that
in a single module. Maybe at some point later when these things become more
mainstream.

>  * remove Frequent Pattern Mining again (we already aimed for that in 0.9
but had one user who shouted but never returned to us)
>
> Let me know what you think.
>
> --sebastian
