Hi,

From reading the thread, I have the impression that we agree on the following actions:

 * reject any future MR algorithm contributions, and prominently state this
   on the website and in talks
 * make all existing algorithm code compatible with Hadoop 2; if no one is
   willing to make an existing algorithm compatible, remove the algorithm
 * deprecate Canopy clustering (a sketch of what this could look like in
   code follows below)
 * email the original FPM and random forest authors to ask them to maintain
   the algorithms
 * rename core to "mr-legacy" (and gradually pull items we really need out
   of that later)
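
For the deprecation point, a minimal sketch of what this could look like in
code (the class name CanopyDriver is used for illustration; the actual
classes to annotate, and their members, may differ):

    /**
     * Driver for Canopy clustering.
     *
     * @deprecated Canopy clustering is unmaintained and scheduled for
     *             removal in a future release; only bug fixes are accepted.
     */
    @Deprecated
    public class CanopyDriver {
      // The existing implementation stays as-is; the annotation alone makes
      // every caller see a deprecation warning at compile time.
    }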

I will create JIRA tickets for those action points. I think the biggest challenge here is the Hadoop 2 compatibility; is someone volunteering to drive that? That would be awesome.

Best,
Sebastian


On 04/13/2014 07:19 PM, Andrew Musselman wrote:
This is a good summary of how I feel too.

On Apr 13, 2014, at 10:15 AM, Sebastian Schelter <s...@apache.org> wrote:

Unfortunately, it's not that easy to get enough voluntary work. I issued the 
third call for working on the documentation today, as there are still lots of 
open issues. That's why I'm trying to suggest a move that involves as little 
work as possible.

We should get the MR codebase into a state that we all can live with and then 
focus on new stuff like the Scala DSL.

--sebastian




On 04/13/2014 07:09 PM, Giorgio Zoppi wrote:
The best thing would be to make a plan and see how much effort is needed for
this, then find volunteers to accomplish the task. I'm quite sure there are a
lot of people out there who are willing to help out.

BR,
deneb.


2014-04-13 18:45 GMT+02:00 Sebastian Schelter <s...@apache.org>:

Hi,

I took some days to let the latest discussion about the state and future
of Mahout go through my head. I think the most important thing to address
right now is the MapReduce "legacy" codebase. A lot of the MR algorithms
are currently unmaintained, documentation is outdated and the original
authors have abandoned Mahout. For some algorithms it is hard to even get
questions answered on the mailing list (e.g. RandomForest). I agree with
Sean's comments that letting the code linger around is not an option and will
continue to harm Mahout.

In the previous discussion, I suggested making a radical move and aiming to
delete this codebase, but there were serious objections from committers and
users that convinced me that there is still usage of and interest in that
codebase.

That puts us into a "legacy dilemma". We cannot delete the code without
harming our userbase. On the other hand, I don't see anyone willing to
rework the codebase. Further, the code cannot keep lingering around the way
it does now, especially when we fail to answer questions or don't provide
documentation.

*We have to make a move*!

I suggest the following actions with regard to the MR codebase. I hope
that they find consensus. If there are objections, please give alternatives;
*keeping everything as-is is not an option*:

  * reject any future MR algorithm contributions, and prominently state this on
the website and in talks
  * make all existing algorithm code compatible with Hadoop 2 (see the sketch
after this list); if no one is willing to make an existing algorithm
compatible, remove the algorithm
  * deprecate the existing MR algorithms, yet still take bug fix
contributions
  * remove Random Forest as we cannot even answer questions about the
implementation on the mailing list
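
To give an idea of the kind of work the Hadoop 2 point implies (this is my
assumption about a typical case, not a survey of the codebase): jobs still
written against the old org.apache.hadoop.mapred API would be ported to the
org.apache.hadoop.mapreduce API, which both Hadoop 1 and Hadoop 2 ship. A
minimal sketch of a ported mapper (TokenCountMapper is a made-up example
class):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Extends the new-API Mapper class instead of implementing the old
    // org.apache.hadoop.mapred.Mapper interface.
    public class TokenCountMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);
      private final Text token = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        // Emit (token, 1) for every whitespace-separated token; the Context
        // object replaces the old OutputCollector/Reporter pair.
        for (String t : line.toString().split("\\s+")) {
          token.set(t);
          ctx.write(token, ONE);
        }
      }
    }

On top of the API port, the build would also have to depend on the Hadoop 2
artifacts, but that is a build setup question rather than a code one.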

There are two more actions that I would like to see, but that I'd be willing
to give up if there are objections:

  * move the MR algorithms into a separate maven module
  * remove Frequent Pattern Mining again (we already aimed for that in 0.9,
but one user objected loudly and then never got back to us)

Let me know what you think.

--sebastian

