Hi,

Would you consider refactoring Mahout so that the project follows a clear,
layered structure for all algorithms, and documenting that structure? For
example:


   - All algorithms take Mahout matrices as input and output matrices as
   the learned model.
   - All preprocessing tools should be generic enough to produce
   appropriate inputs for Mahout algorithms.
   - All algorithms should output the learned model so that people can use
   it beyond training and testing.
   - Tools that dump results (e.g. clusterdump) should follow a strictly
   defined format agreed on by the community.
   - Evaluation tools should be generic enough so they can be used by all
   similar kinds of algorithms.
   - ...

With such a structure, users would know the steps they need to perform to
use Mahout, and any one step could be replaced by an alternative.

Developers would know the inputs and outputs of their contributions clearly
and they would contribute to the layer (preprocessing, algorithm, etc.)
they feel comfortable with.
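
To make the layering concrete, here is a rough sketch of what I have in
mind. Every name in it (SimpleMatrix, Preprocessor, Learner, Evaluator) is
hypothetical and does not exist in Mahout; SimpleMatrix is only a stand-in
for a Mahout matrix so that the example stays self-contained.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these types are part of Mahout.
final class LayeredPipelineSketch {

    /** Stand-in for a Mahout matrix: a plain array of rows. */
    static final class SimpleMatrix {
        final double[][] rows;
        SimpleMatrix(double[][] rows) { this.rows = rows; }
        int numRows() { return rows.length; }
        int numCols() { return rows.length == 0 ? 0 : rows[0].length; }
    }

    /** Preprocessing layer: raw records in, matrix out. */
    interface Preprocessor<R> { SimpleMatrix toMatrix(List<R> records); }

    /** Algorithm layer: matrix in, learned model (also a matrix) out. */
    interface Learner { SimpleMatrix train(SimpleMatrix input); }

    /** Evaluation layer: scores any matrix-shaped model on held-out data. */
    interface Evaluator { double score(SimpleMatrix model, SimpleMatrix heldOut); }

    /** Toy preprocessor: parses comma-separated numeric lines. */
    static final Preprocessor<String> CSV = records -> {
        double[][] rows = new double[records.size()][];
        for (int i = 0; i < rows.length; i++) {
            rows[i] = Arrays.stream(records.get(i).split(","))
                    .mapToDouble(s -> Double.parseDouble(s.trim()))
                    .toArray();
        }
        return new SimpleMatrix(rows);
    };

    /** Toy learner: the "model" is a 1 x n matrix of column means. */
    static final Learner COLUMN_MEANS = input -> {
        double[] means = new double[input.numCols()];
        for (double[] row : input.rows) {
            for (int j = 0; j < means.length; j++) {
                means[j] += row[j] / input.numRows();
            }
        }
        return new SimpleMatrix(new double[][] { means });
    };

    /** Toy evaluator: mean squared error of held-out rows against the model row. */
    static final Evaluator MEAN_SQUARED_ERROR = (model, heldOut) -> {
        double sum = 0.0;
        int count = 0;
        for (double[] row : heldOut.rows) {
            for (int j = 0; j < row.length; j++) {
                double diff = row[j] - model.rows[0][j];
                sum += diff * diff;
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    };

    /** The whole workflow; each argument is one replaceable layer. */
    static <R> double run(List<R> train, List<R> test,
                          Preprocessor<R> pre, Learner learner, Evaluator eval) {
        SimpleMatrix model = learner.train(pre.toMatrix(train));
        return eval.score(model, pre.toMatrix(test));
    }

    public static void main(String[] args) {
        double mse = run(Arrays.asList("1,2", "3,4"), Arrays.asList("2,3"),
                CSV, COLUMN_MEANS, MEAN_SQUARED_ERROR);
        System.out.println("held-out MSE = " + mse);
    }
}
```

Swapping the learner or the evaluator means passing a different argument to
run; nothing else in the workflow has to change.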

Mahout has tools for nearly all of the steps listed here, but personally,
when I use Mahout (and I’ve been using it for a long time), I feel lost
about which steps I should follow.

Moreover, the refactoring could eliminate duplicate data structures and
standardize on Mahout matrices wherever they are available. All similarity
measures should operate on vectors, for example.
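
As a small illustration of the last point, here is a sketch of a single
vector-based similarity contract shared by every consumer. The
VectorSimilarity interface and the mostSimilarRow helper are hypothetical
names, and double[] stands in for a Mahout vector to keep the example
self-contained.

```java
// Hypothetical sketch: VectorSimilarity and mostSimilarRow are not Mahout APIs.
final class VectorSimilaritySketch {

    /** One contract for every similarity measure: two vectors in, one score out. */
    interface VectorSimilarity { double similarity(double[] a, double[] b); }

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
    }

    /** Cosine similarity; returns 0 when either vector has zero norm. */
    static final VectorSimilarity COSINE = (a, b) -> {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    };

    /** Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot); 0 for two zero vectors. */
    static final VectorSimilarity TANIMOTO = (a, b) -> {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denominator = normA + normB - dot;
        return denominator == 0.0 ? 0.0 : dot / denominator;
    };

    /**
     * A consumer that does not care which measure it is given. A recommender
     * and a clustering algorithm could both be written against the same contract.
     * Returns the index of the row most similar to the query, or -1 if there are no rows.
     */
    static int mostSimilarRow(double[][] rows, double[] query, VectorSimilarity measure) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < rows.length; i++) {
            double score = measure.similarity(rows[i], query);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] rows = { { 0.0, 1.0 }, { 1.0, 0.0 } };
        double[] query = { 2.0, 0.0 };
        System.out.println("cosine pick:   " + mostSimilarRow(rows, query, COSINE));
        System.out.println("tanimoto pick: " + mostSimilarRow(rows, query, TANIMOTO));
    }
}
```

With one contract like this, a new measure is written once and becomes
usable by every algorithm that accepts it.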

An illustrative example: in our lab, we implemented an HBase-backed Mahout
Matrix, which we use in projects where online algorithms operate on large
data and learn a parameter matrix (one needs this for matrix
factorization based recommenders). The parameter matrix then becomes an
input for the live system. This refactoring cascaded, and we replaced the
underlying data structures of the Recommender DataModel with a persistent
matrix.

Now:


   - Everyone knows that any dataset should be in Mahout matrix format, and
   either applies an appropriate preprocessing step or writes one.
   - We can use different recommenders interchangeably.
   - Any optimization of matrix operations applies everywhere.
   - Different people can work on different parts (evaluation, model
   optimization, recommender algorithms) without bothering others.

Apart from all this, I should say that I am always eager to contribute to
Mahout, as some of the committers already know.

Best Regards

On Tue, Mar 26, 2013 at 5:23 PM, Isabel Drost <isa...@apache.org> wrote:

> On Tue, Mar 26, 2013 at 3:59 PM, Grant Ingersoll <gsing...@apache.org
> >wrote:
>
> > I believe the GSOC proposal for Mentors is due soon, so if someone is
> > doing it, they better hop on comdev ASAP and submit.
> >
>
> For more information also check <http://community.apache.org/gsoc.html> -
> in particular the "for mentors" bit of the page.
>
>
> Isabel
>



-- 
Gokhan
