Gokhan, I totally agree that we need all of that. Would you mind
starting a new thread about this?
This thread is great for listing ideas, but it's already become pretty
long and it's getting hard to keep track.

On Tue, Mar 26, 2013 at 6:38 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
> Hi,
>
> Would you consider refactoring Mahout so that the project follows a clear,
> layered structure for all algorithms, and documenting it? For example:
>
>
>    - All algorithms take Mahout matrices as input and output matrices as
>    the learned model
>    - All preprocessing tools should be generic enough that they produce
>    appropriate inputs for Mahout algorithms
>    - All algorithms should output the learned model so that people can use
>    them beyond training and testing
>    - Tools that dump results (e.g. clusterdump) should follow a strictly
>    defined format suggested by the community.
>    - Evaluation tools should be generic enough so they can be used by all
>    similar kinds of algorithms.
>    - ...
>
> Users would know the steps they need to perform to use Mahout, and one step
> can be replaced by an alternative.
>
> Developers would know the inputs and outputs of their contributions clearly
> and they would contribute to the layer (preprocessing, algorithm, etc.)
> they feel comfortable with.
>
> Mahout has tools for nearly all of these steps listed here, but personally
> when I use Mahout (and I’ve been using it for a long time), I feel lost in
> the steps I should follow.
>
> Moreover, the refactoring could eliminate duplicate data structures and
> standardize on Mahout matrices where available. All similarity measures
> should operate on vectors, for example.
>
> An illustrative example: in our lab, we implemented an HBase-backed Mahout
> Matrix, which we use in projects where online algorithms operate on large
> data and learn a parameter matrix (one needs this for matrix
> factorization based recommenders). The parameter matrix then becomes an
> input for the live system. This refactoring cascaded, and we replaced the
> underlying data structures of the Recommender DataModel with a persistent
> matrix.
>
> Now:
>
>
>    - Everyone knows that any dataset should be in Mahout matrix format, and
>    applies the appropriate preprocessing, or writes it.
>    - We can use different recommenders interchangeably.
>    - Any optimization of matrix operations applies everywhere.
>    - Different people can work on different parts (evaluation, model
>    optimization, recommender algorithms) without bothering others.
>
> Apart from all this, I should say that I am always eager to contribute to
> Mahout, as some of the committers already know.
>
> Best Regards
>
> On Tue, Mar 26, 2013 at 5:23 PM, Isabel Drost <isa...@apache.org> wrote:
>
>> On Tue, Mar 26, 2013 at 3:59 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>> > I believe the GSOC proposal for Mentors is due soon, so if someone is
>> > doing it, they better hop on comdev ASAP and submit.
>> >
>>
>> For more information also check <http://community.apache.org/gsoc.html> -
>> in particular the "for mentors" bit of the page.
>>
>>
>> Isabel
>>
>
>
>
> --
> Gokhan
