Gokhan, I totally agree that we need all of that. Would you mind starting a new thread about this? This thread is great for listing ideas, but it has already become pretty long and it's getting hard to keep track.

On Tue, Mar 26, 2013 at 6:38 PM, Gokhan Capan <gkhn...@gmail.com> wrote:

> Hi,
>
> Would you consider to refactor Mahout, so that the project follows a clear,
> layered structure for all algorithms, and to document it, such as:
>
>    - All algorithms take Mahout matrices as input, and outputs matrices as
>    learned model
>    - All preprocessing tools should be generic enough, so that they produce
>    appropriate inputs for mahout algorithms
>    - All algorithms should output the learned model so that people can use
>    them beyond training and testing
>    - Tools those dump results (e.g. clusterdump) should follow a strictly
>    defined format suggested by community.
>    - Evaluation tools should be generic enough so they can be used by all
>    similar kinds of algorithms.
>    - ...
>
> Users would know the steps they need to perform to use Mahout, and one step
> can be replaced by an alternative.
>
> Developers would know the inputs and outputs of their contributions clearly
> and they would contribute to the layer (preprocessing, algorithm, etc.)
> they feel comfortable with.
>
> Mahout has tools for nearly all of these steps listed here, but personally
> when I use Mahout (and I’ve been using it for a long time), I feel lost in
> the steps I should follow.
>
> Moreover, the refactoring may eliminate duplicate data structures, and
> stick to Mahout matrices if available. All similarity measures should
> operate on vectors, for example.
>
> An illustrating example: In our lab, we implemented an HBase backed Mahout
> Matrix, which we use it for our projects where online algorithms operate on
> large data and learn a parameter matrix (one needs this for matrix
> factorization based recommenders). Then the parameter matrix becomes an
> input for the live system. This refactoring cascaded, and we replaced
> underlying data structures of Recommender DataModel with a persistent
> matrix.
>
> Now:
>
>    - Everyone knows that any dataset should be in Mahout matrix format, and
>    applies appropriate preprocessing, or writes one.
>    - We can use different recommenders interchangeably
>    - Any optimization on matrix operations apply everywhere.
>    - Different people can work on different parts (evaluation, model
>    optimization, recommender algorithms) without bothering others.
>
> Apart from all, I should say that I am always eager to contribute to
> Mahout, as some of committers already know.
>
> Best Regards
>
> On Tue, Mar 26, 2013 at 5:23 PM, Isabel Drost <isa...@apache.org> wrote:
>
>> On Tue, Mar 26, 2013 at 3:59 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>> > I believe the GSOC proposal for Mentors is due soon, so if someone is
>> > doing it, they better hop on comdev ASAP and submit.
>>
>> For more information also check <http://community.apache.org/gsoc.html> -
>> in particular the "for mentors" bit of the page.
>>
>> Isabel
>
> --
> Gokhan
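
To make the layering in the quoted proposal a bit more concrete, here is a rough sketch of how the preprocessing, algorithm and evaluation layers might plug together when everything exchanges matrices. It is only an illustration under stated assumptions: Preprocessor, Algorithm, Evaluator, ColumnMeans and MeanSquaredError are hypothetical names invented for this example, not Mahout classes, and a plain double[][] stands in for a Mahout Matrix so the code stays self-contained.

```java
import java.util.Arrays;

/** Hypothetical sketch of the layered structure; none of these types are Mahout APIs. */
public final class LayeredPipelineSketch {

    /** Preprocessing layer: turns raw input of any type into a matrix. */
    interface Preprocessor<R> {
        double[][] toMatrix(R raw);
    }

    /** Algorithm layer: matrix in, learned model (also a matrix) out. */
    interface Algorithm {
        double[][] train(double[][] input);
    }

    /** Evaluation layer: works for any algorithm whose model is a matrix. */
    interface Evaluator {
        double score(double[][] model, double[][] heldOut);
    }

    /** Toy algorithm: the "model" is a 1 x n matrix holding the column means. */
    static final class ColumnMeans implements Algorithm {
        @Override
        public double[][] train(double[][] input) {
            int cols = input[0].length;
            double[][] model = new double[1][cols];
            for (double[] row : input) {
                for (int j = 0; j < cols; j++) {
                    model[0][j] += row[j] / input.length;
                }
            }
            return model;
        }
    }

    /** Toy evaluator: mean squared error when every held-out row is predicted by model row 0. */
    static final class MeanSquaredError implements Evaluator {
        @Override
        public double score(double[][] model, double[][] heldOut) {
            double sum = 0.0;
            int cells = 0;
            for (double[] row : heldOut) {
                for (int j = 0; j < row.length; j++) {
                    double diff = row[j] - model[0][j];
                    sum += diff * diff;
                    cells++;
                }
            }
            return sum / cells;
        }
    }

    /** Any layer can be swapped for an alternative without touching the others. */
    static <R> double run(Preprocessor<R> pre, Algorithm algo, Evaluator eval, R train, R test) {
        double[][] model = algo.train(pre.toMatrix(train));
        return eval.score(model, pre.toMatrix(test));
    }

    /** Example preprocessor: rows separated by ';', values separated by ','. */
    static double[][] parseRows(String text) {
        return Arrays.stream(text.split(";"))
                .map(r -> Arrays.stream(r.split(",")).mapToDouble(Double::parseDouble).toArray())
                .toArray(double[][]::new);
    }

    public static void main(String[] args) {
        double mse = run(LayeredPipelineSketch::parseRows, new ColumnMeans(),
                new MeanSquaredError(), "1,2;3,4", "2,3;4,3");
        System.out.println("held-out MSE = " + mse);
    }
}
```

The point of the sketch is only that run() never needs to know which preprocessor, algorithm or evaluator it was handed. A persistent matrix such as the HBase-backed one mentioned above would fit the same shape, provided it were exposed through whatever matrix type the layers agree on.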