Hi Gokhan,

I like the idea, but I'm not sure whether it's completely feasible for
all parts of Mahout. A lot of jobs need a little more than a matrix, for
example an additional dictionary for text-based jobs.

In the collaborative filtering code, we already have a common input
format: all recommenders can work with text files that have one
(user,item,rating) triple per line.
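
To illustrate the triple-per-line format, here is a minimal plain-Java
sketch of reading such lines. It assumes comma-separated fields with
numeric user and item IDs; the class names Preference and
PreferenceParser are made up for this example and are not Mahout classes.

```java
import java.util.ArrayList;
import java.util.List;

/** One (user, item, rating) triple. Hypothetical helper type, not a Mahout class. */
final class Preference {
    final long userId;
    final long itemId;
    final float rating;

    Preference(long userId, long itemId, float rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }
}

/** Parses text lines of the form "user,item,rating" (comma delimiter is an assumption). */
final class PreferenceParser {

    static Preference parseLine(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected user,item,rating but got: " + line);
        }
        return new Preference(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                Float.parseFloat(fields[2].trim()));
    }

    /** Parses every non-blank line, preserving input order. */
    static List<Preference> parseAll(List<String> lines) {
        List<Preference> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                result.add(parseLine(line));
            }
        }
        return result;
    }
}
```

A line such as 1,101,5.0 would then be read as user 1 rating item 101
with 5.0.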

Internally the Hadoop stuff works on vectors, which are created by the
PreparePreferenceMatrixJob, but we found it easier to use the textual
format as input for the jobs.
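
As a rough picture of what "working on vectors" means here, the sketch
below groups the textual triples into one sparse row per user. It is
plain Java with nested maps standing in for Mahout vectors, it assumes
comma-separated lines, and it is not the actual logic of
PreparePreferenceMatrixJob; the class name PreferenceRows is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: turns "user,item,rating" lines into sparse per-user rows. */
final class PreferenceRows {

    /**
     * Returns a map from user ID to that user's sparse row (item ID -> rating).
     * A later rating for the same (user, item) pair overwrites an earlier one.
     */
    static Map<Long, Map<Long, Float>> toUserRows(Iterable<String> lines) {
        Map<Long, Map<Long, Float>> rows = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split(",");
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected user,item,rating but got: " + line);
            }
            long user = Long.parseLong(fields[0].trim());
            long item = Long.parseLong(fields[1].trim());
            float rating = Float.parseFloat(fields[2].trim());
            rows.computeIfAbsent(user, u -> new TreeMap<>()).put(item, rating);
        }
        return rows;
    }
}
```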

So in summary, I think your refactoring is a good idea, but you should
choose a particular part of Mahout to start with, maybe by creating an
easy-to-use pipeline for LDA.
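
One small piece such a pipeline would need is the key conversion from
your step 3. Below is a minimal plain-Java sketch that assigns
consecutive int IDs to textual document keys and keeps the reverse
mapping, so results can be traced back to the original documents. The
class KeyIndexer is hypothetical and does not use Hadoop Writables; it
only shows the bookkeeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps textual document keys to consecutive int IDs and back. */
final class KeyIndexer {
    private final Map<String, Integer> idsByKey = new HashMap<>();
    private final List<String> keysById = new ArrayList<>();

    /** Returns the ID for a key, assigning the next free ID on first sight. */
    int idFor(String key) {
        Integer existing = idsByKey.get(key);
        if (existing != null) {
            return existing;
        }
        int id = keysById.size();
        idsByKey.put(key, id);
        keysById.add(key);
        return id;
    }

    /** Returns the original key for a previously assigned ID. */
    String keyFor(int id) {
        return keysById.get(id);
    }

    /** Number of distinct keys seen so far. */
    int size() {
        return keysById.size();
    }
}
```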

Best,
Sebastian

On 26.03.2013 21:35, Gokhan Capan wrote:
> I am moving my email that I wrote to Call to Action upon request.
> 
> I'll start with an example that I experience when I use Mahout, and list my
> humble suggestions.
> 
> When I try to run Latent Dirichlet Allocation for topic discovery, here are
> the steps to follow:
> 
> 1- First I use seq2sparse to convert text to vectors. The output is Text,
> VectorWritable pairs. (If I have a CSV data file, which is understandable,
> with lines of id, text pairs, I need to develop my own tool to convert
> it to vectors.)
> 
> 2- I run LDA on data I transformed, but it doesn’t work, because LDA needs
> IntWritable, VectorWritable pairs.
> 
> 3- I convert Text keys to IntWritable ones with a custom tool.
> 
> 4- Then I run LDA, and to see the results, I need to run vectordump with
> the sort flag (it usually throws OutOfMemoryError). An ldadump tool does not
> exist. What I see is fairly different from clusterdump results, so I spend
> some time understanding what it means. (And I need to know that a
> vectordump tool exists to see the results.)
> 
> 5- After running LDA, when I have a document that I want to assign to a
> topic, there is no way, or none that I am aware of, to use my learned LDA
> model to assign this document to a topic.
> 
> I can give further examples, but I believe this will make my point clear.
> 
> 
> Would you consider refactoring Mahout, so that the project follows a clear,
> layered structure for all algorithms, and documenting it?
> 
> IMO the knowledge discovery process has a certain path, and Mahout can define
> rules that would bind developers and guide users. For example:
> 
> 
>    - All algorithms take Mahout matrices as input and output.
>    - All preprocessing tools should be generic enough that they produce
>    appropriate input for Mahout algorithms.
>    - All algorithms should output a model that users can use beyond
>    training and testing.
>    - Tools that dump results should follow a strictly defined format
>    suggested by the community.
>    - All similar kinds of algorithms should use the same evaluation tools.
>    - ...
> 
> There may be separated layers: preprocessing layer, algorithms layer,
> evaluation layer, and so on.
> 
> This way users would be aware of the steps they need to perform, and one
> step can be replaced by an alternative.
> 
> Developers would contribute to the layer they feel comfortable with, and
> would satisfy the expected input and output, to preserve the integrity.
> 
> Mahout has tools for nearly all of these layers, but personally when I use
> Mahout (and I’ve been using it for a long time), I feel lost in the steps I
> should follow.
> 
> Moreover, the refactoring may eliminate duplicate data structures, and
> stick to Mahout matrices if available. All similarity measures operate on
> Mahout Vectors, for example.
> 
> We, in the lab and in our company, do some of that. An example:
> 
> We implemented an HBase backed Mahout Matrix, which we use for our projects
> where online learning algorithms operate on large input and learn a big
> parameter matrix (one needs this for matrix factorization based
> recommenders). Then the persistent parameter matrix becomes an input for
> the live system. Then we used the same matrix implementation as the
> underlying data store of Recommender DataModels. This was advantageous in
> many ways:
> 
>    - Everyone knows that any dataset should be in Mahout matrix format, and
>    applies appropriate preprocessing, or writes one
>    - We can use different recommenders interchangeably
>    - Any optimization on matrix operations applies everywhere
>    - Different people can work on different parts (evaluation, model
>    optimization, recommender algorithms) without bothering others
> 
> Apart from all this, I should say that I am always eager to contribute to
> Mahout, as some of the committers already know.
> 
> Best Regards
> 
> Gokhan
> 
