What I actually propose is defining a set of design principles for Mahout.

There may be a different set per kind of algorithm: perhaps one for
unsupervised algorithms, one for supervised ones, and one for dyadic
prediction algorithms (recommenders). I am not sure yet.

I consider myself a 'power user' of Mahout and a Mahout enthusiast,
perhaps without the committers knowing :) I talk to many people who reach
out to me about using Mahout in their internal commercial projects, I
mentor some undergrads who want to use Mahout for their school projects,
and I contribute myself from time to time.

I actually have a lot of pre- and post-processing tools to make Mahout
easier to use, and if there is an effort to make the architecture more
precise, I will definitely join it.

This is what I have collected from people's feedback, and it saddens me
to see excellent algorithms go unused just because the usage path is
unclear.


On Tue, Mar 26, 2013 at 11:56 PM, Sebastian Schelter <[email protected]> wrote:

> Hi Gokhan,
>
> I like the idea, but I'm not sure whether it's completely feasible for
> all parts of Mahout. A lot of jobs need a little more than a matrix, for
> example an additional dictionary for text-based stuff.
>
> In the collaborative filtering code, we already have a common input
> format: All recommenders can work with textual files that have a
> (user,item,rating) triple per line.
>
> Internally the Hadoop stuff works on vectors, which are created by the
> PreparePreferenceMatrixJob, but we found it easier to use the textual
> format as input for the jobs.
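>
> For reference, here is a minimal sketch of consuming that triple format
> with the Taste API (the file name, neighborhood size and similarity are
> just placeholder choices, not recommendations):
>
> import java.io.File;
> import java.util.List;
> import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
> import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
> import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
> import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
> import org.apache.mahout.cf.taste.model.DataModel;
> import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
> import org.apache.mahout.cf.taste.recommender.RecommendedItem;
> import org.apache.mahout.cf.taste.recommender.Recommender;
> import org.apache.mahout.cf.taste.similarity.UserSimilarity;
>
> public class TripleFormatExample {
>   public static void main(String[] args) throws Exception {
>     // ratings.csv holds one "user,item,rating" triple per line
>     DataModel model = new FileDataModel(new File("ratings.csv"));
>     UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>     UserNeighborhood neighborhood =
>         new NearestNUserNeighborhood(10, similarity, model);
>     Recommender recommender =
>         new GenericUserBasedRecommender(model, neighborhood, similarity);
>     List<RecommendedItem> topItems = recommender.recommend(1L, 5);
>     System.out.println(topItems);
>   }
> }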
>
> So in summary, I think your refactoring is a good idea, but you should
> choose a particular part of Mahout to start with, maybe by creating an
> easy-to-use pipeline for LDA.
>
> Best,
> Sebastian
>
> On 26.03.2013 21:35, Gokhan Capan wrote:
> > Upon request, I am moving here the email I wrote in the Call to Action
> > thread.
> >
> > I'll start with an example I run into when I use Mahout, and then list
> > my humble suggestions.
> >
> > When I try to run Latent Dirichlet Allocation for topic discovery, here
> > are the steps to follow:
> >
> > 1- First I use seq2sparse to convert text to vectors. The output is
> > Text/VectorWritable pairs. (If I have a CSV data file, which is
> > understandable, with lines of id,text pairs, I need to develop my own
> > tool to convert it to vectors.)
> >
> > 2- I run LDA on the data I transformed, but it doesn't work, because
> > LDA needs IntWritable/VectorWritable pairs.
> >
> > 3- I convert the Text keys to IntWritable ones with a custom tool (a
> > sketch of this kind of conversion appears after this list).
> >
> > 4- Then I run LDA, and to see the results, I need to run vectordump
> > with the sort flag (it usually throws an OutOfMemoryError). An ldadump
> > tool does not exist. What I see is fairly different from clusterdump
> > results, so I spend some time figuring out what it means. (And I need to
> > know in the first place that a vectordump tool exists to see the
> > results.)
> >
> > 5- After running LDA, when I have a document that I want to assign to a
> > topic, there is no way (or none that I am aware of) to use my learned
> > LDA model to assign this document to a topic.
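> >
> > (Purely for illustration, here is a sketch of the kind of key-conversion
> > glue that step 3 currently forces on the user; the class name is
> > hypothetical and this is not the exact tool I use:)
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> > import org.apache.hadoop.io.IntWritable;
> > import org.apache.hadoop.io.SequenceFile;
> > import org.apache.hadoop.io.Text;
> > import org.apache.mahout.math.VectorWritable;
> >
> > public class TextKeysToIntKeys {
> >   public static void main(String[] args) throws Exception {
> >     Configuration conf = new Configuration();
> >     FileSystem fs = FileSystem.get(conf);
> >     SequenceFile.Reader reader =
> >         new SequenceFile.Reader(fs, new Path(args[0]), conf);
> >     SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
> >         new Path(args[1]), IntWritable.class, VectorWritable.class);
> >     Text key = new Text();
> >     VectorWritable value = new VectorWritable();
> >     int docId = 0;
> >     while (reader.next(key, value)) {
> >       // replace each Text key with a sequential integer document id
> >       writer.append(new IntWritable(docId++), value);
> >     }
> >     reader.close();
> >     writer.close();
> >   }
> > }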
> >
> > I can give further examples, but I believe this will make my point clear.
> >
> >
> > Would you consider refactoring Mahout so that the project follows a
> > clear, layered structure for all algorithms, and documenting it?
> >
> > IMO the knowledge discovery process has a certain path, and Mahout can
> > define rules that would constrain developers and guide users. For example:
> >
> >
> >    - All algorithms take Mahout matrices as input and output.
> >    - All preprocessing tools should be generic enough that they produce
> >    appropriate input for Mahout algorithms.
> >    - All algorithms should output a model that users can use beyond
> >    training and testing.
> >    - Tools that dump results should follow a strictly defined format
> >    suggested by the community.
> >    - All similar kinds of algorithms should use the same evaluation tools.
> >    - ...
> >
> > There may be separate layers: a preprocessing layer, an algorithms
> > layer, an evaluation layer, and so on.
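> >
> > Just to make the idea concrete, the contracts could look roughly like
> > the following (these names and signatures are hypothetical, not an
> > existing Mahout API):
> >
> > import org.apache.hadoop.fs.Path;
> > import org.apache.mahout.math.Matrix;
> >
> > // hypothetical layer contracts, for illustration only
> > interface Preprocessor {
> >   Matrix prepare(Path rawInput);             // preprocessing layer
> > }
> >
> > interface Learner<M> {
> >   M train(Matrix data);                      // algorithms layer, returns a model
> > }
> >
> > interface Evaluator<M> {
> >   double evaluate(M model, Matrix heldOut);  // evaluation layer
> > }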
> >
> > This way users would be aware of the steps they need to perform, and any
> > one step could be replaced by an alternative.
> >
> > Developers would contribute to the layer they feel comfortable with, and
> > would satisfy the expected inputs and outputs to preserve the overall
> > integrity.
> >
> > Mahout has tools for nearly all of these layers, but personally, when I
> > use Mahout (and I've been using it for a long time), I feel lost in the
> > steps I should follow.
> >
> > Moreover, the refactoring could eliminate duplicate data structures and
> > stick to Mahout matrices where available. All similarity measures
> > operate on Mahout Vectors, for example.
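> >
> > A trivial sketch, using the distance measures as an example (the values
> > here are made up):
> >
> > import org.apache.mahout.common.distance.CosineDistanceMeasure;
> > import org.apache.mahout.math.DenseVector;
> > import org.apache.mahout.math.Vector;
> >
> > public class VectorDistanceExample {
> >   public static void main(String[] args) {
> >     Vector a = new DenseVector(new double[] {1.0, 0.0, 2.0});
> >     Vector b = new DenseVector(new double[] {0.5, 1.0, 1.5});
> >     // every DistanceMeasure takes two Mahout Vectors
> >     double d = new CosineDistanceMeasure().distance(a, b);
> >     System.out.println(d);
> >   }
> > }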
> >
> > We, in the lab and in our company, do some of that. An example:
> >
> > We implemented an HBase-backed Mahout Matrix, which we use for projects
> > where online learning algorithms operate on large inputs and learn a big
> > parameter matrix (one needs this for matrix-factorization-based
> > recommenders). The persistent parameter matrix then becomes an input for
> > the live system. We also used the same matrix implementation as the
> > underlying data store of recommender DataModels (a rough sketch of the
> > storage pattern appears after the list below). This was advantageous in
> > many ways:
> >
> >    - Everyone knows that any dataset should be in Mahout matrix format,
> >    and applies appropriate preprocessing, or writes one
> >    - We can use different recommenders interchangeably
> >    - Any optimization on matrix operations applies everywhere
> >    - Different people can work on different parts (evaluation, model
> >    optimization, recommender algorithms) without bothering others
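> >
> > (A rough, hypothetical sketch of the per-cell storage such a matrix can
> > delegate to; this is not our actual implementation, and the table and
> > column family names are made up:)
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.hbase.client.Get;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.client.Put;
> > import org.apache.hadoop.hbase.client.Result;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class HBaseMatrixCells {
> >
> >   private static final byte[] FAMILY = Bytes.toBytes("m");
> >   private final HTable table;
> >
> >   public HBaseMatrixCells(Configuration conf, String tableName)
> >       throws Exception {
> >     this.table = new HTable(conf, tableName);
> >   }
> >
> >   // setQuick-style write: one HBase cell per (row, column) entry
> >   public void set(int row, int column, double value) throws Exception {
> >     Put put = new Put(Bytes.toBytes(row));
> >     put.add(FAMILY, Bytes.toBytes(column), Bytes.toBytes(value));
> >     table.put(put);
> >   }
> >
> >   // getQuick-style read, with 0.0 as the default for missing cells
> >   public double get(int row, int column) throws Exception {
> >     Result result = table.get(new Get(Bytes.toBytes(row)));
> >     byte[] cell = result.getValue(FAMILY, Bytes.toBytes(column));
> >     return cell == null ? 0.0 : Bytes.toDouble(cell);
> >   }
> > }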
> >
> > Apart from all this, I should say that I am always eager to contribute
> > to Mahout, as some of the committers already know.
> >
> > Best Regards
> >
> > Gokhan
> >
>
>


-- 
Gokhan
