Re: Mahout Suggestions - Refactoring Effort

Ted Dunning Tue, 26 Mar 2013 23:38:50 -0700

Can you post a list of those patches?

I haven't been tracking carefully and unless I have a moment when the email
comes through (<10% chance lately) then I lose track.


On Wed, Mar 27, 2013 at 7:30 AM, Marty Kube <[email protected]> wrote:

> So I'd like to continue to improve the RF classifier code. I've been
> posting patches along the lines of the refactoring discussed here. The
> patches are not being looked at. Someone should be considering patches in
> this area.  Maybe I could handle that :-)
>
>
> Sent from my iPhone
>
> On Mar 27, 2013, at 12:14 AM, Sebastian Schelter <[email protected]> wrote:
>
> > Totally agree on that. The impact of making Mahout more usable is much
> > higher than that of adding a new algorithm.
> >
> > On 27.03.2013 05:41, Ted Dunning wrote:
> >> It is critically important.
> >>
> >> On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <
> >> [email protected]> wrote:
> >>
> >>> IMHO usability is really important.  I've posted a couple of patches
> >>> recently around making the RF classifiers easier to use.  I found
> >>> myself working on consistent data format and command line option
> >>> support.  It's not glamorous but it's important.
> >>>
> >>>
> >>> On 3/26/2013 8:26 PM, Ted Dunning wrote:
> >>>
> >>>> Gokhan,
> >>>>
> >>>> I think that the general drift of your recommendation is an excellent
> >>>> suggestion and it is something that we have wrestled with a lot over
> >>>> time.  The recommendations side of the house has more coherence in
> >>>> this matter than other parts largely because there was a clear flow
> >>>> early on.
> >>>>
> >>>> Now, however, the flow is becoming more clear for non-recommendation
> >>>> parts of the system.
> >>>>
> >>>> - we have 2-3 external kinds of input.  These include text and
> >>>> matrices.  Text comes in two major forms, those being text in files
> >>>> with unspecified separators and text in Lucene/Solr indexes.  Matrices
> >>>> come in several forms including triples, CSV files, binary matrices
> >>>> and sequence files of vectors.
> >>>>
> >>>> - there are currently only a few ways to convert text and external
> >>>> data to matrices.  The two most prominent are dictionary based and
> >>>> hashed encoding.  Hashed encoding is currently not as invertible as it
> >>>> should be.  Dictionary based has the virtue of being invertible, but
> >>>> hashed encoding has considerably more generality.  We have almost no
> >>>> support for multiple fields in dictionary based encoding.
> >>>>
> >>>> - good conversion backwards and forwards depends on having schema
> >>>> information that we don't retain or specify well.
> >>>>
> >>>> - knowledge discovery pathways need more flexibility than
> >>>> recommendation pathways regarding input and visualization.
> >>>>
> >>>> - key knowledge discovery pathways that I know about include (a) input
> >>>> summarization, (b) vectorization, (c) unsupervised analysis such as
> >>>> LDA, LLL, clustering, SVD, (d) supervised training such as SGD, Naive
> >>>> Bayes and random forest, and (e) visualization of results
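
The dictionary-versus-hashed trade-off described in the list above can be
sketched in a few lines of plain Java.  This is only an illustration under
stated assumptions: it does not use Mahout's own encoder classes, and the
names EncodingSketch, DictionaryEncoder and hashedIndex are hypothetical.
The dictionary side can map an index back to its term; the hashed side
needs no dictionary and handles several fields by prefixing the field name,
but it cannot be inverted without extra bookkeeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not Mahout classes.
final class EncodingSketch {

    /** Dictionary encoding: each new term gets the next free index. */
    static final class DictionaryEncoder {
        private final Map<String, Integer> indexOf = new HashMap<>();
        private final List<String> termAt = new ArrayList<>();

        /** Returns the index of the term, assigning a new one if needed. */
        int encode(String term) {
            Integer existing = indexOf.get(term);
            if (existing != null) {
                return existing;
            }
            int index = termAt.size();
            indexOf.put(term, index);
            termAt.add(term);
            return index;
        }

        /** Invertible: the index leads straight back to the term. */
        String decode(int index) {
            return termAt.get(index);
        }
    }

    /**
     * Hashed encoding: no dictionary and a fixed vector width.
     * Prefixing with the field name lets several fields share one vector.
     * There is no decode(): different terms may collide on one index.
     */
    static int hashedIndex(String field, String term, int numFeatures) {
        int h = (field + ":" + term).hashCode();
        return Math.floorMod(h, numFeatures);
    }

    public static void main(String[] args) {
        DictionaryEncoder dict = new DictionaryEncoder();
        int i = dict.encode("mahout");
        System.out.println(i + " -> " + dict.decode(i));
        System.out.println(hashedIndex("title", "mahout", 1024));
        System.out.println(hashedIndex("body", "mahout", 1024));
    }
}
```

Making the hashed side "as invertible as it should be" would mean keeping
some record of which terms landed on which index, which is exactly the
schema information the next bullet says is not retained.
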
> >>>>
> >>>> I see that the major problems in Mahout are what Gokhan said, but
> >>>> with a few extras
> >>>>
> >>>> 1) as Gokhan said, the exploratory pathways are inconsistent
> >>>>
> >>>> 2) I think that our visualization pathways are also hideous
> >>>>
> >>>> 3) I think that we need a good document format with a reasonable
> >>>> schema.  Rather than create such a thing, I would nominate Lucene/Solr
> >>>> indexes as a first class object in Mahout.
> >>>>
> >>>> 4) our current command lines with all the (many) different options
> >>>> with incompatible conventions are a bit of a shambles
> >>>>
> >>>> Expressed this way, I think that these usability issues are fixable.
> >>>>
> >>>> What does everybody else think?  Would this leave us with a
> >>>> significantly better system?
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <[email protected]> wrote:
> >>>>
> >>>>> I am moving the email that I wrote to the Call to Action thread here,
> >>>>> upon request.
> >>>>>
> >>>>> I'll start with an example of what I experience when I use Mahout,
> >>>>> and then list my humble suggestions.
> >>>>>
> >>>>> When I try to run Latent Dirichlet Allocation for topic discovery,
> >>>>> here are the steps to follow:
> >>>>>
> >>>>> 1- First I use seq2sparse to convert text to vectors. The output is
> >>>>> Text, VectorWritable pairs. (If I have a CSV data file with lines of
> >>>>> id, text pairs, which is a reasonable thing to have, I need to
> >>>>> develop my own tool to convert it to vectors.)
> >>>>>
> >>>>> 2- I run LDA on the data I transformed, but it doesn’t work, because
> >>>>> LDA needs IntWritable, VectorWritable pairs.
> >>>>>
> >>>>> 3- I convert Text keys to IntWritable ones with a custom tool.
> >>>>>
> >>>>> 4- Then I run LDA, and to see the results, I need to run vectordump
> >>>>> with the sort flag (it usually throws OutOfMemoryError). An ldadump
> >>>>> tool does not exist. What I see is fairly different from clusterdump
> >>>>> results, so I spend some time understanding what it means. (And I
> >>>>> need to know that a vectordump tool exists to see the results.)
> >>>>>
> >>>>> 5- After running LDA, when I have a document that I want to assign to
> >>>>> a topic, there is no way, or none that I am aware of, to use my
> >>>>> learned LDA model to assign this document to a topic.
> >>>>>
> >>>>> I can give further examples, but I believe this will make my point
> >>>>> clear.
> >>>>>
> >>>>>
> >>>>> Would you consider refactoring Mahout, so that the project follows a
> >>>>> clear, layered structure for all algorithms, and documenting it?
> >>>>>
> >>>>> IMO the Knowledge Discovery process has a certain path, and Mahout
> >>>>> can define rules that would constrain developers and guide users. For
> >>>>> example:
> >>>>>
> >>>>>
> >>>>>    - All algorithms take Mahout matrices as input and output.
> >>>>>    - All preprocessing tools should be generic enough that they
> >>>>>    produce appropriate input for Mahout algorithms.
> >>>>>    - All algorithms should output a model that users can use beyond
> >>>>>    training and testing
> >>>>>    - Tools that dump results should follow a strictly defined format
> >>>>>    suggested by the community
> >>>>>    - All similar kinds of algorithms should use the same evaluation
> >>>>>    tools
> >>>>>    - ...
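
The rules listed above can be sketched as a handful of small interfaces,
one per layer.  This is a minimal sketch under stated assumptions, not a
proposal for Mahout's actual API: Vectorizer, Trainer, Model and
LayersSketch are hypothetical names, plain double[] arrays stand in for
Mahout vectors and matrices, and the nearest-centroid trainer exists only
so that there is something to plug in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interfaces for illustration; none of these exist in Mahout.
final class LayersSketch {

    /** Preprocessing layer: turns one raw record into a fixed-width vector. */
    interface Vectorizer<R> {
        double[] vectorize(R record);
    }

    /** What every algorithm hands back: something usable on unseen input. */
    interface Model {
        int predict(double[] features);
    }

    /** Algorithms layer: vectors in, model out. */
    interface Trainer {
        Model train(List<double[]> vectors, List<Integer> labels);
    }

    /** Evaluation layer: one tool that works against any Model. */
    static double accuracy(Model model, List<double[]> vectors, List<Integer> labels) {
        int correct = 0;
        for (int i = 0; i < vectors.size(); i++) {
            if (model.predict(vectors.get(i)) == labels.get(i)) {
                correct++;
            }
        }
        return vectors.isEmpty() ? 0.0 : (double) correct / vectors.size();
    }

    /** Toy trainer: predicts the label whose mean vector is closest. */
    static final class NearestCentroidTrainer implements Trainer {
        @Override
        public Model train(List<double[]> vectors, List<Integer> labels) {
            Map<Integer, double[]> centroids = new HashMap<>();
            Map<Integer, Integer> counts = new HashMap<>();
            for (int i = 0; i < vectors.size(); i++) {
                double[] v = vectors.get(i);
                double[] sum = centroids.computeIfAbsent(labels.get(i), k -> new double[v.length]);
                for (int j = 0; j < v.length; j++) {
                    sum[j] += v[j];
                }
                counts.merge(labels.get(i), 1, Integer::sum);
            }
            // Turn the per-label sums into means.
            for (Map.Entry<Integer, double[]> e : centroids.entrySet()) {
                int n = counts.get(e.getKey());
                double[] c = e.getValue();
                for (int j = 0; j < c.length; j++) {
                    c[j] /= n;
                }
            }
            return features -> {
                int best = -1;
                double bestDist = Double.POSITIVE_INFINITY;
                for (Map.Entry<Integer, double[]> e : centroids.entrySet()) {
                    double dist = 0.0;
                    for (int j = 0; j < features.length; j++) {
                        double diff = features[j] - e.getValue()[j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) {
                        bestDist = dist;
                        best = e.getKey();
                    }
                }
                return best;
            };
        }
    }

    public static void main(String[] args) {
        Vectorizer<String> byLength = s -> new double[] {s.length()};
        String[] docs = {"a", "bb", "cccccccc", "ddddddddd"};
        int[] docLabels = {0, 0, 1, 1};
        List<double[]> x = new ArrayList<>();
        List<Integer> y = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            x.add(byLength.vectorize(docs[i]));
            y.add(docLabels[i]);
        }
        Model model = new NearestCentroidTrainer().train(x, y);
        System.out.println(model.predict(byLength.vectorize("eee")));
        System.out.println(accuracy(model, x, y));
    }
}
```

Because the evaluation code depends only on Model, and the trainer only on
vectors, any one step can be swapped for an alternative without touching
the others.
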
> >>>>>
> >>>>> There may be separated layers: preprocessing layer, algorithms layer,
> >>>>> evaluation layer, and so on.
> >>>>>
> >>>>> This way users would be aware of the steps they need to perform, and
> >>>>> one step can be replaced by an alternative.
> >>>>>
> >>>>> Developers would contribute to the layer they feel comfortable with,
> >>>>> and would satisfy the expected input and output, to preserve the
> >>>>> integrity.
> >>>>>
> >>>>> Mahout has tools for nearly all of these layers, but personally when
> >>>>> I use Mahout (and I’ve been using it for a long time), I feel lost in
> >>>>> the steps I should follow.
> >>>>>
> >>>>> Moreover, the refactoring may eliminate duplicate data structures,
> >>>>> and stick to Mahout matrices if available. All similarity measures
> >>>>> operate on Mahout Vectors, for example.
> >>>>>
> >>>>> We, in the lab and in our company, do some of that. An example:
> >>>>>
> >>>>> We implemented an HBase backed Mahout Matrix, which we use for our
> >>>>> projects where online learning algorithms operate on large input and
> >>>>> learn a big parameter matrix (one needs this for matrix factorization
> >>>>> based recommenders). Then the persistent parameter matrix becomes an
> >>>>> input for the live system. Then we used the same matrix
> >>>>> implementation as the underlying data store of Recommender
> >>>>> DataModels. This was advantageous in many ways:
> >>>>>
> >>>>>    - Everyone knows that any dataset should be in Mahout matrix
> >>>>>    format, and applies appropriate preprocessing, or writes one
> >>>>>    - We can use different recommenders interchangeably
> >>>>>    - Any optimization of matrix operations applies everywhere
> >>>>>    - Different people can work on different parts (evaluation, model
> >>>>>    optimization, recommender algorithms) without bothering others
> >>>>>
> >>>>> Apart from all this, I should say that I am always eager to
> >>>>> contribute to Mahout, as some of the committers already know.
> >>>>>
> >>>>> Best Regards
> >>>>>
> >>>>> Gokhan
> >
>
