Totally agree on that. The impact of making Mahout more usable is much higher than that of adding a new algorithm.
On 27.03.2013 05:41, Ted Dunning wrote:
> It is critically important.
>
> On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <[email protected]> wrote:
>
>> IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data formats and command-line option support. It's not glamorous, but it's important.
>>
>> On 3/26/2013 8:26 PM, Ted Dunning wrote:
>>
>>> Gokhan,
>>>
>>> I think that the general drift of your recommendation is an excellent suggestion, and it is something that we have wrestled with a lot over time. The recommendations side of the house has more coherence in this matter than other parts, largely because there was a clear flow early on.
>>>
>>> Now, however, the flow is becoming clearer for the non-recommendation parts of the system.
>>>
>>> - We have 2-3 external kinds of input. These include text and matrices. Text comes in two major forms: text in files with unspecified separators, and text in Lucene/Solr indexes. Matrices come in several forms, including triples, CSV files, binary matrices, and sequence files of vectors.
>>>
>>> - There are currently only a few ways to convert text and external data to matrices. The two most prominent are dictionary-based and hashed encoding. Hashed encoding is currently not as invertible as it should be. Dictionary-based encoding has the virtue of being invertible, but hashed encoding has considerably more generality. We have almost no support for multiple fields in dictionary-based encoding.
>>>
>>> - Good conversion backwards and forwards depends on having schema information that we don't retain or specify well.
>>>
>>> - Knowledge discovery pathways need more flexibility than recommendation pathways regarding input and visualization.
>>>
>>> - Key knowledge discovery pathways that I know about include (a) input summarization, (b) vectorization, (c) unsupervised analysis such as LDA, LLL, clustering, and SVD, (d) supervised training such as SGD, Naive Bayes, and random forests, and (e) visualization of results.
>>>
>>> I see that the major problems in Mahout are what Gokhan said, but with a few extras:
>>>
>>> 1) as Gokhan said, the exploratory pathways are inconsistent
>>>
>>> 2) I think that our visualization pathways are also hideous
>>>
>>> 3) I think that we need a good document format with a reasonable schema. Rather than create such a thing, I would nominate Lucene/Solr indexes as a first-class object in Mahout.
>>>
>>> 4) our current command lines, with their many different options and incompatible conventions, are a bit of a shambles
>>>
>>> Expressed this way, I think that these usability issues are fixable.
>>>
>>> What does everybody else think? Would this leave us with a significantly better system?
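As a concrete illustration of the dictionary-vs-hashed point in Ted's second bullet above, here is a minimal sketch of the hashed path, assuming the encoders in org.apache.mahout.vectorizer.encoders as of the 0.7/0.8 line; the class name and field names other than Mahout's own are made up for illustration.

import java.util.Arrays;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingSketch {
  public static void main(String[] args) {
    // One encoder per field; all of them hash into the same fixed-size vector,
    // so multi-field documents need no shared dictionary pass over the corpus.
    FeatureVectorEncoder subjectEncoder = new StaticWordValueEncoder("subject");
    FeatureVectorEncoder bodyEncoder = new StaticWordValueEncoder("body");
    bodyEncoder.setProbes(2); // a couple of probes soften hash collisions

    Vector doc = new RandomAccessSparseVector(100000); // cardinality fixed up front

    for (String token : Arrays.asList("mahout", "usability")) {
      subjectEncoder.addToVector(token, doc);
    }
    for (String token : Arrays.asList("command", "line", "options", "usability")) {
      bodyEncoder.addToVector(token, doc);
    }

    System.out.println(doc.getNumNondefaultElements() + " non-zero features");
  }
}

Because every field hashes into one fixed-cardinality vector, multiple fields come essentially for free, which is exactly where the dictionary-based path is weak; the price is the invertibility Ted mentions, since the hashing is one-way (the encoders can record a trace dictionary, if memory serves, but that is only a partial remedy).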
>>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <[email protected]> wrote:
>>>
>>>> I am moving my email that I wrote to Call to Action upon request.
>>>>
>>>> I'll start with an example that I experience when I use Mahout, and then list my humble suggestions.
>>>>
>>>> When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow:
>>>>
>>>> 1- First I use seq2sparse to convert text to vectors. The output is Text, VectorWritable pairs. (If I have a CSV data file, which is understandable, with lines of id, text pairs, I need to develop my own tool to convert it to vectors.)
>>>>
>>>> 2- I run LDA on the data I transformed, but it doesn't work, because LDA needs IntWritable, VectorWritable pairs.
>>>>
>>>> 3- I convert the Text keys to IntWritable ones with a custom tool (see the sketch after these steps).
>>>>
>>>> 4- Then I run LDA, and to see the results I need to run vectordump with the sort flag (it usually throws an OutOfMemoryError). An ldadump tool does not exist. What I see is fairly different from clusterdump results, so I spend some time understanding what it means. (And I need to know that a vectordump tool exists at all in order to see the results.)
>>>>
>>>> 5- After running LDA, when I have a new document that I want to assign to a topic, there is no way (or none I am aware of) to use my learned LDA model to assign this document to a topic.
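Steps 2 and 3 are where the key-format mismatch bites. A sketch of the kind of one-off conversion tool Gokhan describes might look like the following, written against the Hadoop 1.x-era SequenceFile API; the class name and paths are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

/** Rewrites a seq2sparse output (Text -> VectorWritable) as IntWritable -> VectorWritable for LDA. */
public class TextKeyToIntKey {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. a part file under .../tf-vectors
    Path out = new Path(args[1]);  // becomes the input for the LDA driver

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        IntWritable.class, VectorWritable.class);
    try {
      Text docId = new Text();
      VectorWritable vector = new VectorWritable();
      int row = 0;
      while (reader.next(docId, vector)) {
        // Assign sequential integer row ids; keep the docId -> row mapping somewhere
        // if topics need to be mapped back to documents afterwards.
        writer.append(new IntWritable(row++), vector);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

The irritation, of course, is that every user ends up writing some variant of this glue instead of Mahout offering one consistent key convention end to end.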
>>>> I can give further examples, but I believe this makes my point clear.
>>>>
>>>> Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting it?
>>>>
>>>> IMO the knowledge discovery process has a well-defined path, and Mahout could define rules that would constrain developers and guide users. For example:
>>>>
>>>> - All algorithms take Mahout matrices as input and output.
>>>> - All preprocessing tools should be generic enough to produce appropriate input for Mahout algorithms.
>>>> - All algorithms should output a model that users can use beyond training and testing.
>>>> - Tools that dump results should follow a strictly defined format agreed on by the community.
>>>> - All similar kinds of algorithms should use the same evaluation tools.
>>>> - ...
>>>>
>>>> There could be separate layers: a preprocessing layer, an algorithms layer, an evaluation layer, and so on (see the rough sketch after this thread).
>>>>
>>>> This way users would be aware of the steps they need to perform, and any one step could be replaced by an alternative.
>>>>
>>>> Developers would contribute to the layer they feel comfortable with and would satisfy the expected inputs and outputs, preserving the integrity of the whole.
>>>>
>>>> Mahout has tools for nearly all of these layers, but personally, when I use Mahout (and I've been using it for a long time), I feel lost in the steps I should follow.
>>>>
>>>> Moreover, the refactoring could eliminate duplicate data structures and stick to Mahout matrices where available. All similarity measures operate on Mahout Vectors, for example.
>>>>
>>>> We, in the lab and in our company, do some of this. An example:
>>>>
>>>> We implemented an HBase-backed Mahout Matrix, which we use for projects where online learning algorithms operate on large input and learn a big parameter matrix (one needs this for matrix-factorization-based recommenders). The persistent parameter matrix then becomes an input for the live system. We also used the same matrix implementation as the underlying data store of Recommender DataModels. This was advantageous in many ways:
>>>>
>>>> - Everyone knows that any dataset should be in Mahout matrix format, and either applies the appropriate preprocessing or writes it.
>>>> - We can use different recommenders interchangeably.
>>>> - Any optimization of matrix operations applies everywhere.
>>>> - Different people can work on different parts (evaluation, model optimization, recommender algorithms) without bothering others.
>>>>
>>>> Apart from all this, I should say that I am always eager to contribute to Mahout, as some of the committers already know.
>>>>
>>>> Best Regards
>>>>
>>>> Gokhan
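For what it's worth, the layering Gokhan proposes could be pinned down with a handful of small contracts. The interfaces below are purely hypothetical (they do not exist in Mahout); they only illustrate how a preprocessing layer, an algorithms layer, and an evaluation layer could all meet at Mahout's Matrix and Vector types.

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Hypothetical layer contracts -- an illustration of the proposal, not existing Mahout APIs.

/** Preprocessing layer: raw input (text files, CSV, a Lucene index, ...) in, a Mahout matrix out. */
interface Preprocessor<RAW> {
  Matrix vectorize(RAW input);
}

/** A trained model that stays usable beyond training/testing, e.g. assigning a new document to a topic. */
interface Model {
  Vector apply(Vector instance);
}

/** Algorithms layer: every learner consumes a Matrix and yields a reusable Model. */
interface Learner<M extends Model> {
  M train(Matrix data);
}

/** Evaluation layer: shared by all algorithms of the same kind. */
interface Evaluator<M extends Model> {
  double evaluate(M model, Matrix heldOut);
}

/** With those contracts, the knowledge-discovery path reads as a straight pipeline. */
class Pipeline<RAW, M extends Model> {
  private final Preprocessor<RAW> preprocessor;
  private final Learner<M> learner;
  private final Evaluator<M> evaluator;

  Pipeline(Preprocessor<RAW> preprocessor, Learner<M> learner, Evaluator<M> evaluator) {
    this.preprocessor = preprocessor;
    this.learner = learner;
    this.evaluator = evaluator;
  }

  double run(RAW rawTrainingData, Matrix heldOut) {
    M model = learner.train(preprocessor.vectorize(rawTrainingData));
    return evaluator.evaluate(model, heldOut);
  }
}

Whether the contracts end up looking exactly like this matters less than the property Gokhan asks for: any step can be swapped for an alternative, and contributors to one layer do not have to understand the others.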
