Totally agree on that. The impact of making Mahout more usable is much higher than that of adding a new algorithm.
On 27.03.2013 05:41, Ted Dunning wrote:
> It is critically important.
>
> On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <[email protected]> wrote:
>
>> IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data formats and command-line option support. It's not glamorous, but it's important.
>>
>> On 3/26/2013 8:26 PM, Ted Dunning wrote:
>>
>>> Gokhan,
>>>
>>> I think that the general drift of your recommendation is an excellent suggestion, and it is something that we have wrestled with a lot over time. The recommendations side of the house has more coherence in this matter than other parts, largely because there was a clear flow early on.
>>>
>>> Now, however, the flow is becoming clearer for the non-recommendation parts of the system.
>>>
>>> - We have 2-3 external kinds of input. These include text and matrices. Text comes in two major forms: text in files with unspecified separators, and text in Lucene/Solr indexes. Matrices come in several forms, including triples, CSV files, binary matrices, and sequence files of vectors.
>>>
>>> - There are currently only a few ways to convert text and external data to matrices. The two most prominent are dictionary-based and hashed encoding. Hashed encoding is currently not as invertible as it should be. Dictionary-based encoding has the virtue of being invertible, but hashed encoding has considerably more generality. We have almost no support for multiple fields in dictionary-based encoding.
>>>
>>> - Good conversion backwards and forwards depends on having schema information that we don't retain or specify well.
>>>
>>> - Knowledge discovery pathways need more flexibility than recommendation pathways regarding input and visualization.
>>>
>>> - Key knowledge discovery pathways that I know about include (a) input summarization, (b) vectorization, (c) unsupervised analysis such as LDA, LLL, clustering, and SVD, (d) supervised training such as SGD, Naive Bayes, and random forests, and (e) visualization of results.
>>>
>>> I see that the major problems in Mahout are what Gokhan said, but with a few extras:
>>>
>>> 1) as Gokhan said, the exploratory pathways are inconsistent
>>>
>>> 2) I think that our visualization pathways are also hideous
>>>
>>> 3) I think that we need a good document format with a reasonable schema. Rather than create such a thing, I would nominate Lucene/Solr indexes as a first-class object in Mahout.
>>>
>>> 4) our current command lines, with their many different options and incompatible conventions, are a bit of a shambles
>>>
>>> Expressed this way, I think that these usability issues are fixable.
>>>
>>> What does everybody else think? Would this leave us with a significantly better system?
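As a concrete illustration of the dictionary-vs-hashed point in Ted's second bullet above, here is a minimal sketch of the hashed path, assuming the encoders in org.apache.mahout.vectorizer.encoders as of the 0.7/0.8 line; the class name and field names other than Mahout's own are made up for illustration.

import java.util.Arrays;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedEncodingSketch {
  public static void main(String[] args) {
    // One encoder per field; all of them hash into the same fixed-size vector,
    // so multi-field documents need no shared dictionary pass over the corpus.
    FeatureVectorEncoder subjectEncoder = new StaticWordValueEncoder("subject");
    FeatureVectorEncoder bodyEncoder = new StaticWordValueEncoder("body");
    bodyEncoder.setProbes(2); // a couple of probes soften hash collisions

    Vector doc = new RandomAccessSparseVector(100000); // cardinality fixed up front

    for (String token : Arrays.asList("mahout", "usability")) {
      subjectEncoder.addToVector(token, doc);
    }
    for (String token : Arrays.asList("command", "line", "options", "usability")) {
      bodyEncoder.addToVector(token, doc);
    }

    System.out.println(doc.getNumNondefaultElements() + " non-zero features");
  }
}

Because every field hashes into one fixed-cardinality vector, multiple fields come essentially for free, which is exactly where the dictionary-based path is weak; the price is the invertibility Ted mentions, since the hashing is one-way (the encoders can record a trace dictionary, if memory serves, but that is only a partial remedy).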
>>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <[email protected]> wrote:
>>>
>>>> I am moving my email that I wrote to Call to Action upon request.
>>>>
>>>> I'll start with an example that I experience when I use Mahout, and then list my humble suggestions.
>>>>
>>>> When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow:
>>>>
>>>> 1- First I use seq2sparse to convert text to vectors. The output is Text, VectorWritable pairs. (If I have a CSV data file, which is understandable, with lines of id, text pairs, I need to develop my own tool to convert it to vectors.)
>>>>
>>>> 2- I run LDA on the data I transformed, but it doesn't work, because LDA needs IntWritable, VectorWritable pairs.
>>>>
>>>> 3- I convert the Text keys to IntWritable ones with a custom tool (see the sketch after these steps).
>>>>
>>>> 4- Then I run LDA, and to see the results I need to run vectordump with the sort flag (it usually throws an OutOfMemoryError). An ldadump tool does not exist. What I see is fairly different from clusterdump results, so I spend some time understanding what it means. (And I need to know that a vectordump tool exists at all in order to see the results.)
>>>>
>>>> 5- After running LDA, when I have a new document that I want to assign to a topic, there is no way (or none I am aware of) to use my learned LDA model to assign this document to a topic.
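Steps 2 and 3 are where the key-format mismatch bites. A sketch of the kind of one-off conversion tool Gokhan describes might look like the following, written against the Hadoop 1.x-era SequenceFile API; the class name and paths are illustrative only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

/** Rewrites a seq2sparse output (Text -> VectorWritable) as IntWritable -> VectorWritable for LDA. */
public class TextKeyToIntKey {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path in = new Path(args[0]);   // e.g. a part file under .../tf-vectors
    Path out = new Path(args[1]);  // becomes the input for the LDA driver

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, in, conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        IntWritable.class, VectorWritable.class);
    try {
      Text docId = new Text();
      VectorWritable vector = new VectorWritable();
      int row = 0;
      while (reader.next(docId, vector)) {
        // Assign sequential integer row ids; keep the docId -> row mapping somewhere
        // if topics need to be mapped back to documents afterwards.
        writer.append(new IntWritable(row++), vector);
      }
    } finally {
      reader.close();
      writer.close();
    }
  }
}

The irritation, of course, is that every user ends up writing some variant of this glue instead of Mahout offering one consistent key convention end to end.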
>>>> I can give further examples, but I believe this makes my point clear.
>>>>
>>>> Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting it?
>>>>
>>>> IMO the knowledge discovery process has a well-defined path, and Mahout could define rules that would constrain developers and guide users. For example:
>>>>
>>>> - All algorithms take Mahout matrices as input and output.
>>>> - All preprocessing tools should be generic enough to produce appropriate input for Mahout algorithms.
>>>> - All algorithms should output a model that users can use beyond training and testing.
>>>> - Tools that dump results should follow a strictly defined format agreed on by the community.
>>>> - All similar kinds of algorithms should use the same evaluation tools.
>>>> - ...
>>>>
>>>> There could be separate layers: a preprocessing layer, an algorithms layer, an evaluation layer, and so on (see the rough sketch after this thread).
>>>>
>>>> This way users would be aware of the steps they need to perform, and any one step could be replaced by an alternative.
>>>>
>>>> Developers would contribute to the layer they feel comfortable with and would satisfy the expected inputs and outputs, preserving the integrity of the whole.
>>>>
>>>> Mahout has tools for nearly all of these layers, but personally, when I use Mahout (and I've been using it for a long time), I feel lost in the steps I should follow.
>>>>
>>>> Moreover, the refactoring could eliminate duplicate data structures and stick to Mahout matrices where available. All similarity measures operate on Mahout Vectors, for example.
>>>>
>>>> We, in the lab and in our company, do some of this. An example:
>>>>
>>>> We implemented an HBase-backed Mahout Matrix, which we use for projects where online learning algorithms operate on large input and learn a big parameter matrix (one needs this for matrix-factorization-based recommenders). The persistent parameter matrix then becomes an input for the live system. We also used the same matrix implementation as the underlying data store of Recommender DataModels. This was advantageous in many ways:
>>>>
>>>> - Everyone knows that any dataset should be in Mahout matrix format, and either applies the appropriate preprocessing or writes it.
>>>> - We can use different recommenders interchangeably.
>>>> - Any optimization of matrix operations applies everywhere.
>>>> - Different people can work on different parts (evaluation, model optimization, recommender algorithms) without bothering others.
>>>>
>>>> Apart from all this, I should say that I am always eager to contribute to Mahout, as some of the committers already know.
>>>>
>>>> Best Regards
>>>>
>>>> Gokhan
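For what it's worth, the layering Gokhan proposes could be pinned down with a handful of small contracts. The interfaces below are purely hypothetical (they do not exist in Mahout); they only illustrate how a preprocessing layer, an algorithms layer, and an evaluation layer could all meet at Mahout's Matrix and Vector types.

import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;

// Hypothetical layer contracts -- an illustration of the proposal, not existing Mahout APIs.

/** Preprocessing layer: raw input (text files, CSV, a Lucene index, ...) in, a Mahout matrix out. */
interface Preprocessor<RAW> {
  Matrix vectorize(RAW input);
}

/** A trained model that stays usable beyond training/testing, e.g. assigning a new document to a topic. */
interface Model {
  Vector apply(Vector instance);
}

/** Algorithms layer: every learner consumes a Matrix and yields a reusable Model. */
interface Learner<M extends Model> {
  M train(Matrix data);
}

/** Evaluation layer: shared by all algorithms of the same kind. */
interface Evaluator<M extends Model> {
  double evaluate(M model, Matrix heldOut);
}

/** With those contracts, the knowledge-discovery path reads as a straight pipeline. */
class Pipeline<RAW, M extends Model> {
  private final Preprocessor<RAW> preprocessor;
  private final Learner<M> learner;
  private final Evaluator<M> evaluator;

  Pipeline(Preprocessor<RAW> preprocessor, Learner<M> learner, Evaluator<M> evaluator) {
    this.preprocessor = preprocessor;
    this.learner = learner;
    this.evaluator = evaluator;
  }

  double run(RAW rawTrainingData, Matrix heldOut) {
    M model = learner.train(preprocessor.vectorize(rawTrainingData));
    return evaluator.evaluate(model, heldOut);
  }
}

Whether the contracts end up looking exactly like this matters less than the property Gokhan asks for: any step can be swapped for an alternative, and contributors to one layer do not have to understand the others.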
