It is critically important.

On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <martyk...@beavercreekconsulting.com> wrote:
> IMHO usability is really important. I've posted a couple of patches recently around making the RF classifiers easier to use. I found myself working on consistent data formats and command-line option support. It's not glamorous, but it's important.
>
> On 3/26/2013 8:26 PM, Ted Dunning wrote:
>
>> Gokhan,
>>
>> I think that the general drift of your recommendation is an excellent suggestion, and it is something that we have wrestled with a lot over time. The recommendation side of the house has more coherence in this matter than other parts, largely because there was a clear flow early on.
>>
>> Now, however, the flow is becoming clearer for the non-recommendation parts of the system.
>>
>> - We have 2-3 external kinds of input: text and matrices. Text comes in two major forms, text in files with unspecified separators and text in Lucene/Solr indexes. Matrices come in several forms, including triples, CSV files, binary matrices, and sequence files of vectors.
>>
>> - There are currently only a few ways to convert text and external data to matrices. The two most prominent are dictionary-based and hashed encoding. Hashed encoding is currently not as invertible as it should be. Dictionary-based encoding has the virtue of being invertible, but hashed encoding has considerably more generality. We have almost no support for multiple fields in dictionary-based encoding.
>>
>> - Good conversion back and forth depends on having schema information that we don't retain or specify well.
>>
>> - Knowledge discovery pathways need more flexibility than recommendation pathways regarding input and visualization.
>>
>> - The key knowledge discovery pathways that I know about include (a) input summarization, (b) vectorization, (c) unsupervised analysis such as LDA, LLL, clustering, and SVD, (d) supervised training such as SGD, naive Bayes, and random forests, and (e) visualization of results.
>>
>> I see the major problems in Mahout as the ones Gokhan named, with a few extras:
>>
>> 1) As Gokhan said, the exploratory pathways are inconsistent.
>>
>> 2) I think that our visualization pathways are also hideous.
>>
>> 3) I think that we need a good document format with a reasonable schema. Rather than create such a thing, I would nominate Lucene/Solr indexes as a first-class object in Mahout.
>>
>> 4) Our current command lines, with their many options and incompatible conventions, are a bit of a shambles.
>>
>> Expressed this way, I think that these usability issues are fixable.
>>
>> What does everybody else think? Would this leave us with a significantly better system?
>>
>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <gkhn...@gmail.com> wrote:
>>
>>> I am moving over the email that I wrote to the Call to Action thread, upon request.
>>>
>>> I'll start with an example of what I experience when I use Mahout, and then list my humble suggestions.
>>>
>>> When I try to run Latent Dirichlet Allocation for topic discovery, here are the steps to follow:
>>>
>>> 1- First I use seq2sparse to convert text to vectors. The output is Text, VectorWritable pairs. (If I have a CSV data file with lines of id, text pairs, which is an understandable format, I need to develop my own tool to convert it to vectors.)
>>>
>>> 2- I run LDA on the transformed data, but it doesn't work, because LDA needs IntWritable, VectorWritable pairs.
>>>
>>> 3- I convert the Text keys to IntWritable ones with a custom tool (a minimal sketch of such a tool is attached at the end of this mail).
>>> 4- Then I run LDA, and to see the results I need to run vectordump with the sort flag (which usually throws an OutOfMemoryError). An ldadump tool does not exist. What I see is fairly different from clusterdump output, so I spend some time figuring out what it means. (And I need to know in the first place that a vectordump tool is the way to see the results.)
>>>
>>> 5- After running LDA, when I have a document that I want to assign to a topic, there is no way (or none that I am aware of) to use my learned LDA model to assign that document to a topic. (A crude sketch of what such a step could look like is also attached at the end of this mail.)
>>>
>>> I can give further examples, but I believe this makes my point clear.
>>>
>>> Would you consider refactoring Mahout so that the project follows a clear, layered structure for all algorithms, and documenting that structure?
>>>
>>> IMO the knowledge discovery process has a well-defined path, and Mahout can define rules that would constrain developers and guide users. For example:
>>>
>>> - All algorithms take Mahout matrices as input and output.
>>> - All preprocessing tools should be generic enough to produce appropriate input for Mahout algorithms.
>>> - All algorithms should output a model that users can use beyond training and testing.
>>> - Tools that dump results should follow a strictly defined format agreed on by the community.
>>> - All algorithms of the same kind should use the same evaluation tools.
>>> - ...
>>>
>>> There could be separate layers: a preprocessing layer, an algorithms layer, an evaluation layer, and so on.
>>>
>>> This way users would be aware of the steps they need to perform, and any one step could be replaced by an alternative.
>>>
>>> Developers would contribute to the layer they feel comfortable with, satisfying the expected input and output to preserve the integrity of the whole.
>>>
>>> Mahout has tools for nearly all of these layers, but personally, when I use Mahout (and I have been using it for a long time), I feel lost in the steps I should follow.
>>>
>>> Moreover, the refactoring could eliminate duplicate data structures and standardize on Mahout matrices where available. All similarity measures would operate on Mahout Vectors, for example.
>>>
>>> We, in the lab and in our company, already do some of this. An example:
>>>
>>> We implemented an HBase-backed Mahout Matrix, which we use in projects where online learning algorithms operate on large input and learn a big parameter matrix (one needs this for matrix-factorization-based recommenders). The persistent parameter matrix then becomes an input for the live system. We also used the same matrix implementation as the underlying data store of Recommender DataModels. (A rough sketch of the idea is attached at the end of this mail.) This was advantageous in many ways:
>>>
>>> - Everyone knows that any dataset should be in Mahout matrix format, and applies the appropriate preprocessing, or writes it.
>>> - We can use different recommenders interchangeably.
>>> - Any optimization of matrix operations applies everywhere.
>>> - Different people can work on different parts (evaluation, model optimization, recommender algorithms) without bothering the others.
>>>
>>> Apart from all this, I should say that I am always eager to contribute to Mahout, as some of the committers already know.
>>>
>>> Best Regards
>>>
>>> Gokhan
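>>>
>>> P.S. To make step 3 concrete, here is a minimal sketch of such a conversion tool, using only the plain Hadoop SequenceFile API. The class name ReKeyTool is made up, and it assumes sequential integer document ids are acceptable:
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.fs.FileSystem;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.IntWritable;
>>> import org.apache.hadoop.io.SequenceFile;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.mahout.math.VectorWritable;
>>>
>>> // Hypothetical tool: re-keys seq2sparse output from (Text, VectorWritable)
>>> // pairs to the (IntWritable, VectorWritable) pairs that LDA expects.
>>> public class ReKeyTool {
>>>   public static void main(String[] args) throws IOException {
>>>     Configuration conf = new Configuration();
>>>     FileSystem fs = FileSystem.get(conf);
>>>     SequenceFile.Reader reader =
>>>         new SequenceFile.Reader(fs, new Path(args[0]), conf);
>>>     SequenceFile.Writer writer = SequenceFile.createWriter(
>>>         fs, conf, new Path(args[1]), IntWritable.class, VectorWritable.class);
>>>     Text oldKey = new Text();
>>>     VectorWritable value = new VectorWritable();
>>>     IntWritable newKey = new IntWritable();
>>>     int docId = 0;
>>>     try {
>>>       while (reader.next(oldKey, value)) {
>>>         newKey.set(docId++);           // assign sequential integer ids
>>>         writer.append(newKey, value);  // the Text-to-int mapping is lost here
>>>       }
>>>     } finally {
>>>       reader.close();
>>>       writer.close();
>>>     }
>>>   }
>>> }
>>>
>>> Run it with a seq2sparse output part file as the first argument and a fresh output path as the second. Note that the Text-to-int mapping is thrown away, which is exactly the kind of schema information Ted says we don't retain well.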
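>>>
>>> For step 5, here is a crude sketch of the missing topic-assignment step, assuming the learned per-topic term distributions have been loaded into a Mahout Matrix (rows = topics, columns = terms). It scores a new document against each topic and ignores the Dirichlet priors, so it illustrates the gap rather than doing proper LDA inference:
>>>
>>> import java.util.Iterator;
>>> import org.apache.mahout.math.Matrix;
>>> import org.apache.mahout.math.Vector;
>>>
>>> // Hypothetical helper: picks the topic whose term distribution gives the
>>> // new document the highest log-likelihood. Priors are ignored.
>>> public class NaiveTopicAssigner {
>>>   public static int assign(Matrix topicTermProbs, Vector doc) {
>>>     int best = -1;
>>>     double bestScore = Double.NEGATIVE_INFINITY;
>>>     for (int k = 0; k < topicTermProbs.rowSize(); k++) {
>>>       Vector topic = topicTermProbs.viewRow(k);
>>>       double score = 0.0;
>>>       Iterator<Vector.Element> it = doc.iterateNonZero();
>>>       while (it.hasNext()) {
>>>         Vector.Element e = it.next();
>>>         // smooth with a tiny epsilon so unseen terms don't give log(0)
>>>         score += e.get() * Math.log(topic.get(e.index()) + 1e-12);
>>>       }
>>>       if (score > bestScore) {
>>>         bestScore = score;
>>>         best = k;
>>>       }
>>>     }
>>>     return best;
>>>   }
>>> }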
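>>>
>>> And here is a rough sketch of the HBase-backed Matrix idea. This is not our actual implementation; the table layout and all names are made up for illustration. One HBase row per matrix row, one column qualifier per matrix column, doubles serialized with the HBase Bytes utility:
>>>
>>> import java.io.IOException;
>>> import org.apache.hadoop.conf.Configuration;
>>> import org.apache.hadoop.hbase.client.Get;
>>> import org.apache.hadoop.hbase.client.HTable;
>>> import org.apache.hadoop.hbase.client.Put;
>>> import org.apache.hadoop.hbase.client.Result;
>>> import org.apache.hadoop.hbase.util.Bytes;
>>> import org.apache.mahout.math.AbstractMatrix;
>>> import org.apache.mahout.math.DenseMatrix;
>>> import org.apache.mahout.math.Matrix;
>>> import org.apache.mahout.math.Vector;
>>>
>>> // Hypothetical sketch, not our production code: a Mahout Matrix whose
>>> // cells live in an HBase table. Missing cells read as 0.0.
>>> public class HBaseMatrix extends AbstractMatrix {
>>>   private static final byte[] FAMILY = Bytes.toBytes("d");
>>>   private final HTable table;
>>>
>>>   public HBaseMatrix(Configuration conf, String tableName,
>>>                      int rows, int columns) throws IOException {
>>>     super(rows, columns);
>>>     this.table = new HTable(conf, tableName);
>>>   }
>>>
>>>   @Override
>>>   public double getQuick(int row, int column) {
>>>     try {
>>>       Result result = table.get(new Get(Bytes.toBytes(row)));
>>>       byte[] cell = result.getValue(FAMILY, Bytes.toBytes(column));
>>>       return cell == null ? 0.0 : Bytes.toDouble(cell);
>>>     } catch (IOException e) {
>>>       throw new IllegalStateException(e);
>>>     }
>>>   }
>>>
>>>   @Override
>>>   public void setQuick(int row, int column, double value) {
>>>     try {
>>>       Put put = new Put(Bytes.toBytes(row));
>>>       put.add(FAMILY, Bytes.toBytes(column), Bytes.toBytes(value));
>>>       table.put(put);
>>>     } catch (IOException e) {
>>>       throw new IllegalStateException(e);
>>>     }
>>>   }
>>>
>>>   @Override
>>>   public Matrix like() {
>>>     return new DenseMatrix(rowSize(), columnSize());
>>>   }
>>>
>>>   @Override
>>>   public Matrix like(int rows, int columns) {
>>>     return new DenseMatrix(rows, columns);
>>>   }
>>>
>>>   @Override
>>>   public Matrix assignRow(int row, Vector other) {
>>>     for (int col = 0; col < columnSize(); col++) {
>>>       setQuick(row, col, other.getQuick(col));
>>>     }
>>>     return this;
>>>   }
>>>
>>>   @Override
>>>   public Matrix assignColumn(int column, Vector other) {
>>>     for (int r = 0; r < rowSize(); r++) {
>>>       setQuick(r, column, other.getQuick(r));
>>>     }
>>>     return this;
>>>   }
>>> }
>>>
>>> A real implementation would batch puts and cache gets; the per-cell round trips above are only there to keep the sketch short.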