Can you post a list of those patches? I haven't been tracking them carefully, and unless I have a free moment when the email comes through (<10% chance lately), I lose track.
On Wed, Mar 27, 2013 at 7:30 AM, Marty Kube <[email protected]> wrote:

> So I'd like to continue to improve the RF classifier code. I've been
> posting patches along the lines of the refactoring discussed here. The
> patches are not being looked at. Someone should be considering patches in
> this area. Maybe I could handle that :-)
>
>
> Sent from my iPhone
>
> On Mar 27, 2013, at 12:14 AM, Sebastian Schelter <[email protected]> wrote:
>
> > Totally agree on that. The impact of making Mahout more usable is much
> > higher than that of adding a new algorithm.
> >
> > On 27.03.2013 05:41, Ted Dunning wrote:
> >> It is critically important.
> >>
> >> On Wed, Mar 27, 2013 at 2:14 AM, Marty Kube <
> >> [email protected]> wrote:
> >>
> >>> IMHO usability is really important. I've posted a couple of patches
> >>> recently around making the RF classifiers easier to use. I found
> >>> myself working on consistent data format and command line option
> >>> support. It's not glamorous but it's important.
> >>>
> >>>
> >>> On 3/26/2013 8:26 PM, Ted Dunning wrote:
> >>>
> >>>> Gokhan,
> >>>>
> >>>> I think that the general drift of your recommendation is an excellent
> >>>> suggestion and it is something that we have wrestled with a lot over
> >>>> time. The recommendations side of the house has more coherence in
> >>>> this matter than other parts largely because there was a clear flow
> >>>> early on.
> >>>>
> >>>> Now, however, the flow is becoming more clear for non-recommendation
> >>>> parts of the system.
> >>>>
> >>>> - we have 2-3 external kinds of input. These include text and
> >>>> matrices. Text comes in two major forms, those being text in files
> >>>> with unspecified separators and text in Lucene/Solr indexes. Matrices
> >>>> come in several forms including triples, CSV files, binary matrices
> >>>> and sequence files of vectors.
> >>>>
> >>>> - there are currently only a few ways to convert text and external
> >>>> data to matrices. The two most prominent are dictionary based and
> >>>> hashed encoding. Hashed encoding is currently not as invertible as it
> >>>> should be. Dictionary based has the virtue of being invertible, but
> >>>> hashed encoding has considerably more generality. We have almost no
> >>>> support for multiple fields in dictionary based encoding.
> >>>>
> >>>> - good conversion backwards and forwards depends on having schema
> >>>> information that we don't retain or specify well.
> >>>>
> >>>> - knowledge discovery pathways need more flexibility than
> >>>> recommendation pathways regarding input and visualization.
> >>>>
> >>>> - key knowledge discovery pathways that I know about include (a) input
> >>>> summarization, (b) vectorization, (c) unsupervised analysis such as
> >>>> LDA, LLL, clustering, SVD, (d) supervised training such as SGD, Naive
> >>>> Bayes and random forest, and (e) visualization of results
> >>>>
> >>>> I see that the major problems in Mahout are what Gokhan said, but
> >>>> with a few extras
> >>>>
> >>>> 1) as Gokhan said, the exploratory pathways are inconsistent
> >>>>
> >>>> 2) I think that our visualization pathways are also hideous
> >>>>
> >>>> 3) I think that we need a good document format with a reasonable
> >>>> schema. Rather than create such a thing, I would nominate Lucene/Solr
> >>>> indexes as a first class object in Mahout.
> >>>>
> >>>> 4) our current command lines with all the (many) different options
> >>>> with incompatible conventions is a bit of a shambles
> >>>>
> >>>> Expressed this way, I think that these usability issues are fixable.
> >>>>
> >>>> What does everybody else think? Would this leave us with a
> >>>> significantly better system?
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Mar 26, 2013 at 9:35 PM, Gokhan Capan <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I am moving my email that I wrote to Call to Action upon request.
> >>>>>
> >>>>> I'll start with an example that I experience when I use Mahout, and
> >>>>> list my humble suggestions.
> >>>>>
> >>>>> When I try to run Latent Dirichlet Allocation for topic discovery,
> >>>>> here are the steps to follow:
> >>>>>
> >>>>> 1- First I use seq2sparse to convert text to vectors. The output is
> >>>>> Text, VectorWritable pairs (If I have a csv data file –which is
> >>>>> understandable-, which has lines of id, text pairs, I need to
> >>>>> develop my own tool to convert it to vectors.)
> >>>>>
> >>>>> 2- I run LDA on data I transformed, but it doesn’t work, because LDA
> >>>>> needs IntWritable, VectorWritable pairs.
> >>>>>
> >>>>> 3- I convert Text keys to IntWritable ones with a custom tool.
> >>>>>
> >>>>> 4- Then I run LDA, and to see the results, I need to run vectordump
> >>>>> with sort flag (It usually throws OutOfMemoryError). An ldadump tool
> >>>>> does not exist. What I see is fairly different from clusterdump
> >>>>> results, so I spend some time to understand what that means. (And I
> >>>>> need to know there exists a vectordump tool to see the results)
> >>>>>
> >>>>> 5- After running LDA, when I have a document that I want to assign
> >>>>> to a topic, there is no way -or I am not aware- to use my learned
> >>>>> LDA model to assign this document to a topic.
> >>>>>
> >>>>> I can give further examples, but I believe this will make my point
> >>>>> clear.
> >>>>>
> >>>>>
> >>>>> Would you consider to refactor Mahout, so that the project follows a
> >>>>> clear, layered structure for all algorithms, and to document it?
> >>>>>
> >>>>> IMO Knowledge Discovery process has a certain path, and Mahout can
> >>>>> define rules, those would force developers and guide users. For
> >>>>> example:
> >>>>>
> >>>>>
> >>>>> - All algorithms take Mahout matrices as input and output.
> >>>>> - All preprocessing tools should be generic enough, so that they
> >>>>> produce appropriate input for mahout algorithms.
> >>>>> - All algorithms should output a model that users can use them
> >>>>> beyond training and testing
> >>>>> - Tools those dump results should follow a strictly defined format
> >>>>> suggested by community
> >>>>> - All similar kinds of algorithms should use same evaluation tools
> >>>>> - ...
> >>>>>
> >>>>> There may be separated layers: preprocessing layer, algorithms
> >>>>> layer, evaluation layer, and so on.
> >>>>>
> >>>>> This way users would be aware of the steps they need to perform, and
> >>>>> one step can be replaced by an alternative.
> >>>>>
> >>>>> Developers would contribute to the layer they feel comfortable with,
> >>>>> and would satisfy the expected input and output, to preserve the
> >>>>> integrity.
> >>>>>
> >>>>> Mahout has tools for nearly all of these layers, but personally when
> >>>>> I use Mahout (and I’ve been using it for a long time), I feel lost
> >>>>> in the steps I should follow.
> >>>>>
> >>>>> Moreover, the refactoring may eliminate duplicate data structures,
> >>>>> and stick to Mahout matrices if available. All similarity measures
> >>>>> operate on Mahout Vectors, for example.
> >>>>>
> >>>>> We, in the lab and in our company, do some of that. An example:
> >>>>>
> >>>>> We implemented an HBase backed Mahout Matrix, which we use for our
> >>>>> projects where online learning algorithms operate on large input and
> >>>>> learn a big parameter matrix (one needs this for matrix
> >>>>> factorization based recommenders). Then the persistent parameter
> >>>>> matrix becomes an input for the live system. Then we used the same
> >>>>> matrix implementation as the underlying data store of Recommender
> >>>>> DataModels. This was advantageous in many ways:
> >>>>>
> >>>>> - Everyone knows that any dataset should be in Mahout matrix format,
> >>>>> and applies appropriate preprocessing, or writes one
> >>>>> - We can use different recommenders interchangeably
> >>>>> - Any optimization on matrix operations apply everywhere
> >>>>> - Different people can work on different parts (evaluation, model
> >>>>> optimization, recommender algorithms) without bothering others
> >>>>>
> >>>>> Apart from all, I should say that I am always eager to contribute to
> >>>>> Mahout, as some of committers already know.
> >>>>>
> >>>>> Best Regards
> >>>>>
> >>>>> Gokhan
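
For anyone following the point in the quoted thread that dictionary based encoding is invertible while hashed encoding is more general but not as invertible, here is a rough sketch of the difference. It is plain Python and does not use any Mahout API: the function names, the sparse dict representation, and the use of CRC32 as the hash are all assumptions made for illustration only.

```python
import zlib


def dictionary_encode(tokens, dictionary):
    """Count tokens into a sparse vector, assigning each new token the next
    free index in `dictionary` (hypothetical helper, not a Mahout call)."""
    vector = {}
    for token in tokens:
        index = dictionary.setdefault(token, len(dictionary))
        vector[index] = vector.get(index, 0) + 1
    return vector


def dictionary_decode(vector, dictionary):
    """Invert a dictionary-encoded vector back to token counts."""
    inverse = {index: token for token, index in dictionary.items()}
    return {inverse[index]: count for index, count in vector.items()}


def hashed_encode(tokens, num_features):
    """Count tokens into a fixed-size sparse vector by hashing.
    CRC32 is an arbitrary stand-in hash chosen for this sketch.
    No dictionary is kept, so colliding tokens cannot be told apart."""
    vector = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0) + 1
    return vector


dictionary = {}
encoded = dictionary_encode(["topic", "model", "topic"], dictionary)
decoded = dictionary_decode(encoded, dictionary)

# With a single hash bucket every token collides, so the original
# tokens cannot be recovered from the hashed vector.
hashed = hashed_encode(["topic", "model", "topic"], num_features=1)
```

The dictionary version needs shared state (the dictionary) to be stored alongside the vectors, which is where the schema question raised in the thread comes in. The hashed version needs no such state, which is why it handles arbitrary fields easily but gives up the way back to the original tokens.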
