On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:

> We are now beginning to have lots of classifiers in Mahout. The naive
> Bayes, complementary naive Bayes and random forest grandfathers have
> been joined by my recent SGD and Zhao Zhendong's prolific set of
> approaches for logistic regression and SVM variants.
>
> All of these implementations have similar characteristics, and
> virtually none are inter-operable.
>
> Even worse, the model produced by a clustering system is really just
> like a model produced by a classifier, so we stand to increase the
> number of sources of incompatible classifiers even more. Altogether,
> we probably have a dozen ways of building classifiers.
>
> I would like to start a discussion about a framework into which we can
> fit all of these approaches, in much the same way that the
> recommendations code has such nice pluggable properties.
>
> As I see it, the opportunities for commonality (aka our current
> deficiencies) include:
>
> - original input format reading
>
> -- the naive Bayes code uses an ad hoc format similar to what Jason
> Rennie used for 20 newsgroups. This code uses Lucene 3.0 style
> analyzers.
>
> -- Zhao uses something a lot like the SVMLight input format
>
> -- The SGD code reads CSV data
>
> -- Drew wrote some Avro document code
>
> -- Lucene has been used as a source of vectors for clustering
>
> My summary here is that the Lucene analyzers look like they could be
> used very effectively for our purposes. We would need to write
> AttributeFilters that do two kinds of vectorization (random projection
> and dictionary-based). We should also have four standard input format
> parsers as examples (CSV, SVMLight, Vowpal Wabbit, and the current
> naive Bayes format).
>
> We need something simple and general that subsumes all of these input
> use cases.
>
> - conversion to vectors
>
> -- SGD does this via random projection
>
> -- Naive Bayes has some dictionary-based conversions
>
> -- Other stuff does this or that
>
> This should be subsumed into the AttributeFilters that I mentioned
> above. We really just need random projection and Salton-style vector
> space models. Clearly, we should allow direct input of vectors as
> well, in case the user is producing them for us.
>
> - command line option processing
>
> We really need a simple way to integrate all of the input processing
> options easily into both new and old code.
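To make the two vectorization modes above concrete, here is a rough
sketch of what those AttributeFilters would compute (plain Java with no
Lucene plumbing; DictionaryEncoder and HashedEncoder are made-up names
for illustration, not existing Mahout classes, and real code would emit
Mahout Vectors rather than double[]):

import java.util.HashMap;
import java.util.Map;

public class EncodingSketch {

  // Salton-style vector space model: every distinct term gets its own
  // dimension, assigned on first sight.
  static class DictionaryEncoder {
    private final Map<String, Integer> dictionary =
        new HashMap<String, Integer>();

    double[] encode(Iterable<String> terms, int cardinality) {
      double[] v = new double[cardinality];
      for (String term : terms) {
        Integer index = dictionary.get(term);
        if (index == null) {
          index = dictionary.size();  // grow the dictionary as terms appear
          dictionary.put(term, index);
        }
        if (index < cardinality) {
          v[index] += 1.0;  // raw term frequency; TF-IDF weighting would go here
        }                   // terms past cardinality are dropped in this sketch
      }
      return v;
    }
  }

  // Random-projection style: hash each term into a fixed number of
  // dimensions, with a +/-1 sign so collisions tend to cancel. No
  // dictionary has to be stored or shipped with the model.
  static class HashedEncoder {
    double[] encode(Iterable<String> terms, int cardinality) {
      double[] v = new double[cardinality];
      for (String term : terms) {
        int h = term.hashCode();
        int index = (h & Integer.MAX_VALUE) % cardinality;  // fold into [0, cardinality)
        double sign = (h < 0) ? -1.0 : 1.0;                 // top bit picks the sign
        v[index] += sign;
      }
      return v;
    }
  }
}

The trade-off is the usual one: the dictionary variant gives
interpretable features but the dictionary has to travel with the model,
while the hashed variant gives up a little accuracy for a fixed memory
footprint, which matters for streaming learners like SGD.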
More or less, what we need is a pipeline that can ingest many different
kinds of things and output Vectors, right (assuming Bayes is converted
to use vectors)? Ideally it would be easy to configure, work well in a
cluster, and be able to output various formats (for instance, frequent
itemsets as well).

> - model storage
>
> It would be lovely if we could instantiate a model from a stored form
> without even knowing what kind of learning produced the model. All of
> the classifiers and clustering algorithms should put out something
> that can be instantiated this way. I used Gson in the SGD code and
> found it pretty congenial, but I didn't encode the class of the
> classifier, nor did I provide a classifier abstract class. I don't
> know what k-means or Canopy clustering produce, nor random forests or
> naive Bayes, but I am sure that all of them are highly specific to the
> particular kind of model.

Just to be clear, are you suggesting that, ultimately, the models can
be used interchangeably?

> I don't know what is best here, but we definitely need something more
> common than what we have.
>
> What do others think?

Definitely agree.
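On the model storage point, something like the following would let any
of these algorithms write a model that can be re-instantiated without
the caller knowing its concrete type. This is just a sketch building on
the Gson approach already in the SGD code; ModelSerializer is a
hypothetical name, not an existing Mahout class:

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public class ModelSerializer {
  private static final Gson GSON = new Gson();

  // Record the concrete class name alongside the serialized model state.
  public static String toJson(Object model) {
    JsonObject wrapper = new JsonObject();
    wrapper.addProperty("class", model.getClass().getName());
    wrapper.add("model", GSON.toJsonTree(model));
    return GSON.toJson(wrapper);
  }

  // Re-instantiate a model from its stored form without compile-time
  // knowledge of what kind of learning produced it.
  public static Object fromJson(String json) throws ClassNotFoundException {
    JsonObject wrapper = new JsonParser().parse(json).getAsJsonObject();
    Class<?> clazz = Class.forName(wrapper.get("class").getAsString());
    return GSON.fromJson(wrapper.get("model"), clazz);
  }
}

That only solves loading, though, not interchangeability: for the models
to actually be used interchangeably we would still need the shared
abstract classifier class (say, classify(Vector) returning a score or
category) that every implementation extends, so the caller has
something to cast the result to.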