On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:

> We are now beginning to have lots of classifiers in Mahout.  The naive
> Bayes, complementary naive Bayes and random forest grandfathers have been
> joined by my recent SGD and Zhao Zhendong's prolific set of approaches for
> logistic regression and SVM variants.
> 
> All of these implementations have similar characteristics and virtually none
> are interoperable.
> 
> Even worse, the model produced by a clustering system is really just like a
> model produced by a classifier, so clustering only increases the number of
> sources of incompatible classifiers.  Altogether, we probably have a dozen
> ways of building classifiers.
> 
> I would like to start a discussion about a framework into which we can fit
> all of these approaches, in much the same way that the recommendations
> stuff has such nice pluggable properties.
> 
> As I see it, the opportunities for commonality (aka our current
> deficiencies)  include:
> 
> - original input format reading
> 
> -- the naive Bayes code uses an ad hoc format similar to what Jason Rennie
> used for 20 newsgroups.  This code uses Lucene 3.0 style analyzers.
> 
> -- Zhao uses something a lot like SVMLight input format
> 
> -- The SGD code looks at CSV data
> 
> -- Drew wrote some Avro document code
> 
> -- Lucene has been used as a source of vectors for clustering
> 
> My summary here is that the Lucene analyzers look like they could be used
> very effectively for our purposes.  We would need to write AttributeFilters
> that do two kinds of vectorization (random projection and dictionary based;
> see the sketch just below).  We also should have 4 standard input format
> parsers as examples (CSV, SVMLight, VowpalWabbit, current naive Bayes
> format).
> 
> We need something simple and general that subsumes all of these input use
> cases.
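
For the dictionary-based case, here is a minimal sketch of what I'd picture;
the class name and the growing in-memory dictionary are illustrative
assumptions on my part, not a proposal for the real implementation:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class DictionaryVectorizer {
  private final Analyzer analyzer;
  private final Map<String, Integer> dictionary = new HashMap<String, Integer>();
  private final int cardinality;

  public DictionaryVectorizer(Analyzer analyzer, int cardinality) {
    this.analyzer = analyzer;
    this.cardinality = cardinality;
  }

  public Vector vectorize(String text) throws IOException {
    Vector v = new RandomAccessSparseVector(cardinality);
    TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      Integer index = dictionary.get(termAtt.term());
      if (index == null && dictionary.size() < cardinality) {
        index = dictionary.size();          // assign the next free dimension
        dictionary.put(termAtt.term(), index);
      }
      if (index != null) {
        v.set(index, v.get(index) + 1);     // simple term-frequency weighting
      }
    }
    return v;
  }
}
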
> 
> - conversion to vectors
> 
> -- SGD introduced vectorization based on random projection
> 
> -- Naive Bayes has some dictionary based conversions
> 
> -- Other stuff does this or that
> 
> This should be subsumed into the AttributeFilters that I mentioned above.
> We really just need random projection and Salton-style vector space models.
> Clearly, we should allow direct input of vectors as well in case the user
> is producing them for us.
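
The random projection side could be almost as small as this hashed-feature
sketch (the class name and the sign trick are my assumptions, not anything
in the code today):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class HashedVectorizer {
  private final int numFeatures;

  public HashedVectorizer(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  public Vector vectorize(Iterable<String> tokens) {
    Vector v = new RandomAccessSparseVector(numFeatures);
    for (String token : tokens) {
      int h = token.hashCode();
      int index = Math.abs(h % numFeatures);       // project term onto a fixed dimension
      double sign = (h >= 0) ? 1.0 : -1.0;         // +/-1 keeps collisions roughly unbiased
      v.set(index, v.get(index) + sign);
    }
    return v;
  }
}
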
> 
> - command line option processing
> 
> We really need a simple way to integrate all of the input processing
> options into new and old code alike.
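
For that, even a shared helper over plain Commons CLI would go a long way
(a sketch only; the option names here are invented):

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.cli.PosixParser;

public final class InputOptions {
  private InputOptions() {}

  /** The input-processing options every trainer would share. */
  public static Options standard() {
    Options options = new Options();
    options.addOption("i", "input", true, "input file or directory");
    options.addOption("f", "format", true, "csv | svmlight | vw | bayes");
    options.addOption("v", "vectorizer", true, "dictionary | randomprojection");
    return options;
  }

  public static CommandLine parse(String[] args) throws ParseException {
    return new PosixParser().parse(standard(), args);
  }
}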

More or less, what we need is a pipeline that can ingest many different kinds
of things and output Vectors, right (assuming Bayes is converted to use
vectors)?  Ideally it would be easy to configure, work well on a cluster, and
be able to output various formats (frequent itemsets, for instance).
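
The contract could be as thin as this (entirely hypothetical names):

import java.util.Iterator;

import org.apache.mahout.math.Vector;

public interface VectorPipeline<I> {
  /** Lazily convert raw inputs (documents, CSV rows, ...) into Vectors. */
  Iterator<Vector> vectorize(Iterator<I> rawInput);
}

Implementations would compose a parser (CSV, SVMLight, ...) with one of the
vectorizers sketched above, and a Hadoop job could wrap the same interface
for the cluster case.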



> 
> - model storage
> 
> It would be lovely if we could instantiate a model from a stored form
> without even knowing what kind of learning produced the model.  All of the
> classifiers and clustering algorithms should put out something that can be
> instantiated this way.  I used Gson in the SGD code and found it pretty
> congenial, but I didn't encode the class of the classifier, nor did I
> provide a classifier abstract class.  I don't know what k-means or Canopy
> clustering produce, nor what random forests or naive Bayes produce, but I
> am sure that all of them are highly specific to the particular kind of model.
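
Encoding the class alongside the Gson output seems cheap enough to fix the
instantiation side; something like this wrapper (the class and method names
here are made up):

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

public final class ModelStore {
  private static final Gson GSON = new Gson();

  /** Serialize any model, tagging it with its concrete class. */
  public static String save(Object model) {
    JsonObject wrapper = new JsonObject();
    wrapper.addProperty("class", model.getClass().getName());
    wrapper.add("model", GSON.toJsonTree(model));
    return wrapper.toString();
  }

  /** Re-instantiate a model without knowing in advance what produced it. */
  public static Object load(String json) throws ClassNotFoundException {
    JsonObject wrapper = new JsonParser().parse(json).getAsJsonObject();
    Class<?> modelClass = Class.forName(wrapper.get("class").getAsString());
    return GSON.fromJson(wrapper.get("model"), modelClass);
  }
}
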

Just to be clear, are you suggesting that, ultimately, the models can be used 
interchangeably?
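
If so, the stored form would also need a shared supertype to come back as;
something like this (signatures purely illustrative):

import org.apache.mahout.math.Vector;

public interface ClassifierModel {
  /** Score each category for the given instance; clustering models could
      treat clusters as categories. */
  Vector classify(Vector instance);

  /** Number of categories (or clusters) this model distinguishes. */
  int numCategories();
}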

> 
> I don't know what is best here, but we definitely need something more common
> than what we have.
> 
> What do others think?


Definitely agree.
