I believe most of this new NB discussion has been over chat, so here is the
state of the NB universe from my view:
1) The original NB and CNB code was as follows:
- Tokenize and find all possible collocations
- Compute TF and IDF for each n-gram
- Compute global and per-class sums for TF, IDF, and TF-IDF
- Dump these counts into SequenceFiles
- Load these into memory or HBase and compute a score for each
vector-label combination
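The pipeline above boils down to standard multinomial NB scoring over the
precomputed per-class sums. A toy Python sketch (not Mahout code; all the
names here are illustrative) of the train and score steps:

```python
import math
from collections import defaultdict

def train(docs):
    """docs: list of (label, {term: tfidf_weight}) pairs.
    Accumulates per-class term sums and per-class totals."""
    class_sums = defaultdict(lambda: defaultdict(float))
    class_totals = defaultdict(float)
    for label, vec in docs:
        for term, w in vec.items():
            class_sums[label][term] += w
            class_totals[label] += w
    return class_sums, class_totals

def score(vec, label, class_sums, class_totals, vocab_size, alpha=1.0):
    """Laplace-smoothed log-likelihood of the document vector under label."""
    s = 0.0
    for term, w in vec.items():
        num = class_sums[label].get(term, 0.0) + alpha
        den = class_totals[label] + alpha * vocab_size
        s += w * math.log(num / den)
    return s

def classify(vec, class_sums, class_totals, vocab_size):
    """Pick the label with the highest score for this document vector."""
    return max(class_sums,
               key=lambda l: score(vec, l, class_sums, class_totals, vocab_size))
```

The memory-versus-lookup issue described below comes from `class_sums`:
in-memory it is classes * vocabulary entries, which is what pushed the
original code toward HBase.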
Issues
A) It's slow. The collocation step, though efficient in its implementation
(zero memory overhead, using a secondary sort), explodes the learning time.
B) It's a memory hog. For really large models you need HBase to store the
counts efficiently. The classifier has a cache for frequently used words in
the language, so the classification overhead depends on the number of
infrequent words in the document and on how many parallel lookups you can
do on an HBase cluster.
2) The new NB and CNB code is as follows:
- The redesigned naive bayes doesn't work over words. It assumes the input
is a document vector and computes TF-IDF and weights. (This is implemented)
- The per-class weight vectors are kept in memory and updated, so the
limiting factor here is the number of classes times the number of
dimensions. (This is implemented)
- If the vector space is limited using randomized hashing (Ted's
technique), then you can bound the space. However, for (all possible)
n-grams you would need a very large dimension, which makes it unusable.
(This is not done)
- So one needs to create collocation vectors smartly. (This is not done)
- The implementation as of now learns the model, has model serialization
and deserialization methods, and an interface for classifying using the
loaded model. (This is implemented)
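The randomized hashing mentioned above is the usual hashing trick: project
tokens into a fixed-size vector space so memory stays bounded at classes
times dimensions. A rough Python sketch under my own assumptions (the
function name and defaults are made up, not from the Mahout code):

```python
import hashlib

def hashed_vector(tokens, dim=1 << 18):
    """Project tokens into a fixed dim-dimensional space via hashing.
    Collisions are tolerated; dim bounds the per-class memory."""
    vec = [0.0] * dim  # dense for clarity; a real implementation would be sparse
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        idx = h % dim
        # A signed hash reduces the bias introduced by collisions.
        sign = 1.0 if (h >> 64) % 2 == 0 else -1.0
        vec[idx] += sign
    return vec
```

The catch noted above remains: if the features are all possible n-grams,
`dim` has to be very large before collisions stop hurting, which is why this
only pays off once collocation vectors are built smartly.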
Issues
A) It lacks train and test driver code; it just has the core implementation.
B) It is not integrated with the evaluation classes (Confusion Matrix,
per-label precision/recall).
C) We need to port the collocations driver to generate collocations and
convert documents to vectors.
D) The multilabel classifier does not use a common interface like the one
in the logistic regression package.
When I checked in the code I didn't have time to pursue this further. If
someone can recommend the right approach to fixing this package (the right
interface to use, how it should behave with the rest of the code), it
becomes easier for me to jump back in and mould the current implementation.
------
Robin Anil
On Wed, May 9, 2012 at 5:48 AM, Grant Ingersoll <[email protected]> wrote:
>
> On May 8, 2012, at 12:43 PM, Jake Mannix wrote:
>
> > On Tue, May 8, 2012 at 9:31 AM, Ted Dunning <[email protected]>
> wrote:
> >
> >> This is frustrating to consider losing Bayes, but I would consider
> keeping
> >> it if only to decrease the number of questions on the list about why the
> >> examples from the book don't work.
> >>
> >
> > Could maybe someone just sit down and rewrite it? Naive Bayes is not a
> > particularly
> > difficult thing to implement, even distributed (it's like, word-count,
> > basically. Ok,
> > maybe it's more like counting collocations, but still!).
> >
> > It would be pretty silly to not have an NB impl (although I agree that
> it's
> > even worse
> > to have a broken or clunky one).
>
> I agree. The vector based one is a rewrite, so we probably should just go
> from there. Not sure it is broken, but Robin is the primary person
> familiar with it and in the past I've pinged the list on the state of it
> (and trying to get explanations on certain parts of it) and not gotten
> answers.
> With all of these Hadoop algorithms, the other thing we really need is to
> make them programmatically easier to integrate. The Driver mode is not too
> bad for testing, etc. but it makes it harder to integrate, as others have
> pointed out.