Any directions on what pattern I should follow for the redesign?

------
Robin Anil
On Wed, May 9, 2012 at 9:49 AM, Robin Anil <[email protected]> wrote:

> I believe most of this new NB discussion has been over chat, so here is
> the state of the NB universe from my view.
>
> 1) The original NB and CNB code was as follows:
>    - Tokenize and find all possible collocations
>    - Compute TF and IDF for each ngram
>    - Compute global and per-class sums for TF, IDF, and TF-IDF
>    - Dump these counts in SequenceFiles
>    - Load these into memory or HBase and compute the score for each
>      vector-label combination
>
> Issues:
> A) It's slow. The collocation step, though efficient in its
>    implementation (zero memory overhead, using secondary sort), just
>    explodes the learning time.
> B) It's a memory hog. For really large models you really need HBase to
>    store the counts efficiently. The class has a cache for frequently
>    used words in the language, so the overhead of classification depends
>    on the number of infrequent words in the document and the amount of
>    parallel lookups you can do on an HBase cluster.
>
> 2) The new NB and CNB code is as follows:
>    - The redesigned naive bayes doesn't work over words. It assumes the
>      input is a document vector and computes TF-IDF and weights. (This
>      is implemented.)
>    - The per-class weight vectors are kept in memory and updated, so the
>      limiting factor here is the number of classes * the number of
>      dimensions. (This is implemented.)
>    - If the vector space is limited using randomized hashing (Ted's
>      technique), then you can limit the space. However, for (all
>      possible) ngrams you will need a large dimension, which makes it
>      unusable. (This is not done.)
>    - So one needs to create collocation vectors smartly. (This is not
>      done.)
>    - The implementation as of now learns the model, has model
>      serialization and deserialization methods, and an interface for
>      classifying using the loaded model. (This is implemented.)
>
> Issues:
> A) It lacks train and test driver code; it just has the core
>    implementation.
> B) It is not integrated with the evaluation classes (confusion matrix,
>    per-label precision/recall).
> C) We need to port the collocations driver to generate collocations and
>    convert documents to vectors.
> D) The multilabel classifier is not using any common interface like the
>    logistic regression package.
>
> When I checked in the code I didn't have time to pursue this. If someone
> can recommend the right approach to fixing this package (like the right
> interface to use, and how it should behave with the rest of the code),
> it becomes easier for me to jump back to moulding the current
> implementation.
>
> ------
> Robin Anil
>
> On Wed, May 9, 2012 at 5:48 AM, Grant Ingersoll <[email protected]> wrote:
>
>> On May 8, 2012, at 12:43 PM, Jake Mannix wrote:
>>
>> > On Tue, May 8, 2012 at 9:31 AM, Ted Dunning <[email protected]> wrote:
>> >
>> >> It is frustrating to consider losing Bayes, but I would consider
>> >> keeping it if only to decrease the number of questions on the list
>> >> about why the examples from the book don't work.
>> >
>> > Could maybe someone just sit down and rewrite it? Naive Bayes is not
>> > a particularly difficult thing to implement, even distributed (it's
>> > like word-count, basically. OK, maybe it's more like counting
>> > collocations, but still!).
>> >
>> > It would be pretty silly not to have an NB impl (although I agree
>> > that it's even worse to have a broken or clunky one).
>>
>> I agree. The vector-based one is a rewrite, so we should probably just
>> go from there. Not sure it is broken, but Robin is the primary person
>> familiar with it, and in the past I've pinged the list on the state of
>> it (and tried to get explanations on certain parts of it) and not
>> gotten answers.
>>
>> With all of these Hadoop algorithms, the other thing we really need is
>> to make them programmatically easier to integrate. The Driver mode is
>> not too bad for testing, etc., but it makes it harder to integrate, as
>> others have pointed out.
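
For readers following along, here is a minimal sketch of the redesigned approach Robin describes: per-class weight vectors held in memory, with the dimension bounded by randomized (hash) feature mapping. All class and method names below are illustrative, not Mahout's actual API, and for brevity it scores with plain smoothed counts rather than the TF-IDF weighting the real implementation computes.

```java
// Hypothetical sketch of a hashed-feature naive Bayes, assuming the
// memory model described above: weights are classes x dim, nothing else.
public class HashedNaiveBayes {
    private final int dim;           // size of the hashed vector space
    private final double[][] weight; // per-class term counts: classes x dim
    private final double[] total;    // per-class total term count

    public HashedNaiveBayes(int numClasses, int dim) {
        this.dim = dim;
        this.weight = new double[numClasses][dim];
        this.total = new double[numClasses];
    }

    // Randomized hashing: map any token (word or ngram) into a
    // fixed-size index, so model memory stays numClasses * dim.
    private int index(String token) {
        return Math.floorMod(token.hashCode(), dim);
    }

    // Turn a tokenized document into a hashed term-frequency vector;
    // in the real pipeline this would come from the vectorizer.
    public double[] vectorize(String[] tokens) {
        double[] v = new double[dim];
        for (String t : tokens) v[index(t)] += 1.0;
        return v;
    }

    // Training just accumulates document vectors into the in-memory
    // per-class weight vector for the given label.
    public void train(int label, double[] docVector) {
        for (int i = 0; i < dim; i++) {
            weight[label][i] += docVector[i];
            total[label] += docVector[i];
        }
    }

    // Score each class by summed log-likelihood with Laplace smoothing
    // and return the argmax label.
    public int classify(double[] docVector) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < weight.length; c++) {
            double score = 0.0;
            for (int i = 0; i < dim; i++) {
                if (docVector[i] == 0.0) continue;
                score += docVector[i]
                       * Math.log((weight[c][i] + 1.0) / (total[c] + dim));
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;
    }
}
```

This also illustrates the trade-off Robin mentions: a modest `dim` bounds memory nicely for word features, but if you hash all possible collocations/ngrams you need a much larger dimension to keep collisions tolerable, which is why the collocation-vector step still needs a smarter design.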
