Any directions on what pattern I should follow for the redesign?
------
Robin Anil


On Wed, May 9, 2012 at 9:49 AM, Robin Anil <[email protected]> wrote:

> I believe most of this new NB discussion has been over chat. So here is
> the state of the NB universe from my view
>
> 1) Original NB and CNB code was as follows
>     - Tokenize and find all possible collocations
>     - Compute Tf and Idf for each ngram
>     - Compute Global and per class sums for tf, idf and tf-idf
>     - Dump these counts in SequenceFiles
>     - Load these into memory or HBase and compute the score for each
> vector-label combination
>
> Issues
> A) It's slow. Collocation generation, though efficient in its
> implementation (zero memory overhead, using secondary sort), explodes the
> learning time.
> B) It's a memory hog. For really large models you really need HBase to
> store the counts efficiently. The class has a cache for frequently used
> words in the language, so the overhead of classification depends on the
> number of infrequent words in the document and the amount of parallel
> lookups you can do on an HBase cluster.
>
>
> The new NB and CNB code is as follows:
>    - The redesigned naive bayes doesn't work over words. It assumes the
> input is a document vector and computes tf-idf and weights. (This is
> implemented)
>    - The per-class weight vectors are kept in memory and updated, so the
> limiting factor here is the number of classes * number of dimensions. (This
> is implemented)
>    - If the vector space is limited using randomized hashing (Ted's
> technique), then you can limit the space. However, for (all possible)
> ngrams you will need a large dimension, which makes it unusable. (This is
> not done).
>    - So one needs to create collocation vectors smartly (This is not done).
>    - The implementation as of now learns the model, has model
> serialization and deserialization methods, and an interface for classifying
> using the loaded model. (This is implemented)
>
> Issues
> A) It lacks train and test driver code. It just has the core
> implementation.
> B) It is not integrated with the evaluation classes (Confusion Matrix, Per
> label precision/recall)
> C) We need to port the collocations driver to generate collocations and
> convert documents to vectors.
> D) The multilabel classifier is not using any common interface, unlike the
> logistic regression package.
>
> When I checked in the code I didn't have time to pursue this. If someone
> can recommend the right approach to fixing this package (like the right
> interface to use and how it should behave with the rest of the code), it
> becomes easier for me to jump back in and mould the current implementation.
>
> ------
> Robin Anil
>
>
>
> On Wed, May 9, 2012 at 5:48 AM, Grant Ingersoll <[email protected]> wrote:
>
>>
>> On May 8, 2012, at 12:43 PM, Jake Mannix wrote:
>>
>> > On Tue, May 8, 2012 at 9:31 AM, Ted Dunning <[email protected]>
>> wrote:
>> >
>> >> It is frustrating to consider losing Bayes, but I would consider
>> >> keeping it if only to decrease the number of questions on the list
>> >> about why the examples from the book don't work.
>> >>
>> >
>> > Could maybe someone just sit down and rewrite it?  Naive Bayes is not
>> > a particularly difficult thing to implement, even distributed (it's
>> > like word-count, basically. Ok, maybe it's more like counting
>> > collocations, but still!).
>> >
>> > It would be pretty silly to not have an NB impl (although I agree that
>> > it's even worse to have a broken or clunky one).
>>
>> I agree.  The vector-based one is a rewrite, so we probably should just
>> go from there.  Not sure it is broken, but Robin is the primary person
>> familiar with it, and in the past I've pinged the list on the state of it
>> (trying to get explanations of certain parts of it) and not gotten
>> answers.
>
>
>
>
>> With all of these Hadoop algorithms, the other thing we really need is
>> to make them easier to integrate programmatically.  The Driver mode is
>> not too bad for testing, etc., but it makes integration harder, as others
>> have pointed out.
>
>
>
