This is a slightly tricky question when it comes to hashed feature vectors containing data from several fields. Especially in cases with very large feature sets, collisions within a single document are probable even with large feature vectors.
I have toyed with several approaches:

- one way is to count the words in the document and only insert log(TF) in a
cleanup phase. This leads to complexity when you don't get the entire
document at once, but instead get it, say, a line at a time. Concatenating
the lines in memory and then converting them all at once absolutely kills
performance. org.apache.mahout.vectors.TextValueEncoder takes this approach
and provides addText and flush methods. The addToVector method combines
these for convenience if you do happen to have the whole thing handy.

- another way is to convert the document progressively into a single vector
and then take the log of that vector before adding it to the real feature
vector. This avoids the counter table in the value encoder, but can go
pretty wrong in the face of collisions. I didn't like this approach, but it
would be easy to try and I didn't have specific complaints, just a grumbly
feeling.

- one way that will work for 20 newsgroups as handled by the current naive
bayes code, but will not work in general, is to just accumulate data into a
feature vector and then do assign(Functions.LOG) on that feature vector.
This is like the first half of the second approach without the second half.
I don't feel that this is a good approach at all, even if it would be
faster than either of the first two. The major problem is that it makes
multi-field documents impossible to think about.

On Sat, Sep 25, 2010 at 12:46 PM, Robin Anil <[email protected]> wrote:

> Rewrite Question
>
> A key thing that improves the accuracy of naive bayes over text is the
> normalization of the TF vector (V):
>
> new V_i = Log(1 + V_i) / SQRT(Sigma_k(V_k));
>
> AbstractVector already does the L_p norm; does it make sense to add one
> function to do the above normalization? Say logNormalize(double x). I
> will be adding this to PartialVectorMerger (in DictionaryVectorizer). So
> two choices: I can do this in the Vectorizer, or the Vectorizer can call
> this function?
>
> Robin
>
>
> On Sat, Sep 25, 2010 at 10:22 PM, Sean Owen <[email protected]> wrote:
>
> > I think it's fine to do a rewrite at this stage. 0.5 sounds like a
> > nice goal. Just recall that aspects of this will be 'in print' soon, so
> > yeah, you want to a) plan to deprecate rather than remove the old code
> > for some time, and b) make the existing code "forwards compatible" with
> > what you'll do next while you have the chance!
> >
> > On Sat, Sep 25, 2010 at 2:32 PM, Robin Anil <[email protected]> wrote:
> > > Hi, I was in the middle of changing the classifier over to vectors,
> > > and I realized how radically it will change for people using it and
> > > how difficult it is to fit the new interfaces Ted checked in. There
> > > are many components to it, including the HBase stuff, which will take
> > > a lot of time to port. I think it's best to start a from-scratch
> > > rewrite, keeping the old version so that it won't break for existing
> > > users. If that is agreeable, I can complete a new map/reduce +
> > > in-memory classifier in o.a.m.c.naivebayes fitting the interfaces and
> > > deprecate the old bayes package. The new package won't have the full
> > > feature set of the old one for the 0.4 release, but it will be
> > > functional, and hopefully future proof. Let me know your thoughts.
> > >
> > > Robin
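For concreteness, the first approach above (count raw term frequencies first, then fold log(1 + TF) into the hashed vector in one flush) can be sketched roughly as below. This is a hypothetical, simplified sketch, not Mahout's actual TextValueEncoder; the class name, whitespace tokenization, and hashCode-based slot assignment are all placeholder choices:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "count, then log" approach: accumulate raw term frequencies
// per document, then apply log(1 + tf) once per distinct term in a single
// flush. This keeps the weighting correct even when text arrives a line at
// a time, without concatenating the whole document in memory.
public class LogTfEncoder {
    private final Map<String, Integer> counts = new HashMap<>();
    private final double[] vector;

    public LogTfEncoder(int numFeatures) {
        this.vector = new double[numFeatures];
    }

    // Analogous to addText(): just count tokens, do not weight yet.
    public void addText(String line) {
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Analogous to flush(): fold log(1 + tf) into the hashed feature
    // vector, exactly once per distinct term.
    public double[] flush() {
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int slot = Math.floorMod(e.getKey().hashCode(), vector.length);
            vector[slot] += Math.log1p(e.getValue());
        }
        counts.clear();
        return vector;
    }
}
```

Note that colliding terms still add their weights into the same slot, but the intermediate count table guarantees the log is taken of the true per-term count, which is exactly what the second approach above loses.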

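Taking the formula Robin quotes, new V_i = Log(1 + V_i) / SQRT(Sigma_k(V_k)), at face value, the proposed logNormalize could be sketched like this. The class and method names here are hypothetical, and whether this literal reading (dividing by the square root of the sum of the raw counts) is the intended normalization is part of the question being asked:

```java
// Hypothetical sketch of the proposed logNormalize(), reading the quoted
// formula literally: v_i -> log(1 + v_i) / sqrt(sum_k v_k).
public class LogNormalize {
    public static double[] logNormalize(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x;
        }
        double norm = Math.sqrt(sum);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = Math.log1p(v[i]) / norm;
        }
        return out;
    }
}
```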