On Sun, Sep 26, 2010 at 2:05 AM, Ted Dunning <[email protected]> wrote:
> This is a slightly tricky question when it comes to hashed feature vectors
> containing data from several fields. Especially in cases with very large
> feature sets, collisions within a single document are probable even with
> large feature vectors.
>
I agree, that's why I am going to log-normalize only in the dictionary
vectorizer. The function can still exist in AbstractVector:
public Vector logNormalize() {
  // Default: base-2 log, normalized by the L2 norm.
  return logNormalize(2, Math.sqrt(dotSelf()));
}

public Vector logNormalize(double power) {
  return logNormalize(power, norm(power));
}

public Vector logNormalize(double power, double normLength) {
  // log(power) == 0 when power == 1, so any power <= 1 (or infinity)
  // gives a useless or undefined denominator.
  if (Double.isInfinite(power) || power <= 1.0) {
    throw new IllegalArgumentException("Power must be > 1 and < infinity");
  } else {
    double denominator = normLength * Math.log(power);
    Vector result = like().assign(this);
    Iterator<Element> iter = result.iterateNonZero();
    while (iter.hasNext()) {
      Element element = iter.next();
      // log_power(1 + x) / normLength == log(1 + x) / (normLength * log(power))
      element.set(Math.log(1 + element.get()) / denominator);
    }
    return result;
  }
}
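For reference, the arithmetic that logNormalize performs can be sketched standalone on a plain double[]. This is a hypothetical helper, not the Mahout API; it mirrors logNormalize(power, norm(power)) with the default power = 2 giving a base-2 log divided by the L2 norm:

```java
import java.util.Arrays;

public class LogNormalizeSketch {

    // Sketch of logNormalize(power, norm(power)) on a dense array.
    static double[] logNormalize(double[] v, double power) {
        double norm = 0.0;
        for (double x : v) {
            norm += Math.pow(Math.abs(x), power);
        }
        norm = Math.pow(norm, 1.0 / power);          // L_p norm of v
        double denominator = norm * Math.log(power); // normLength * ln(power)
        double[] result = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            if (v[i] != 0.0) {                       // mirrors iterateNonZero()
                result[i] = Math.log1p(v[i]) / denominator;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[] tf = {3.0, 0.0, 4.0};               // toy term-frequency vector
        // L2 norm of {3, 0, 4} is 5, so the denominator is 5 * ln 2,
        // and the first element becomes ln(4) / (5 * ln 2) = 0.4 exactly.
        System.out.println(Arrays.toString(logNormalize(tf, 2.0)));
    }
}
```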
I currently call it in the tf or idf job at the end, when merging the partial
vectors. This eliminates the feature-counting and tf-idf jobs in naive
Bayes. Now all I need is to port the weight-summer and weight-normalization
jobs: just two jobs to create the model from tf-idf vectors.
Or:

Naive Bayes can generate the model from the vectors produced by the hashed
feature vectorizer. Multi-field documents can generate a word feature =
Field + Word, and either the dictionary vectorizer or the hashed feature
vectorizer can convert that to vectors.
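The "word feature = Field + Word" idea combined with hashing can be sketched as follows. The class, the ":" separator, and the use of String.hashCode are all illustrative assumptions, not Mahout's actual vectorizer:

```java
public class FieldWordHash {

    // Map a field-qualified word into one of 2^bits buckets.
    static int bucket(String field, String word, int bits) {
        String feature = field + ":" + word; // field-qualified feature name
        int mask = (1 << bits) - 1;          // keep only the low 'bits' bits
        return feature.hashCode() & mask;    // collisions are simply tolerated
    }

    public static void main(String[] args) {
        // The same word in two fields usually lands in different buckets,
        // so "mahout" in the title and "mahout" in the body stay distinct.
        System.out.println(bucket("title", "mahout", 16));
        System.out.println(bucket("body", "mahout", 16));
    }
}
```

With more bits the bucket space grows and collisions become rarer, which is the knob referred to below.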
I say let there be collisions. Increasing the number of bits decreases the
collision rate, and VW takes that approach. Let the people who worry
increase the number of bits :)
Robin
> On Sat, Sep 25, 2010 at 12:46 PM, Robin Anil <[email protected]> wrote:
>
> > Rewrite Question
> >
> > A key thing that improves the accuracy of naive Bayes over text is this
> > normalization of the TF vector (V):
> >
> > new V_i = Log(1 + V_i) / SQRT(Sigma_k(V_k^2));
> >
> > AbstractVector already does the L_p norm; does it make sense to add one
> > function to do the above normalization? Say logNormalize(double x). I
> > will be adding this to the PartialVectorMerger (in DictionaryVectorizer).
> > So, two choices: I can do this in the vectorizer, or the vectorizer can
> > call this function?
> >
> >
> >
> > Robin
> >
> >
> > On Sat, Sep 25, 2010 at 10:22 PM, Sean Owen <[email protected]> wrote:
> >
> > > I think it's fine to do a rewrite at this stage. 0.5 sounds like a
> > > nice goal. Just recall that aspects of this will be 'in print' soon so
> > > yeah you want to a) plan to deprecate rather than remove the old code
> > > for some time, b) make the existing code "forwards compatible" with
> > > what you'll do next while you have the chance!
> > >
> > > On Sat, Sep 25, 2010 at 2:32 PM, Robin Anil <[email protected]>
> > wrote:
> > > > Hi, I was in the middle of changing the classifier over to vectors,
> > > > and I realized how radically it will change for people using it and
> > > > how difficult it is to fit the new interfaces Ted checked in. There
> > > > are many components to it, including the HBase stuff, which will take
> > > > a lot of time to port. I think it's best to rewrite it from scratch,
> > > > keeping the old version so that it won't break for users using it. If
> > > > that is agreeable, I can complete a new map/reduce + in-memory
> > > > classifier in o.a.m.c.naivebayes fitting the interfaces and deprecate
> > > > the old bayes package. The new package won't have the full feature
> > > > set of the old one for the 0.4 release, but it will be functional
> > > > and, hopefully, future-proof. Let me know your thoughts.
> > > >
> > > > Robin
> > > >
> > >
> >
>