On Sat, Sep 25, 2010 at 1:57 PM, Robin Anil <[email protected]> wrote:

> I currently call it in the tf job or the idf job at the end, when merging
> the partial vectors. This throws away the feature counting and tfidf jobs
> in naive bayes. Now all I need is to port the weight summer and weight
> normalization jobs: just two jobs to create the model from tfidf vectors.
>

Reasonable approach.  With the sgd code, I avoid an IDF computation by using
an annealed per-term feature learning rate.

If this annealing goes as 1/n, where n is the number of occurrences of the
term seen so far, the final sum is ~ log N, where N is the total number of
occurrences.  That saves a pass through the data, which is critical when you
are doing online learning.
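A rough sketch of that per-term 1/n annealing (illustrative Java only -- not
the actual Mahout sgd code; the class and method names here are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only -- not the actual Mahout sgd code.
public class PerTermAnnealing {
    // how many times each term has been seen so far
    private final Map<String, Integer> counts = new HashMap<>();

    // per-term learning rate that anneals as 1/n
    public double learningRate(String term) {
        int n = counts.merge(term, 1, Integer::sum);
        return 1.0 / n;
    }

    public static void main(String[] args) {
        PerTermAnnealing annealer = new PerTermAnnealing();
        double sum = 0;
        for (int i = 0; i < 10000; i++) {
            sum += annealer.learningRate("the");
        }
        // the harmonic sum 1 + 1/2 + ... + 1/N grows like ln N + 0.5772,
        // which is why the total update per term ends up ~ log N
        System.out.println(sum);                      // ~ 9.79
        System.out.println(Math.log(10000) + 0.5772); // ~ 9.79
    }
}
```

The point is that the accumulated per-term step sizes converge to the same
log-scale damping an IDF weight would give, without a separate pass.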


>
> Or:
>
> Naive bayes can generate the model from the vectors produced by the
> hashed feature vectorizer.
>
> Multi-field documents can generate a word feature = Field + Word, and use
> the dictionary vectorizer or hashed feature vectorizer to convert that to
> vectors. I say let there be collisions. Since increasing the number of
> bits decreases collisions, VW takes that approach. Let the people who
> worry increase the number of bits :)
>

I also provide the ability to probe the vector more than once.  This makes
smaller vectors much more usable, in the same way that Bloom filters can use
smaller, over-filled bit vectors.
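A toy sketch of that multi-probe feature hashing (illustrative Java; the
names and the hash mixing here are my own, not the actual Mahout or VW
code):

```java
import java.util.Arrays;

// Illustrative sketch only -- not the Mahout or VW implementation.
public class HashedVectorizer {
    // Hash field:word features into a 2^bits vector, probing `probes`
    // slots per feature (Bloom-filter style) so a single collision
    // rarely wipes out a feature entirely.
    public static double[] vectorize(String[][] fieldWordPairs,
                                     int bits, int probes) {
        int size = 1 << bits;
        double[] v = new double[size];
        for (String[] fw : fieldWordPairs) {
            // word feature = Field + Word
            String feature = fw[0] + ":" + fw[1];
            for (int p = 0; p < probes; p++) {
                // mix the probe index into the hash so each probe lands
                // in a different slot
                int h = (feature.hashCode() * 31 + p) & (size - 1);
                v[h] += 1.0 / probes; // total weight per feature stays 1
            }
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = vectorize(
            new String[][] {{"title", "mahout"}, {"body", "mahout"}}, 4, 2);
        System.out.println(Arrays.toString(v));
    }
}
```

Increasing bits grows the vector and cuts collisions; increasing probes
trades a little extra work for making small, over-filled vectors usable.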

In production, we do see a few cases of collisions when inspecting the
dissected models, but only very rarely.
