Setting numFeatures higher than the vocabulary size tends to reduce the chance of hash collisions, but it isn't strictly necessary: it becomes a memory/accuracy trade-off.
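To make the trade-off concrete, here's a minimal pure-Python sketch of the hashing trick that HashingTF is based on. It is illustrative only: the function name `hashing_tf` and the use of MD5 as a stable hash are my own choices, not Spark's actual implementation.

```python
import hashlib

def hashing_tf(tokens, num_features):
    """Map tokens to a fixed-size term-frequency vector via the hashing trick.

    Illustrative sketch only; Spark's HashingTF uses its own hash function.
    """
    vec = [0] * num_features
    for token in tokens:
        # Stable hash of the token, reduced modulo the vector size.
        # Distinct tokens can land in the same bucket (a collision);
        # the larger num_features is, the rarer that becomes.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % num_features
        vec[idx] += 1
    return vec

# Small num_features: more collisions, less memory.
small = hashing_tf(["spark", "hashing", "tf", "spark"], 16)
# num_features well above vocab size: collisions unlikely, bigger vectors.
large = hashing_tf(["spark", "hashing", "tf", "spark"], 1 << 20)
```

Note that because the mapping is just a hash, you never need to know the vocabulary in advance, which is the whole point of the approach.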
Surprisingly, moderate numbers of hash collisions often have no significant impact on model performance. So it may be worth trying a few settings (below the vocabulary size, above it, and so on) and seeing what the impact is on your evaluation metrics.

On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li <flyingfromch...@gmail.com> wrote:

> Hi,
>
> There is a parameter in HashingTF called "numFeatures". I was wondering
> what is the best way to set the value of this parameter. In the use case of
> text categorization, do you need to know in advance the number of words in
> your vocabulary? Or do you set it to a large value, greater than the
> number of words in your vocabulary?
>
> Thanks,
> Jianguo