Setting numFeatures higher than the vocabulary size will tend to reduce the 
chance of hash collisions, but it's not strictly necessary - it becomes a 
memory/accuracy trade-off.
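
For concreteness, here is a rough sketch with the RDD-based 
org.apache.spark.mllib.feature.HashingTF (sc is assumed to be your 
SparkContext, and the two toy documents just stand in for your tokenized 
corpus):

    import org.apache.spark.mllib.feature.HashingTF

    // Toy corpus; in practice this would be your own tokenized documents
    // (an RDD[Seq[String]]).
    val docs = sc.parallelize(Seq(
      Seq("spark", "hashing", "tf"),
      Seq("text", "categorization", "with", "spark")
    ))

    // Fewer buckets than the vocabulary forces some collisions; more buckets
    // reduce them, at the cost of longer (sparser) vectors and more memory
    // downstream.
    val hashingTF = new HashingTF(numFeatures = 1 << 18) // 262,144 buckets
    val tf = hashingTF.transform(docs)                   // RDD[Vector]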

Surprisingly, the impact of moderate hash collisions on model performance is 
often not significant.

So it may be worth trying out a few settings (lower than the vocabulary size, 
higher, etc.) and seeing what the impact is on your evaluation metrics.
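
Something along these lines (just a sketch - "tokens" is assumed to be an 
RDD[(Double, Seq[String])] of (label, tokenized document) pairs from your own 
preprocessing, and NaiveBayes is only a stand-in for whatever classifier and 
metric you actually use):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Held-out accuracy for one choice of numFeatures.
    def accuracyFor(numFeatures: Int, tokens: RDD[(Double, Seq[String])]): Double = {
      val tf = new HashingTF(numFeatures)
      val data = tokens.map { case (label, words) =>
        LabeledPoint(label, tf.transform(words))
      }
      val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
      val model = NaiveBayes.train(train)
      test.map(p => if (model.predict(p.features) == p.label) 1.0 else 0.0).mean()
    }

    // Compare bucket counts below, around, and above your vocabulary size.
    Seq(1 << 14, 1 << 16, 1 << 18, 1 << 20).foreach { n =>
      println(s"numFeatures=$n  accuracy=${accuracyFor(n, tokens)}")
    }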


On Thu, Oct 15, 2015 at 5:46 PM, Jianguo Li <flyingfromch...@gmail.com>
wrote:

> Hi,
> There is a parameter in HashingTF called "numFeatures". I was wondering
> what the best way is to set the value of this parameter. In the use case of
> text categorization, do you need to know in advance the number of words in
> your vocabulary? Or do you set it to a large value, greater than the
> number of words in your vocabulary?
> Thanks,
> Jianguo
