2013/8/3 Olivier Grisel <[email protected]>:
> It looks like a bug but I cannot reproduce it:
>
> >>> from sklearn.feature_extraction.text import HashingVectorizer
> >>> vec = HashingVectorizer(n_features=5, binary=True, norm=None)
> >>> vec.transform(['this simple test']).toarray()
> array([[ 0.,  1.,  1.,  0.,  1.]])
> >>> vec.transform(['this simple fest']).toarray()
> array([[ 0.,  1.,  1.,  0.,  0.]])
>
> In this case I have a single collision.

I get the same results as the OP:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vec = HashingVectorizer(n_features=5, binary=False, norm=None)
>>> vec.transform(['this simple test']).toarray()
array([[ 0., -1.,  1.,  0., -1.]])
>>> vec.transform(['this simple fest']).toarray()
array([[ 0.,  0.,  1.,  0.,  0.]])

This actually makes sense, since the feature hasher uses a signed hash
function to reduce the effect of collisions. The algorithm is detailed
in the docs [1], but in short: the hash gives each token a sign of +1
or -1 as well as a column, so when two tokens that each occur once
collide in the same column, there is a 50% probability that their
signs differ and they cancel out, giving zero in that output column.
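
To make the cancellation concrete, here is a minimal sketch of signed
hashing in plain Python. It is not scikit-learn's implementation: the
(column, sign) table is hypothetical, filled in by hand so that the
result matches the arrays above, and the names TOY_HASH and
signed_hash_vector are made up for this illustration. The real values
come from the hash function inside the vectorizer.

```python
# Hypothetical (column, sign) assignments, chosen by hand to reproduce
# the arrays shown above. These are NOT the real hash values.
TOY_HASH = {
    "this": (4, -1.0),
    "simple": (2, +1.0),
    "test": (1, -1.0),
    "fest": (4, +1.0),  # same column as "this", opposite sign
}


def signed_hash_vector(tokens, n_features=5, table=TOY_HASH):
    """Add each token's sign to the column its (toy) hash selects."""
    vec = [0.0] * n_features
    for tok in tokens:
        column, sign = table[tok]
        vec[column] += sign
    return vec


# No collision: three tokens land in three different columns.
print(signed_hash_vector("this simple test".split()))
# "fest" collides with "this" with the opposite sign, so column 4 is zero.
print(signed_hash_vector("this simple fest".split()))
```

With these assumed assignments the first call gives one nonzero entry
per token, while in the second the colliding pair sums to zero, which
is why only a single nonzero column is left.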

Olivier, what result do you get with binary=False? That shows more
clearly where the collisions occur.

[1] http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
