2013/8/3 Olivier Grisel <[email protected]>:
> It looks like a bug but I cannot reproduce it:
>
> >>> from sklearn.feature_extraction.text import HashingVectorizer
> >>> vec = HashingVectorizer(n_features=5, binary=True, norm=None)
> >>> vec.transform(['this simple test']).toarray()
> array([[ 0., 1., 1., 0., 1.]])
> >>> vec.transform(['this simple fest']).toarray()
> array([[ 0., 1., 1., 0., 0.]])
>
> In this case I have a single collision.

I get the same results as the OP:

>>> vec = HashingVectorizer(n_features=5, binary=False, norm=None)
>>> vec.transform(['this simple test']).toarray()
array([[ 0., -1., 1., 0., -1.]])
>>> vec.transform(['this simple fest']).toarray()
array([[ 0., 0., 1., 0., 0.]])

This actually makes sense, since the feature hasher uses a signed hash
function to eliminate some of the collisions. The algorithm is detailed
in the docs [1], but in short, the signed hash function gives a 50%
probability that collisions cancel out, giving zero in the output column.

Olivier, what result do you get with binary=False? That shows more
clearly where the collisions occur.

[1] http://scikit-learn.org/stable/modules/feature_extraction.html#feature-hashing

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
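
For readers who want to see the cancellation effect in isolation, here is a
minimal toy sketch of signed feature hashing. It is not scikit-learn's
implementation: the function name toy_signed_hash_vector and the hash values
in TOY_HASHES are invented for illustration (they are not real hash outputs),
chosen only so that the toy reproduces the pattern in the binary=False output
above. The assumption is simply that each token's hash picks a column and a
sign, and that signed contributions are summed per column.

```python
# Toy signed feature hashing. The hash values below are made up for
# illustration; they are NOT what a real hash function would return.
TOY_HASHES = {"this": 2, "simple": -1, "test": -4, "fest": 6}


def toy_signed_hash_vector(text, n_features=5, hashes=TOY_HASHES):
    """Sum a +1/-1 contribution per token into the column chosen by its hash."""
    row = [0] * n_features
    for token in text.split():
        h = hashes[token]
        index = abs(h) % n_features  # output column for this token
        sign = 1 if h >= 0 else -1   # sign derived from the hash value
        row[index] += sign
    return row


# "test" lands alone in column 4, so all three tokens stay visible.
print(toy_signed_hash_vector("this simple test"))  # [0, -1, 1, 0, -1]

# "fest" collides with "simple" in column 1 with the opposite sign,
# so the two contributions cancel and the column reads zero.
print(toy_signed_hash_vector("this simple fest"))  # [0, 0, 1, 0, 0]
```

In this toy, a zero column does not mean that no token was hashed there; it
can also mean that two colliding tokens happened to carry opposite signs.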
