Hi!

I'm having a bit of trouble understanding how the HashingVectorizer in
sklearn works.

In the following example, I vectorize two text documents with 3 words each.
As I understand it, each of the three words should be extracted as a feature
and then hashed to a position in the output vector.
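
To make that mental model concrete, here is a minimal sketch in plain Python
of what I expect to happen. It is not sklearn's implementation:
naive_hash_vectorize and bucket are hypothetical helpers of mine, and MD5
merely stands in for whatever hash function the vectorizer really uses, so
the indices it produces will not match sklearn's.

```python
import hashlib
import re

# Same token pattern as shown in the vectorizer's repr: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def bucket(token, n_features):
    """Map a token to a column index (MD5 is a stand-in hash, an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def naive_hash_vectorize(text, n_features=5):
    """Count tokens per hashed bucket, with no sign handling at all."""
    vec = [0.0] * n_features
    for token in TOKEN_PATTERN.findall(text.lower()):
        vec[bucket(token, n_features)] += 1.0
    return vec


print(naive_hash_vectorize("this simple test"))
print(naive_hash_vectorize("this simple fest"))
```

Under this model every token adds 1 to exactly one slot, so the entries
always sum to the number of tokens (3 here), whether or not two tokens
collide.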


In [133]: hasher
Out[133]:
HashingVectorizer(analyzer='word', binary=False, charset='utf-8',
         charset_error='strict', dtype=<type 'numpy.float64'>,
         input='content', lowercase=True, n_features=5, ngram_range=(1, 1),
         non_negative=True, norm=None, preprocessor=None, stop_words=None,
         strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
         tokenizer=None)

In [134]: hasher.transform(["this simple test"]).todense()
Out[134]: matrix([[ 0.,  1.,  1.,  0.,  1.]])

In [135]: hasher.transform(["this simple fest"]).todense()
Out[135]: matrix([[ 0.,  0.,  1.,  0.,  0.]])


As we can see, in the first example, "this simple test", the three features
are hashed to the indices 1, 2, and 4, so there seem to be no hash
collisions. However, hashing "this simple fest" (in which only one word is
different) produces an output vector with only one nonzero entry. It could
be that "fest" collides with one of the other two words, but as I understand
it, that would still leave two nonzero entries, so one feature is missing
from the output vector.
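
One possible explanation, offered as a guess rather than a confirmed
diagnosis: if the hash also supplies a sign for each token, and
non_negative=True only takes absolute values after the signed values have
been summed, then two colliding tokens with opposite signs would cancel to
zero. The sketch below uses a hand-picked, hypothetical table of buckets and
signs (ASSUMED_HASH is not taken from sklearn) to show that such a
cancellation would reproduce both outputs above.

```python
# Hypothetical (bucket, sign) assignments, chosen by hand so that they
# reproduce the two outputs in this post; they are NOT taken from sklearn.
ASSUMED_HASH = {
    "this": (2, +1),
    "simple": (1, +1),
    "test": (4, +1),
    "fest": (1, -1),  # same bucket as "simple", opposite sign
}


def signed_hash_vectorize(tokens, n_features=5):
    """Sum signed contributions per bucket, then take absolute values."""
    vec = [0.0] * n_features
    for token in tokens:
        index, sign = ASSUMED_HASH[token]
        vec[index] += sign
    # Assumption: non_negative=True is applied only after summing.
    return [abs(v) for v in vec]


print(signed_hash_vectorize(["this", "simple", "test"]))  # [0.0, 1.0, 1.0, 0.0, 1.0]
print(signed_hash_vectorize(["this", "simple", "fest"]))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

If that is what happens, "fest" and the word it collides with would both
vanish from the vector, which would account for the single remaining entry.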

So I'm wondering whether I have misunderstood how the hashing vectorizer
works or whether I'm using the class incorrectly.

Thanks for any help,
Tobias
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
