Note that both HashingTF and CountVectorizer are usually used to create
TF-IDF vectors. The definition of term frequency in TF-IDF
(https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) is actually the
"number of times the term occurs in the document".
So it's perhaps a bit of a misnomer, but the raw count is exactly what the
standard definition calls for.
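
For concreteness, a minimal sketch of that usual pipeline, assuming an active
SparkSession named "spark" (the data and column names are illustrative).
Neither step divides by the document length:

import org.apache.spark.ml.feature.{HashingTF, IDF}

val df = spark.createDataFrame(Seq(
  (0, Seq("spark", "spark", "hadoop")),
  (1, Seq("spark", "flink"))
)).toDF("id", "words")

// TF step: raw term counts, hashed into a fixed-size vector
val tf = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .transform(df)

// IDF step: rescales the raw counts by inverse document frequency;
// nothing here divides by the document length
val features = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .fit(tf)
  .transform(tf)

features.select("features").show(false)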
Hi Hao,
HashingTF directly applies a hash function (MurmurHash3) to each term to
determine its column index. The index computation takes no account of the
term frequency or the length of the document. It does similar work to
sklearn's FeatureHasher. The result is increased speed and reduced memory
usage.
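
To make the hashing visible, here is a small sketch using the RDD-based
HashingTF (org.apache.spark.mllib.feature.HashingTF), whose indexOf helper
exposes the computed column index; the DataFrame-based version behaves the
same way:

import org.apache.spark.mllib.feature.HashingTF

// No vocabulary is ever built: the column index of a term is simply
// hash(term) mod numFeatures, which is why it is fast and memory-light.
val tf = new HashingTF(numFeatures = 1 << 18)
println(tf.indexOf("spark"))                        // index from the hash alone
println(tf.transform(Seq("spark", "spark", "hadoop")))
// sparse vector with 2.0 at indexOf("spark") and 1.0 at indexOf("hadoop"):
// raw counts, untouched by the document length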
When computing term frequency, we can use either the HashingTF or the
CountVectorizer feature extractor.
However, both of them just use the number of times a term appears in a
document.
That is not a true frequency. Actually, the count should be divided by the
length of the document.
Is this the intended behavior?
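
A minimal sketch of one possible workaround, chaining an L1 Normalizer after
the count step; for a count vector the L1 norm equals the document length,
so the result is a vector of true frequencies (an active SparkSession named
"spark" is assumed, and the data is illustrative):

import org.apache.spark.ml.feature.{CountVectorizer, Normalizer}

val df = spark.createDataFrame(Seq(
  (0, Seq("a", "a", "b"))
)).toDF("id", "words")

val counts = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("counts")
  .fit(df)
  .transform(df)

// Dividing each count vector by its L1 norm divides every entry by the
// total number of terms in the document, i.e. yields true frequencies.
new Normalizer()
  .setInputCol("counts")
  .setOutputCol("tf")
  .setP(1.0)
  .transform(counts)
  .select("tf")
  .show(false)   // "a" -> 2/3, "b" -> 1/3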