Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition (https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is in fact the "number of times the term occurs in the document".
So the name is perhaps a bit of a misnomer, but the implementation is correct.
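For concreteness, a minimal sketch of the usual pairing, assuming the Spark 2.x spark.ml API (the column names here are made up): HashingTF emits raw term counts, and IDF rescales them into TF-IDF.

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tfidf-sketch").getOrCreate()
import spark.implicits._

val docs = Seq("spark spark mllib", "feature hashing example").toDF("text")

val tokens = new Tokenizer()
  .setInputCol("text").setOutputCol("words")
  .transform(docs)

val tf = new HashingTF()
  .setInputCol("words").setOutputCol("rawTF").setNumFeatures(1 << 18)
  .transform(tokens)
// "rawTF" holds raw counts: the vector for "spark spark mllib" has a 2.0
// at the hashed index of "spark", not 2/3.

val tfidf = new IDF()
  .setInputCol("rawTF").setOutputCol("tfidf")
  .fit(tf)
  .transform(tf)
tfidf.select("tfidf").show(truncate = false)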
On Tue, 2 Aug 2016 at 05:44 Yanbo Liang <yblia...@gmail.com> wrote:

> Hi Hao,
>
> HashingTF directly applies a hash function (MurmurHash3) to the features
> to determine their column index. It takes no account of term frequency or
> document length. It does similar work to sklearn's FeatureHasher. The
> result is increased speed and reduced memory usage, but it does not
> remember what the input features looked like and cannot convert the
> output back to the original features. Actually, we misnamed this
> transformer: it only does feature hashing rather than computing a hashed
> term frequency.
>
> CountVectorizer selects the top vocabSize words, ordered by term
> frequency across the corpus, to build its vocabulary of features, so it
> consumes more memory than HashingTF. However, we can convert its output
> back to the original features.
>
> Neither transformer considers the length of each document. If you want to
> compute term frequency divided by document length, you should write your
> own function based on the transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren <inv...@gmail.com>:
>
>> When computing term frequency, we can use either the HashingTF or the
>> CountVectorizer feature extractor.
>> However, both of them just use the number of times a term appears in a
>> document.
>> That is not a true frequency; it should be divided by the length of the
>> document.
>>
>> Is this the intended behavior?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
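Regarding Yanbo's suggestion above to write your own function for length-normalized term frequency, here is a rough sketch, again assuming the Spark 2.x spark.ml API; the toTrueTF UDF is a hypothetical helper, not part of MLlib.

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val docs = Seq("spark spark mllib").toDF("text")
val tokens = new Tokenizer()
  .setInputCol("text").setOutputCol("words")
  .transform(docs)
val counts = new HashingTF()
  .setInputCol("words").setOutputCol("rawTF")
  .transform(tokens)

// Hypothetical helper: divide each raw count by the number of tokens in
// the document, keeping the vector sparse.
val toTrueTF = udf { (v: Vector, words: Seq[String]) =>
  val sv = v.toSparse
  val n = math.max(words.size, 1).toDouble
  Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / n))
}

val trueTF = counts.withColumn("tf", toTrueTF($"rawTF", $"words"))
trueTF.select("tf").show(truncate = false)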