Hi Hao,

HashingTF directly applies a hash function (MurmurHash3) to the features to determine their column index. It takes no account of the term frequency or the length of the document. It does similar work to sklearn's FeatureHasher. The result is increased speed and reduced memory usage, but it does not remember what the input features looked like and cannot convert the output back to the original features. Actually, we misnamed this transformer: it only does the work of feature hashing rather than computing a hashed term frequency.
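To make the idea concrete, here is a minimal, dependency-free sketch of feature hashing: each term is mapped to a column index by a hash function and the bucket is incremented. Spark's HashingTF uses MurmurHash3; zlib.crc32 stands in here just to keep the example self-contained. Note that the mapping is one-way: given only the bucket indices, you cannot recover the original terms.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=16):
    """Map each term to a column index via a hash function and count
    occurrences per bucket. Spark's HashingTF uses MurmurHash3;
    zlib.crc32 is a stand-in so the sketch stays dependency-free."""
    vec = Counter()
    for term in terms:
        idx = zlib.crc32(term.encode("utf-8")) % num_features
        vec[idx] += 1
    return dict(vec)

doc = ["spark", "is", "fast", "spark"]
print(hashing_tf(doc))
```

Because no vocabulary is stored, two different terms can collide into the same bucket, which is the price paid for the speed and memory savings.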
CountVectorizer selects the top vocabSize words, ordered by term frequency across the corpus, to build the vocabulary of the features, so it consumes more memory than HashingTF. However, its output can be converted back to the original features. Neither transformer considers the length of each document. If you want to compute term frequency divided by the length of the document, you should write your own function based on the transformers provided by MLlib.

Thanks
Yanbo

2016-08-01 15:29 GMT-07:00 Hao Ren <inv...@gmail.com>:
> When computing term frequency, we can use either HashingTF or CountVectorizer
> feature extractors.
> However, both of them just use the number of times that a term appears in
> a document.
> It is not a true frequency. Actually, it should be divided by the length
> of the document.
>
> Is this a wanted feature?
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
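For reference, the "write your own function" suggestion could look like the following plain-Python sketch: build a CountVectorizer-style vocabulary of the top-k terms by corpus frequency, then divide each document's term counts by its length to get the "true" frequency the original question asks about. (Function names and the toy corpus are my own; in Spark you would apply the same division to the output vector of CountVectorizer or HashingTF.)

```python
from collections import Counter

def fit_vocab(corpus, vocab_size):
    """Pick the top vocab_size terms by total count across the corpus,
    mimicking CountVectorizer's vocabulary selection."""
    totals = Counter(t for doc in corpus for t in doc)
    return {term: i for i, (term, _) in enumerate(totals.most_common(vocab_size))}

def true_tf(doc, vocab):
    """Count vocabulary terms in one document and divide by the document
    length, yielding term frequency normalized by document length."""
    counts = Counter(t for t in doc if t in vocab)
    n = len(doc)
    return {vocab[t]: c / n for t, c in counts.items()}

corpus = [["a", "b", "a"], ["b", "c"]]
vocab = fit_vocab(corpus, 2)
print(true_tf(corpus[0], vocab))  # term "a" gets 2/3, term "b" gets 1/3
```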