Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating
TF-IDF normalized vectors. The definition (
https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency
in TF-IDF is actually the "number of times the term occurs in the document".

So it's perhaps a bit of a misnomer, but the implementation is correct.
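
To make that concrete, here is a minimal sketch against the spark.ml API
(assuming a SparkSession in scope as "spark"; the data and column names are
invented for the example):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Invented two-document corpus; "spark" occurs twice in the first document.
val docs = spark.createDataFrame(Seq(
  (0, "spark spark hashing tf"),
  (1, "count vectorizer")
)).toDF("id", "text")

val words = new Tokenizer()
  .setInputCol("text").setOutputCol("words")
  .transform(docs)

// The "term frequency" vector holds raw counts: the slot for "spark"
// is 2.0, not 2.0 / 4, matching the definition quoted above.
val tf = new HashingTF()
  .setInputCol("words").setOutputCol("rawTF")
  .transform(words)

// IDF then rescales the raw counts into TF-IDF weights.
val tfidf = new IDF()
  .setInputCol("rawTF").setOutputCol("tfidf")
  .fit(tf)
  .transform(tf)

tfidf.select("rawTF", "tfidf").show(truncate = false)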

On Tue, 2 Aug 2016 at 05:44 Yanbo Liang  wrote:

> Hi Hao,
>
> HashingTF directly applies a hash function (MurmurHash3) to the terms to
> determine their column index. It takes no account of term frequency across
> the corpus or of the length of each document. It does similar work to
> sklearn's FeatureHasher. The result is increased speed and reduced memory
> usage, but it does not remember what the input features looked like and
> cannot convert the output back to the original features. We actually
> misnamed this transformer: it only does feature hashing rather than
> computing a hashed term frequency.
>
> CountVectorizer will select the top vocabSize words, ordered by term
> frequency across the corpus, to build the vocabulary of features, so it
> consumes more memory than HashingTF. However, its output can be converted
> back to the original features.
>
> Neither transformer considers the length of each document. If you want
> term frequency divided by the length of the document, you should write
> your own function based on the transformers provided by MLlib.
>
> Thanks
> Yanbo
>
> 2016-08-01 15:29 GMT-07:00 Hao Ren :
>
>> When computing term frequency, we can use either the HashingTF or
>> CountVectorizer feature extractor.
>> However, both of them just use the number of times a term appears in a
>> document. That is not a true frequency; actually, it should be divided by
>> the length of the document.
>>
>> Is this the intended behavior?
>>
>> --
>> Hao Ren
>>
>> Data Engineer @ leboncoin
>>
>> Paris, France
>>
>
>


Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao,

HashingTF directly applies a hash function (MurmurHash3) to the terms to
determine their column index. It takes no account of term frequency across
the corpus or of the length of each document. It does similar work to
sklearn's FeatureHasher. The result is increased speed and reduced memory
usage, but it does not remember what the input features looked like and
cannot convert the output back to the original features. We actually
misnamed this transformer: it only does feature hashing rather than
computing a hashed term frequency.
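
Conceptually, the index assignment looks like the sketch below. This is a
simplified illustration, not Spark's exact code (Spark hashes the term's
bytes with a seeded MurmurHash3), but the idea is the same:

import scala.util.hashing.MurmurHash3

// The hashing trick in miniature: a term maps directly to a column index.
// No dictionary is kept, so collisions are possible and the mapping
// cannot be inverted back to the original term.
def termIndex(term: String, numFeatures: Int): Int = {
  val h = MurmurHash3.stringHash(term)
  ((h % numFeatures) + numFeatures) % numFeatures  // non-negative modulo
}

termIndex("spark", 1 << 18)  // some fixed index in [0, 262144)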

CountVectorizer will select the top vocabSize words, ordered by term
frequency across the corpus, to build the vocabulary of features, so it
consumes more memory than HashingTF. However, its output can be converted
back to the original features.
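
For example (a sketch; tokenizedDocs stands for any DataFrame with an
array-of-strings column named "words"):

import org.apache.spark.ml.feature.CountVectorizer

// Fit builds the vocabulary from the corpus, keeping the top vocabSize
// terms by corpus-wide frequency.
val model = new CountVectorizer()
  .setInputCol("words").setOutputCol("counts")
  .setVocabSize(10000)
  .fit(tokenizedDocs)

// Index i of the output vector corresponds to model.vocabulary(i),
// which is what makes the reverse mapping possible.
println(model.vocabulary.take(10).mkString(", "))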

Neither transformer considers the length of each document. If you want
term frequency divided by the length of the document, you should write
your own function based on the transformers provided by MLlib.
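
One lightweight way to do that (a sketch, assuming the raw counts from
HashingTF or CountVectorizer are in a column named "rawTF" of a DataFrame
tf): L1-normalize the vectors, since for raw counts the L1 norm is exactly
the document's token count (for CountVectorizer, tokens outside the
vocabulary are not counted):

import org.apache.spark.ml.feature.Normalizer

// Dividing each count by the vector's L1 norm (the sum of all counts)
// yields count / document length, i.e. a relative term frequency.
val relativeTF = new Normalizer()
  .setInputCol("rawTF").setOutputCol("relativeTF")
  .setP(1.0)
  .transform(tf)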

Thanks
Yanbo

2016-08-01 15:29 GMT-07:00 Hao Ren :

> When computing term frequency, we can use either the HashingTF or
> CountVectorizer feature extractor.
> However, both of them just use the number of times a term appears in a
> document. That is not a true frequency; actually, it should be divided by
> the length of the document.
>
> Is this the intended behavior?
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>