2013/11/29 Andreas Hjortgaard Danielsen :
> Hi,
>
> It might be worth noting that Lucene uses the same implementation:
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
Same as what? The current master or @larsmans' suggested fix?
> And Gensim has an option for choosing an addition constant (although the
> default is 0).
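For anyone following along, here is a minimal sketch of the two idf variants being compared; the exact smoothing in scikit-learn's code may differ, so treat the formulas as illustrative:

```python
import numpy as np

def idf_with_plus_one(df, n_docs):
    # the variant with the +1: a term that occurs in every
    # document (df == n_docs) still gets weight 1.0 instead of 0
    return np.log(n_docs / df) + 1.0

def idf_plain(df, n_docs):
    # textbook idf: a term that occurs in every document gets weight 0
    return np.log(n_docs / df)

df = np.array([20.0, 10.0, 1.0])    # document frequencies in a 20-doc corpus
print(idf_with_plus_one(df, 20.0))  # ubiquitous term keeps weight 1.0
print(idf_plain(df, 20.0))          # ubiquitous term drops to 0.0
```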
Hi,
It might be worth noting that Lucene uses the same implementation:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
And Gensim has an option for choosing an addition constant (although the
default is 0).
https://github.com/piskvorky/gensim/bl
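For comparison, gensim exposes the idf computation through its wglobal hook; a sketch, assuming gensim's df2idf helper and its add argument (the constant mentioned above; double-check the names against your gensim version):

```python
from functools import partial

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.tfidfmodel import df2idf

docs = [["common", "rare"], ["common"], ["common", "other"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# default add=0: "common" occurs in every document, so its idf is 0
default_model = TfidfModel(corpus)

# add=1.0: ubiquitous terms keep a nonzero weight, as with the +1 above
plus_one_model = TfidfModel(corpus, wglobal=partial(df2idf, add=1.0))
```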
Alright! By removing the +1 the results seem much more legit.
Also, the sublinear transformation makes sense. However, why use min_df=2 if I
am worried about very common words?
-----Original Message-----
From: Lars Buitinck [mailto:larsm...@gmail.com]
Sent: Friday, 29 November 2013
> Anyway, if you're worried about very common words, try setting
> min_df=2, and if you have a few long documents, try sublinear_tf=True.
> That replaces tf with 1 + log(tf) so repeated occurrences of a word
> get penalized.
To trim words that occur in more than 90% of the documents, `max_df=0.9`
works great too.
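Putting those suggestions together, a minimal sketch (the corpus and parameter values here are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat",
    "the cat and the dog",
    "the dog ran",
    "the dog and the mat",
]

vectorizer = TfidfVectorizer(
    min_df=2,           # drop terms appearing in fewer than 2 documents
    max_df=0.9,         # drop terms appearing in more than 90% of documents
    sublinear_tf=True,  # replace tf with 1 + log(tf)
)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))  # "the" is gone, and so are the rare terms
```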
2013/11/29 Philipp Singer :
> Nevertheless, when I look up the top tfidf terms for each document, such
> high frequent terms are on the top of the list even though they occur in
> each single document. I took a deeper look into the specific values, and it
> appears that all these terms – which occu
Hi there,
I am currently working with the TfidfVectorizer provided by scikit-learn.
However, I just ran into a problem/question. In my case I have around 20
very long documents. Some terms in these documents occur much, much more
frequently than others. From my pure intuition, these terms s