Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Lars Buitinck
2013/11/29 Olivier Grisel : > 2013/11/29 Andreas Hjortgaard Danielsen : >> Hi, >> >> It might be worth noting that Lucene uses the same implementation: >> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > > Same as what? The current master or @lar

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Andreas Hjortgaard Danielsen
On 29 November 2013 14:43, Olivier Grisel wrote: > 2013/11/29 Andreas Hjortgaard Danielsen : > > Hi, > > > > It might be worth noting that Lucene uses the same implementation: > > > http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > > Same as wha

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Olivier Grisel
2013/11/29 Andreas Hjortgaard Danielsen : > Hi, > > It might be worth noting that Lucene uses the same implementation: > http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html Same as what? The current master or @larsmans' suggested fix? > And Gensim h

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Andreas Hjortgaard Danielsen
Hi, It might be worth noting that Lucene uses the same implementation: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html And Gensim has an option for choosing an additive constant (although the default is 0). https://github.com/piskvorky/gensim/bl

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Alright! With the +1 removed, the results look much more plausible. Also, the sublinear transformation makes sense. However, why use min_df=2 if I am worried about very common words? -Original Message- From: Lars Buitinck [mailto:larsm...@gmail.com] Sent: Friday, 29 November 2013
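
The effect of the +1 mentioned in this message can be illustrated with a small calculation. This is only a sketch, assuming the idf under discussion has the form log(n_docs / df) + 1; the snippets in this thread do not spell the formula out. The function name `idf` and the raw count of 500 are hypothetical; the corpus size of 20 comes from the original question.

```python
import math

def idf(n_docs, df, add_one=True):
    """Inverse document frequency, optionally with the additive +1 (assumed form)."""
    base = math.log(n_docs / df)
    return base + 1.0 if add_one else base

n_docs = 20  # roughly the corpus size mentioned in the original question
tf = 500     # hypothetical raw count of a very frequent term in one document

# A term that appears in every document has df == n_docs, so log(n_docs / df) == 0.
with_plus_one = tf * idf(n_docs, n_docs, add_one=True)
without_plus_one = tf * idf(n_docs, n_docs, add_one=False)

print(with_plus_one)     # the weight is just the raw term count
print(without_plus_one)  # the weight drops to zero
```

Under this assumption, a term present in every document keeps a weight equal to its raw count when the +1 is applied, which would explain why such terms rank at the top of long documents.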

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Olivier Grisel
> Anyway, if you're worried about very common words, try setting min_df=2, and if you have a few long documents, try sublinear_tf=True. That replaces tf with 1 + log(tf) so repeated occurrences of a word get penalized. To trim words that occur in more than 90% of the documents, `max_df=0.9` works great to
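
The options named in this message can be sketched in plain Python. This is an illustration only, not scikit-learn's implementation: `sublinear_tf` and `filter_by_df` are hypothetical helper names, the document-frequency table is made up, and `min_df` is treated as an absolute count while `max_df` is treated as a proportion of documents.

```python
import math

def sublinear_tf(tf):
    """Dampened term frequency: 1 + log(tf) for tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def filter_by_df(doc_freq, n_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    }

# Hypothetical document frequencies in a corpus of 20 documents.
doc_freq = {"the": 20, "model": 10, "typo": 1}

kept = filter_by_df(doc_freq, n_docs=20, min_df=2, max_df=0.9)
print(kept)

# A raw count of 1000 is dampened to roughly 7.9 instead of 1000.
print(sublinear_tf(1000))
```

In this sketch min_df removes rare terms ("typo") and max_df removes terms found in nearly every document ("the"), so max_df is the setting that targets very common words.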

Re: [Scikit-learn-general] TFIDF question

2013-11-29 Thread Lars Buitinck
2013/11/29 Philipp Singer : > Nevertheless, when I look up the top tfidf terms for each document, such > highly frequent terms are at the top of the list even though they occur in > every single document. I took a deeper look into the specific values, and it > appears that all these terms – which occu

[Scikit-learn-general] TFIDF question

2013-11-29 Thread Philipp Singer
Hi there, I am currently working with the TfidfVectorizer provided by scikit-learn, and I have just run into a problem/question. In my case I have around 20 very long documents. Some terms in these documents occur much, much more frequently than others. Intuitively, these terms s