Re: [Scikit-learn-general] Problem in text feature extraction (sklearn.feature_extraction.text)

2013-02-24 Thread amueller
The missing 2 in tokenizing 2.50 is indeed a bit weird, though. Tom Fawcett wrote: >First, thanks for all your great work on scikits.learn! It’s making my >life easier. > >Second, I found surprising behavior in sklearn.feature_extraction.text. >I’m using TfidfVectorizer and CountVectorizer

Re: [Scikit-learn-general] Problem in text feature extraction (sklearn.feature_extraction.text)

2013-02-24 Thread amueller
For the missing 'r' in the docs: it looks like a Sphinx glitch to me, and I have not found a way to fix it. For the tokenization: the sklearn regexp seems like a sensible default to me. What would you change it to so as to still be robust? Tom Fawcett wrote: >First, thanks for all your great w

Re: [Scikit-learn-general] r2_score producing negative values

2013-02-24 Thread Gael Varoquaux
On Sun, Feb 24, 2013 at 04:32:05PM -0500, Ronnie Ghose wrote: > On Thu, Jan 24, 2013 at 11:50 AM, Flavio Vinicius wrote: > I think you can only guarantee that R2 is always positive when > performing linear regression with no constraints. I believe that the linear regression should be unr

Re: [Scikit-learn-general] r2_score producing negative values

2013-02-24 Thread Ronnie Ghose
I had a similar question a month or two ago, I think, so this seems relevant; here it is. It's still something of a surprise to me too. On Thu, Jan 24, 2013 at 11:50 AM, Flavio Vinicius wrote: > I think you can only guarantee that R2 is always positive when > performing linear regres

Re: [Scikit-learn-general] r2_score producing negative values

2013-02-24 Thread Gael Varoquaux
On Fri, Feb 22, 2013 at 01:39:07PM -0500, Steven Greening wrote: > I tried to use r2_score to calculate the coefficient of determination > for a multiple regression problem and find that it is producing > negative values. That shouldn't be a surprise: an r2_score of 0 is chance. What you are find
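To make the point about negative values concrete, here is a minimal sketch of the usual definition of the coefficient of determination, R2 = 1 - SS_res / SS_tot. The helper name `r2_sketch` and the toy data are assumptions made for illustration; they are not taken from the thread and this is not the library's code. Predicting the mean of the targets scores 0, and any prediction that does worse than the mean scores below 0.

```python
def r2_sketch(y_true, y_pred):
    """Hypothetical helper: R2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data, invented for this sketch.
y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2_sketch(y_true, y_true)                  # exact predictions
r2_mean = r2_sketch(y_true, [2.5, 2.5, 2.5, 2.5])       # always predict the mean
r2_reversed = r2_sketch(y_true, [4.0, 3.0, 2.0, 1.0])   # worse than the mean

print(r2_perfect, r2_mean, r2_reversed)
```

With permuted labels, as in a permutation test, a fitted model often predicts worse than the mean on held-out data, so scores at or below zero are what one would expect there.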

[Scikit-learn-general] r2_score producing negative values

2013-02-24 Thread Steven Greening
Hello all, I tried to use r2_score to calculate the coefficient of determination for a multiple regression problem and find that it is producing negative values. Specifically I'm using the r2_score function with the permutation_test_score function, and a large majority of the r2 values from the p

Re: [Scikit-learn-general] Problem in text feature extraction (sklearn.feature_extraction.text)

2013-02-24 Thread Ronnie Ghose
e, I think it should be kept as it is? IMHO, it's that way in case you have something irregular such as "the.cat.in.the.hat.23.45.6632". I'm assuming the $ is treated as just another punctuation sign, e.g. no special treatment for the pound / euro / yen / etc. signs. I think those should be ke

[Scikit-learn-general] Problem in text feature extraction (sklearn.feature_extraction.text)

2013-02-24 Thread Tom Fawcett
First, thanks for all your great work on scikits.learn! It’s making my life easier. Second, I found surprising behavior in sklearn.feature_extraction.text. I’m using TfidfVectorizer and CountVectorizer to process news stories. The default tokenizer uses the regular expression '(?u)\b\w\w+\b'
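The behavior of that regular expression can be reproduced with Python's standard `re` module alone. The sketch below applies the quoted pattern to a made-up sentence (the sample text is an assumption, not from the original message): the pattern requires at least two word characters per token, so the "2" in "2.50" is dropped while "50" survives. The second pattern, `DECIMAL_AWARE_PATTERN`, is a hypothetical alternative written only for this illustration; it is not a proposal made in the thread and not a scikit-learn default.

```python
import re

# Pattern quoted in the message above.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative (an assumption for illustration only):
# try a number with an optional decimal part first, then fall back
# to the two-or-more word-character rule.
DECIMAL_AWARE_PATTERN = r"(?u)\d+(?:\.\d+)?|\b\w\w+\b"

# Sample text invented for this sketch.
text = "costs $2.50 a share"

default_tokens = re.findall(DEFAULT_PATTERN, text)
decimal_tokens = re.findall(DECIMAL_AWARE_PATTERN, text)

print(default_tokens)   # ['costs', '50', 'share']
print(decimal_tokens)   # ['costs', '2.50', 'share']
```

Note that the alternative also keeps bare single digits as tokens, which the quoted pattern discards; whether that trade-off is acceptable depends on the corpus. A pattern like this could presumably be supplied through the vectorizers' `token_pattern` argument rather than by changing the default.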