Re: [Scikit-learn-general] Text classification and file names output

2012-08-12 Thread Robert Layton
On 10 August 2012 07:14, Jack Alan wrote:
> Hi everyone,
>
> I'm working on text classification on the tutorial provided:
> document_classification_20newsgroups.py
>
> I wonder how I'll be able to print a list of the documents' names being
> used in the test folder with their predicted classes af

Re: [Scikit-learn-general] GridSearch over min_n and max_n in CountVectorizer

2012-08-12 Thread Robert Layton
On 13 August 2012 01:56, Andreas Mueller wrote:
> On 08/12/2012 01:56 PM, Alexandre Gramfort wrote:
> >> Hey Everybody.
> >> I was just trying to use CountVectorizer but I have trouble using
> >> GridSearch over both max_n and min_n.
> >> I guess the problem is that the parameters are conditione

Re: [Scikit-learn-general] No methods seem to predict well

2012-08-12 Thread Zach Bastick
Thanks, I removed the unnecessary fit method. Regarding normalization, aren't the features automatically normalized with the l2 norm when using tf-idf?
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(trainingTexts)
Just in case, I added the following but get the same resu
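On the normalization question: TfidfVectorizer uses norm='l2' by default, so each document row already has unit Euclidean length. The plain-Python sketch below is not code from the thread. The helper `l2_normalize` and the sample weights are made up for illustration. It shows why normalizing a second time would be expected to give the same result.

```python
import math

def l2_normalize(row):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        return list(row)
    return [v / norm for v in row]

raw = [3.0, 0.0, 4.0]        # made-up tf-idf weights for one document
once = l2_normalize(raw)     # what the vectorizer's l2 step would produce
twice = l2_normalize(once)   # an extra, redundant normalization

print(once)
print(all(math.isclose(a, b) for a, b in zip(once, twice)))
```

Since the second pass divides by a norm of about 1, the values do not change, which would be consistent with seeing identical scores after adding an explicit normalization step.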

Re: [Scikit-learn-general] GridSearch over min_n and max_n in CountVectorizer

2012-08-12 Thread Andreas Mueller
On 08/12/2012 01:56 PM, Alexandre Gramfort wrote:
>> Hey Everybody.
>> I was just trying to use CountVectorizer but I have trouble using
>> GridSearch over both max_n and min_n.
>> I guess the problem is that the parameters are conditioned on each other.
>> Is there a nice way to do this?
>> I gue

Re: [Scikit-learn-general] GridSearch over min_n and max_n in CountVectorizer

2012-08-12 Thread Alexandre Gramfort
> Hey Everybody.
> I was just trying to use CountVectorizer but I have trouble using
> GridSearch over both max_n and min_n.
> I guess the problem is that the parameters are conditioned on each other.
> Is there a nice way to do this?
> I guess I could generate lists of param_grids, i.e. one for e
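The list-of-grids idea mentioned in this message can be sketched in plain Python. This is only an illustration, not code from the thread: the helper name `ngram_param_grids` is made up, and the `vect__` prefix assumes a Pipeline whose vectorizer step is called `vect`. It builds one grid per min_n value, each listing only the max_n values that are at least as large, so invalid combinations are never tried.

```python
def ngram_param_grids(max_len, prefix="vect__"):
    """Build one parameter grid per min_n, restricted to max_n >= min_n."""
    grids = []
    for min_n in range(1, max_len + 1):
        grids.append({
            prefix + "min_n": [min_n],
            prefix + "max_n": list(range(min_n, max_len + 1)),
        })
    return grids

# A list of dicts like this can be passed as the param_grid of a grid search.
param_grid = ngram_param_grids(3)
for grid in param_grid:
    print(grid)
```

Later scikit-learn releases replaced min_n and max_n with a single `ngram_range` tuple, which avoids the coupling between the two parameters.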

Re: [Scikit-learn-general] Differences between SVC(kernel='linear') and LinearSVC

2012-08-12 Thread Mathieu Blondel
On Sun, Aug 12, 2012 at 6:53 PM, Andreas Mueller wrote:
> Does anyone have an explanation for that?
> Btw, I am using the sparse versions to do some text classification.
>
One difference is that SVC fits the intercept directly (without using the dummy feature trick). So the intercept is not pena
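The intercept difference described in this reply can be written out as two objectives. This is a sketch based on the general formulation of the two underlying solvers, not on anything stated in the thread: $\ell$ stands for whichever loss is configured, and $s$ for LinearSVC's `intercept_scaling` parameter.

```latex
% SVC(kernel='linear'): the intercept b is a free variable outside the penalty
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \ell\bigl(y_i\,(w^\top x_i + b)\bigr)

% LinearSVC: each x_i is extended to [x_i, s], and the extra weight w_{d+1} is penalized along with w
\min_{w,\,w_{d+1}}\ \tfrac{1}{2}\bigl(\lVert w\rVert^2 + w_{d+1}^2\bigr) + C\sum_i \ell\bigl(y_i\,(w^\top x_i + s\,w_{d+1})\bigr),
\qquad b = s\,w_{d+1}
```

Raising `intercept_scaling` weakens the penalty on the intercept. The two classes also default to different losses (hinge for SVC, squared hinge for LinearSVC), so some gap may remain even then.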

[Scikit-learn-general] Differences between SVC(kernel='linear') and LinearSVC

2012-08-12 Thread Andreas Mueller
Hi everybody. Yesterday I noticed big differences in performance between SVC with linear kernel and LinearSVC. I vaguely remember there was an issue about that, but can't find it any more. I tried making the stopping criterion very strict but I still saw a big difference. Does anyone have an e

Re: [Scikit-learn-general] No methods seem to predict well

2012-08-12 Thread Andreas Mueller
Just a small comment: You don't need to `fit` the models before using `cross_val_score`. They are refit anew for each split. Btw, have you tried normalizing your responses?
Cheers, Andy
On 08/12/2012 07:02 AM, Zach Bastick wrote:
Sorry about that, the RTF reader is from the Pyth library:

Re: [Scikit-learn-general] encoding with TfidfVectorizer

2012-08-12 Thread Andreas Mueller
Hi Zach. I am no expert on the text extraction module but I'm pretty sure your guess is correct and this is a problem with the encoding of the file. You could use the "charset error" option to just ignore these characters. See the docs here
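The option referred to here was the vectorizer's `charset_error` parameter at the time (renamed `decode_error` in later scikit-learn releases). Its values mirror Python's own decoding error handlers, so the effect can be illustrated without scikit-learn. The byte string below is a made-up example, not data from the thread.

```python
raw = b"caf\xe9 au lait"   # Latin-1 bytes; \xe9 is not valid UTF-8 here

# Strict decoding (the default) fails on the first bad byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "ignore" drops undecodable bytes; "replace" substitutes U+FFFD.
ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(ignored)
print(replaced)
```

Ignoring errors silently loses characters, so if the real encoding of the files is known, passing that encoding to the vectorizer is usually the cleaner fix.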