I am using a simple text processing pipeline to perform sentiment
classification:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

steps = [('vect', CountVectorizer()),
         ('tfidf', TfidfTransformer()),
         ('clf', LogisticRegression())]
pipe = Pipeline(steps)

With v0.14 the cross-validation scores peak around 0.67, while with v0.15 they
peak at 0.55.  That seems like a significant difference to me.  My
hyperparameter grid search is as follows:

import numpy as np

params = {'vect__ngram_range': [(1, 1), (1, 2)],
          'vect__stop_words': ['english', None],
          'tfidf__use_idf': [True, False],
          'clf__C': np.logspace(-1, 2, 10)}
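For completeness, here is a self-contained sketch of how I run the search end-to-end. The toy corpus and labels below are made up purely so the snippet runs on its own; note that in 0.15 `GridSearchCV` lives in `sklearn.grid_search` (it moved to `sklearn.model_selection` in later releases):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 0.15

# Hypothetical toy data standing in for the real sentiment corpus.
docs = ["good movie", "great film", "wonderful acting", "loved it",
        "bad movie", "terrible film", "awful acting", "hated it"]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', LogisticRegression())])

params = {'vect__ngram_range': [(1, 1), (1, 2)],
          'vect__stop_words': ['english', None],
          'tfidf__use_idf': [True, False],
          'clf__C': np.logspace(-1, 2, 10)}

# cv=2 only because the toy corpus is tiny; I use more folds in practice.
grid = GridSearchCV(pipe, params, cv=2)
grid.fit(docs, labels)
print(grid.best_score_, grid.best_params_)
```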

I have repeated the experiment with other classifiers (linear SVM, naive
Bayes) and seen a similar drop between v0.14 and v0.15.  Is this a bug, or
am I missing some hyperparameters that need to be tuned differently in
v0.15?

- Matt Coursen
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general