2013/7/19 Fred Mailhot <[email protected]>:
> Hello list...

Hi Fred,

> I'm a huge fan of sklearn and use it daily at work. I was confused by the
> results of some recent text classification experiments and started looking
> more closely at the vectorization code.
>
> I'm wondering about the logic behind:
>
> 1) not doing stopword removal for the char_wb analyzer in CountVectorizer?

I did not think about it, as stopwords are traditionally used with "real"
words, but I have no objection to applying stopwords more consistently.
Please feel free to submit a PR with the fix along with a new test case.

> (I'm using FeatureUnion to combine vectorizer for word and char ngrams, and
> the char analyzer is getting tripped up on stopword ngrams)

I don't understand what you mean by that. Do you have an example?

> and
>
> 2) padding tokens with a single space in the char_wb analyzer (I'm guessing
> this is to disambiguate ngrams that occur at word boundaries from those that
> don't,

Yes.

> but why not pad with (n-1) spaces?)

Why would you do that? That would (re)create char ngram features that are
already generated by lower-n ngrams. For instance, with ngram_range=(3, 5),
padding with more than one whitespace character would generate 5-grams that
are already generated as 4-grams, only with a different feature name and
thus a different column. That would add redundancy to the features without
adding any new signal, if I am correct.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
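
To make the redundancy argument about padding concrete, here is a minimal sketch in plain Python. It is a simplified stand-in for word-boundary character n-grams, not scikit-learn's actual implementation; the helper names `char_wb_ngrams` and `collapse_padding` and the `pad` parameter are hypothetical and exist only for this illustration.

```python
def char_wb_ngrams(word, ngram_range, pad=1):
    """Pad one word with `pad` spaces per side and list its character n-grams.

    Simplified illustration only; this is not the scikit-learn analyzer.
    """
    padded = " " * pad + word + " " * pad
    min_n, max_n = ngram_range
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


def collapse_padding(ngram):
    """Reduce any run of leading or trailing spaces to a single space."""
    core = ngram.strip(" ")
    left = " " if ngram.startswith(" ") else ""
    right = " " if ngram.endswith(" ") else ""
    return left + core + right


# All n-grams of "cat" with a single padding space and ngram_range=(3, 5).
single = set(char_wb_ngrams("cat", (3, 5), pad=1))

# Only the 5-grams produced when padding with two spaces instead.
double_5grams = [g for g in char_wb_ngrams("cat", (3, 5), pad=2) if len(g) == 5]

for gram in double_5grams:
    collapsed = collapse_padding(gram)
    # Show each doubly padded 5-gram, its collapsed form, and whether that
    # form already exists among the singly padded n-grams.
    print(repr(gram), "->", repr(collapsed), collapsed in single)
```

For the word "cat", every 5-gram obtained with two padding spaces collapses to an n-gram that the single-space padding already produces, so the extra columns would carry no new information about the word boundary.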
