2013/7/19 Fred Mailhot <[email protected]>:
> Hello list...

Hi Fred,

> I'm a huge fan of sklearn and use it daily at work. I was confused by the
> results of some recent text classification experiments and started looking
> more closely at the vectorization code.
>
> I'm wondering about the logic behind:
>
> 1) not doing stopword removal for the char_wb analyzer in CountVectorizer?

I had not thought about it, as stop words are traditionally used with
"real" words, but I have no objection to applying stop words more
consistently. Please feel free to submit a PR with the fix along with
a new test case.

> (I'm using FeatureUnion to combine vectorizer for word and char ngrams, and
> the char analyzer is getting tripped up on stopword ngrams)

I don't understand what you mean by that. Do you have an example?

> and
>
> 2) padding tokens with a single space in the char_wb analyzer (I'm guessing
> this is to disambiguate ngrams that occur at word boundaries from those that
> don't,

Yes.

> but why not pad with (n-1) spaces?)

Why would you do that? That would (re)create char ngram features that
are already generated by lower n ngrams.

For instance, with ngram_range=(3, 5), if you pad with more than one
whitespace character you would generate 5-grams that are already
generated as 4-grams, only with a different feature name and thus a
different column. That would add redundancy to the features without
adding any new signal, if I am correct.
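
To make the redundancy concrete, here is a minimal pure-Python sketch of
word-boundary character n-grams with a configurable amount of padding. It
is not scikit-learn's implementation: the function name char_wb_ngrams and
its pad parameter are hypothetical, and words shorter than n are simply
skipped here.

```python
import re


def char_wb_ngrams(text, ngram_range=(3, 5), pad=1):
    """Character n-grams within word boundaries (illustrative sketch only).

    `pad` is a hypothetical parameter: the number of spaces added on each
    side of every word before the n-grams are extracted.
    """
    min_n, max_n = ngram_range
    ngrams = []
    for word in text.split():
        padded = " " * pad + word + " " * pad
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams


def collapse(ngram):
    """Merge runs of spaces into a single space."""
    return re.sub(r" {2,}", " ", ngram)


single = set(char_wb_ngrams("hello", pad=1))
double = set(char_wb_ngrams("hello", pad=2))

# Features that only exist because of the second padding space.
extra = sorted(double - single)

# Extra n-grams longer than the minimum n that match an existing
# (n-1)-gram once the doubled space is merged.
redundant = [g for g in extra if len(g) > 3 and collapse(g) in single]

print(extra)
print(redundant)
```

In this sketch the extra 4-grams and 5-grams, such as "  hel", map back to
features that single padding already produces (" hel"). The extra 3-grams,
such as "  h", correspond to 2-grams that lie outside ngram_range.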

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
