Hello list...
I'm a huge fan of sklearn and use it daily at work. I was confused by the
results of some recent text classification experiments and started looking
more closely at the vectorization code.
I'm wondering about the logic behind:
1) not doing stopword removal for the char_wb analyzer in CountVectorizer?
(I'm using FeatureUnion to combine vectorizers for word and char n-grams,
and the char analyzer is getting tripped up on n-grams built from stopwords)
and
2) padding tokens with a single space in the char_wb analyzer (I'm guessing
this is to disambiguate ngrams that occur at word boundaries from those
that don't, but why not pad with (n-1) spaces?)
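To make question 2 concrete, here's a minimal pure-Python sketch of what I
understand char_wb to be doing (the function name is mine, not sklearn's):
each word gets a single space of padding on each side, and character n-grams
are then taken inside the padded word only.

```python
def char_wb_ngrams(text, n=3):
    """Sketch of my reading of char_wb: pad each word with ONE space
    on each side, then emit character n-grams within the padded word.
    Words shorter than n are emitted whole, as padded."""
    ngrams = []
    for word in text.split():
        padded = " " + word + " "  # single-space padding in question
        if len(padded) < n:
            ngrams.append(padded)
            continue
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

# With n=3, "fox" yields " fo", "fox", "ox " -- only the first and last
# trigrams carry any boundary information, which is what prompted my
# question about padding with (n-1) spaces instead.
print(char_wb_ngrams("jumpy fox"))
```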
Cheers & thanks for an awesome suite of tools!
Fred.
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general