2011/11/22 SK Sn <[email protected]>:
> I looked into WordNGramAnalyzer in feature_extraction/text.py.
>
> It occurred to me that for n-grams with n > 1, 'handle token n-grams' happens
> before 'handle stop words', as shown in the following snippet:

<snip>

> It seems strange to me that, especially when I define my own stop words,
> those stop words still appear inside the n-grams.
> Is there any special consideration behind this implementation? Thanks.

Olivier wrote this, so maybe he can comment on it as well, but it
seems like a mistake to me. Stop word filtering would require a list
of stop n-grams in this case.

The question is what we'd want to do:
* compile a list of stop n-grams from the stop list, i.e. filter out
"to be" but not "the president"
* filter out the stop words prior to n-gram building, so that
"president of France" yields the bigram "president France" (doesn't
occur in the text)
* filter out the stop words afterward, so that "president of France"
doesn't yield any bigrams at all

I haven't tried implementing any yet, but I think the second would be
just a matter of moving some lines in the source code.
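For illustration, the second option might look roughly like this (the
regexp tokenizer and the tiny stop list are simplified stand-ins for the
actual scikit-learn code, not the real implementation):

```python
import re

# Illustrative stop list only; not scikit-learn's built-in English list.
STOP_WORDS = frozenset({"of", "the", "to", "be"})

def word_ngrams(text, n=2, stop_words=STOP_WORDS):
    """Build word n-grams, dropping stop words *before* n-gram building
    (option 2 above), so "president of France" yields "president france"."""
    tokens = [t for t in re.findall(r"\b\w+\b", text.lower())
              if t not in stop_words]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("president of France"))  # ['president france']
```

With option 3 instead, any bigram containing a stop word would be
discarded after construction, so the same input would produce no
bigrams at all.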

I'm not aware of prior art in the literature, but maybe we could check
how Lucene handles the combination of stop words and bigrams, since
it's the de facto standard package for the kind of tf-idf modeling
that feature_extraction.text does?

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general