On 24 November 2011 22:51, Lars Buitinck <[email protected]> wrote:
> 2011/11/22 SK Sn <[email protected]>:
>> I looked into WordNGramAnalyzer in feature_extraction/text.py.
>>
>> It occurred to me that for n-grams with n>1, 'handle token n-grams' happens
>> before 'handle stop words', as shown in the following snippet:
>
> <snip>
>
>> It seems strange to me, especially when I define my own stop words,
>> that these stop words still appear inside the n-grams.
>> Is there any special consideration behind this implementation? Thanks.
>
> Olivier wrote this, so maybe he can comment on it as well, but it
> seems like a mistake to me. Stop word filtering would require a list
> of stop n-grams in this case.
>
> The question is what we'd want to do:
> * compile a list of stop n-grams from the stop list, i.e. filter out
> "to be" but not "the president"
> * filter out the stop words prior to n-gram building, so that
> "president of France" yields the bigram "president France" (doesn't
> occur in the text)
> * filter out the stop words afterward so that so "president of France"
> doesn't yield any bigrams
>
> I haven't tried implementing any yet, but I think the second would be
> just a matter of moving some lines in the source code.
>
> I'm not aware of prior art in the literature, but maybe we could check
> how Lucene handles the combination of stop words and bigrams, being
> the de facto standard package for the tf-idf modeling that
> feature_extraction.text does?

I've worked a bit with Lucene, so I may be able to shed some light on how it works.

It applies a chain of filters, defined by the user in a configuration
file, to a token stream.
The stop-word filter can be placed either before or after the n-gram filter.
Hence, the token stream you obtain from Tokenizer + StopWords + NGram
will differ from the one produced by Tokenizer + NGram + StopWords.
This makes Lucene very flexible, but also quite complex to use.
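To make the difference concrete, here is a minimal sketch (plain Python, not Lucene or scikit-learn code; the helper names and the toy stop list are my own) contrasting Lars's second and third options on the "president of France" example:

```python
STOP_WORDS = {"of", "the", "to", "be"}  # toy stop list for illustration


def tokenize(text):
    return text.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


text = "president of France"

# Option 2: filter stop words *before* building bigrams.
# Yields "president france", a bigram that never occurs in the text.
print(bigrams(remove_stop_words(tokenize(text))))  # ['president france']

# Option 3: filter *after* building bigrams, dropping any bigram
# that contains a stop word. Here every bigram is dropped.
print([b for b in bigrams(tokenize(text))
       if not any(t in STOP_WORDS for t in b.split())])  # []
```

Ordering the two steps as composable filters, as above, is essentially what the Lucene filter-chain approach lets the user configure.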


>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
