On 24 November 2011 22:51, Lars Buitinck <[email protected]> wrote:
> 2011/11/22 SK Sn <[email protected]>:
>> I looked into WordNGramAnalyzer in feature_extraction/text.py.
>>
>> It occurred to me that in the n-gram case with n > 1, 'handle token n-grams' happens
>> before 'handle stop words', as shown in the following snippet:
>
> <snip>
>
>> At the very least it seems strange to me that, especially when I define my own
>> stop words, these stop words can still appear in the n-grams.
>> Is there any special consideration behind this implementation? Thanks.
>
> Olivier wrote this, so maybe he can comment on it as well, but it
> seems like a mistake to me. Stop word filtering would require a list
> of stop n-grams in this case.
>
> The question is what we'd want to do:
> * compile a list of stop n-grams from the stop list, i.e. filter out
>   "to be" but not "the president"
> * filter out the stop words prior to n-gram building, so that
>   "president of France" yields the bigram "president France" (which doesn't
>   occur in the text)
> * filter out the stop words afterward, so that "president of France"
>   doesn't yield any bigrams at all
>
> I haven't tried implementing any of these yet, but I think the second would be
> just a matter of moving some lines in the source code.
>
> I'm not aware of prior art in the literature, but maybe we could check
> how Lucene handles the combination of stop words and bigrams, it being
> the de facto standard package for the kind of tf-idf modeling that
> feature_extraction.text does?
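For concreteness, here is a minimal Python sketch (not the scikit-learn implementation, just toy helper functions) of how options 2 and 3 above differ on the "president of France" example:

```python
# Toy illustration of filtering stop words before vs. after
# building bigrams. `bigrams` and `stop_words` are made up for
# this example; they are not scikit-learn APIs.

def bigrams(tokens):
    # Join each adjacent pair of tokens into a space-separated bigram.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

stop_words = {"of", "the", "to", "be"}
tokens = ["president", "of", "france"]

# Option 2: drop stop words first, then build bigrams.
filtered_first = bigrams([t for t in tokens if t not in stop_words])
# -> ['president france']  (a bigram that never occurs in the text)

# Option 3: build bigrams first, then drop any containing a stop word.
filtered_after = [g for g in bigrams(tokens)
                  if not stop_words.intersection(g.split())]
# -> []  ("president of" and "of france" both contain "of")
```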
I've worked a bit with Lucene, so I may be able to shed some light on how it works. It applies a chain of filters to a token stream, and the user defines that chain in a configuration file. The stop word filter can be applied before or after an n-gram filter, so the token stream you obtain from Tokenizer + StopFilter + NGramFilter will differ from the one produced by Tokenizer + NGramFilter + StopFilter. This makes Lucene very flexible, but also quite complex to use.

> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure
> contains a definitive record of customers, application performance,
> security threats, fraudulent activity, and more. Splunk takes this
> data and makes sense of it. IT sense. And common sense.
> http://p.sf.net/sfu/splunk-novd2d
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
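P.S. The filter-chain idea can be mimicked in a few lines of Python, just to show how the order of the filters changes the output. This is a hypothetical sketch, not Lucene's actual API; the function names are invented for the example:

```python
# Hypothetical analogue of a Lucene analyzer: a tokenizer followed by
# a chain of token filters, where filter order matters.

def tokenize(text):
    return text.lower().split()

def stop_filter(tokens, stop_words=frozenset({"of", "the"})):
    # Drop any token that is exactly a stop word.
    return [t for t in tokens if t not in stop_words]

def ngram_filter(tokens, n=2):
    # Emit space-joined n-grams of adjacent tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "president of France"

# Tokenizer + StopFilter + NGramFilter
chain_a = ngram_filter(stop_filter(tokenize(text)))
# -> ['president france']

# Tokenizer + NGramFilter + StopFilter
chain_b = stop_filter(ngram_filter(tokenize(text)))
# -> ['president of', 'of france']  (no bigram equals a stop word)
```

So the two chain orders give genuinely different token streams, which is why the order has to be a user choice rather than something baked into the analyzer.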
