2011/11/22 SK Sn <[email protected]>:
> I looked into WordNGramAnalyzer in feature_extraction/text.py.
>
> It occurred to me that in the case of n-grams with n > 1, 'handle token
> n-grams' happens before 'handle stop words', as shown in the following
> snippet:
<snip>
> At least it is strange to me, especially when I define my own stop
> words, that these stop words still appear in the n-grams.
> Is there any special consideration behind this implementation? Thanks.

Olivier wrote this, so maybe he can comment on it as well, but it seems
like a mistake to me. Stop word filtering would require a list of stop
n-grams in this case. The question is what we'd want to do:

* compile a list of stop n-grams from the stop list, i.e. filter out
  "to be" but not "the president"
* filter out the stop words prior to n-gram building, so that
  "president of France" yields the bigram "president France" (which
  doesn't occur in the text)
* filter out the stop words afterward, so that "president of France"
  doesn't yield any bigrams

I haven't tried implementing any of these yet, but I think the second
would be just a matter of moving some lines in the source code. I'm not
aware of prior art in the literature, but maybe we could check how
Lucene handles the combination of stop words and bigrams, since it is
the de facto standard package for the kind of tf-idf modeling that
feature_extraction.text does?

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
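To make the trade-off concrete, here is a minimal sketch of the three
orderings. This is illustrative code, not scikit-learn's actual
implementation; the stop list, function names, and the space-joined
n-gram representation are all assumptions for the example.

```python
# Hypothetical stop list for illustration only.
STOP_WORDS = {"of", "the", "to", "be"}

def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list, as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigrams_stop_ngrams(tokens):
    # Option 1: build bigrams, then drop those consisting entirely of
    # stop words ("to be" goes, "the president" stays).
    return [g for g in ngrams(tokens)
            if not all(t in STOP_WORDS for t in g.split())]

def bigrams_filter_first(tokens):
    # Option 2: drop stop words first, then build bigrams, so
    # "president of France" yields "president France".
    return ngrams([t for t in tokens if t not in STOP_WORDS])

def bigrams_filter_after(tokens):
    # Option 3: build bigrams, then drop any containing a stop word,
    # so "president of France" yields no bigrams at all.
    return [g for g in ngrams(tokens)
            if not any(t in STOP_WORDS for t in g.split())]

tokens = "president of France".split()
print(bigrams_stop_ngrams(tokens))   # ['president of', 'of France']
print(bigrams_filter_first(tokens))  # ['president France']
print(bigrams_filter_after(tokens))  # []
```

Option 2 is the one that should amount to just reordering the filtering
and n-gram-building steps in the existing analyzer.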
