On 25 November 2011 08:58, Nelle Varoquaux <[email protected]> wrote:
> On 24 November 2011 22:51, Lars Buitinck <[email protected]> wrote:
> > 2011/11/22 SK Sn <[email protected]>:
> >> I looked into WordNGramAnalyzer in feature_extraction/text.py.
> >>
> >> It occurred to me that for n-grams with n > 1, 'handle token n-grams'
> >> happens before 'handle stop words', as shown in the following snippet:
> >
> > <snip>
> >
> >> It seems strange to me, especially when I define my own stop words,
> >> that these stop words can still appear inside the n-grams.
> >> Is there a special reason for this implementation? Thanks.
> >
> > Olivier wrote this, so maybe he can comment on it as well, but it
> > seems like a mistake to me. Stop word filtering would require a list
> > of stop n-grams in this case.
> >
> > The question is what we'd want to do:
> > * compile a list of stop n-grams from the stop list, i.e. filter out
> > "to be" but not "the president"
> > * filter out the stop words prior to n-gram building, so that
> > "president of France" yields the bigram "president France" (doesn't
> > occur in the text)
> > * filter out the stop words afterward, so that "president of France"
> > doesn't yield any bigrams
> >
> > I haven't tried implementing any yet, but I think the second would be
> > just a matter of moving some lines in the source code.
> >
> > I'm not aware of prior art in the literature, but maybe we could check
> > how Lucene handles the combination of stop words and bigrams, being
> > the de facto standard package for the tf-idf modeling that
> > feature_extraction.text does?
>
> I've worked a bit with Lucene, so I may be able to shed some light on
> how it works.
>
> It uses a chain of filters, defined by the user in a configuration
> file, that is applied to a token stream.
> The stop word filter can be applied either before or after an n-gram
> filter. Hence, the token stream you obtain from Tokenizer + StopWords +
> NGram will differ from the one produced by Tokenizer + NGram +
> StopWords.
> This makes Lucene very flexible, but also quite complex to use.
>
>
> >
> > --
> > Lars Buitinck
> > Scientific programmer, ILPS
> > University of Amsterdam
> >
> >
> ------------------------------------------------------------------------------
> > All the data continuously generated in your IT infrastructure
> > contains a definitive record of customers, application performance,
> > security threats, fraudulent activity, and more. Splunk takes this
> > data and makes sense of it. IT sense. And common sense.
> > http://p.sf.net/sfu/splunk-novd2d
> > _______________________________________________
> > Scikit-learn-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
>
>
>
I work with n-grams every day, and I don't use sklearn's version at all.
I like the idea of chaining. Perhaps we can do a filter-like operation?
def extract(self, document, filters):
    # apply the first filter, then recurse on the rest; filters are
    # applied in list order, without mutating the caller's list
    if not filters:
        return document
    return self.extract(filters[0](document), filters[1:])
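For example, the chain could be exercised like this (a minimal sketch using an iterative, free-function version of extract; the stand-in filters lowercase and drop_short are illustrative only, not existing scikit-learn code):

```python
def lowercase(tokens):
    # stand-in filter: lowercase every token
    return [t.lower() for t in tokens]

def drop_short(tokens):
    # stand-in filter: drop tokens of length <= 2
    return [t for t in tokens if len(t) > 2]

def extract(document, filters):
    # apply each filter in list order, feeding one's output into the next
    for f in filters:
        document = f(document)
    return document

print(extract(["The", "President", "of", "France"], [lowercase, drop_short]))
# ['the', 'president', 'france']
```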
filters is then a list of functions, each of which takes a sequence as
input and returns a sequence as output. Common filters:
extract_words: takes a string as input, returns a list of words
remove_stopwords: takes a list of words as input, returns the same list
minus the stop words
extract_ngrams(n): takes a list of anything as input, returns all its
n-grams for the given n (i.e. a closure)
All of these can be implemented as generators, which keeps memory
consumption lower than materialising intermediate lists.
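A rough sketch of what those generator filters might look like (function names and the stop list are illustrative, not existing scikit-learn API; this ordering implements Lars's second option, filtering stop words before building n-grams):

```python
def extract_words(text):
    # naive whitespace split; yields lowercased words lazily
    for word in text.split():
        yield word.strip(".,;:!?").lower()

def remove_stopwords(stopwords):
    # closure: returns a filter bound to a particular stop list
    def _filter(words):
        for word in words:
            if word not in stopwords:
                yield word
    return _filter

def extract_ngrams(n):
    # closure: returns a filter yielding n-grams as space-joined strings
    def _filter(tokens):
        window = []
        for token in tokens:
            window.append(token)
            if len(window) == n:
                yield " ".join(window)
                window.pop(0)
    return _filter

def extract(document, filters):
    # chain the filters in list order
    for f in filters:
        document = f(document)
    return document

stop = {"of", "the"}
print(list(extract("the President of France",
                   [extract_words, remove_stopwords(stop), extract_ngrams(2)])))
# ['president france']
```

Note that each filter only consumes the previous generator one token at a time, so no intermediate list of the whole document is ever built.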
Thoughts?
Robert
--
Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)