Re: [Scikit-learn-general] CountVectorizer needs too much input?

Robert Layton Fri, 23 Nov 2012 14:07:47 -0800

On 24 November 2012 09:02, Ronnie Ghose <[email protected]> wrote:


> Eh... I see nothing about that here
> http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
>
> .. what does fit() do then?
>
>
> On 23 November 2012 16:57, Robert Layton <[email protected]> wrote:
>
>> On 24 November 2012 07:49, Doug Coleman <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I just want some n-grams--I don't necessarily want to tell
>>> CountVectorizer my life story. It's pretty stingy about giving n-grams
>>> unless you pass it a ton of data or something.
>>>
>>> Am I using it wrong? Are there kwargs that I missed that would support
>>> this kind of use case?
>>>
>>> Thanks,
>>> Doug
>>>
>>>
>>> In [225]: cv = CountVectorizer(analyzer='char', stop_words=None,
>>> ngram_range=(1,5))
>>>
>>> In [226]: cv.fit(['Gimme n-grams!'])
>>>
>>> ---------------------------------------------------------------------------
>>> ValueError                                Traceback (most recent call
>>> last)
>>> <ipython-input-226-ccd2238d644b> in <module>()
>>> ----> 1 cv.fit(['Gimme n-grams!'])
>>>
>>> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc
>>> in fit(self, raw_documents, y)
>>>     430         self
>>>     431         """
>>> --> 432         self.fit_transform(raw_documents)
>>>     433         return self
>>>     434
>>>
>>> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc
>>> in fit_transform(self, raw_documents, y)
>>>     518         vocab = dict(((t, i) for i, t in
>>> enumerate(sorted(terms))))
>>>     519         if not vocab:
>>> --> 520             raise ValueError("empty vocabulary; training set may
>>> have"
>>>     521                              " contained only stop words")
>>>     522         self.vocabulary_ = vocab
>>>
>>> ValueError: empty vocabulary; training set may have contained only stop
>>> words
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Monitor your physical, virtual and cloud infrastructure from a single
>>> web console. Get in-depth insight into apps, servers, databases, vmware,
>>> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
>>> Pricing starts from $795 for 25 servers or applications!
>>> http://p.sf.net/sfu/zoho_dev2dev_nov
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>> For very simple cases like this, try:
>> >>> from sklearn.feature_extraction.text import CountVectorizer
>> >>> cv = CountVectorizer(analyzer='char', stop_words=None, ngram_range=(1,5))
>> >>> cv._char_ngrams('Gimme n-grams!')
>> ['G', 'i', 'm', 'm', 'e', ' ', 'n', '-', 'g', 'r', 'a', 'm', 's', '!',
>> 'Gi', 'im', 'mm', 'me', 'e ', ' n', 'n-', '-g', 'gr', 'ra', 'am', 'ms',
>> 's!', 'Gim', 'imm', 'mme', 'me ', 'e n', ' n-', 'n-g', '-gr', 'gra', 'ram',
>> 'ams', 'ms!', 'Gimm', 'imme', 'mme ', 'me n', 'e n-', ' n-g', 'n-gr',
>> '-gra', 'gram', 'rams', 'ams!', 'Gimme', 'imme ', 'mme n', 'me n-',
>> 'e n-g', ' n-gr', 'n-gra', '-gram', 'grams', 'rams!']
>> >>> documents = ['document1', 'anotherdoc2', 'yetanother3', 'onemore4']
>> >>> map(cv._char_ngrams, documents)
>> [['d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '1', 'do', 'oc', 'cu', 'um',
>> 'me', 'en', 'nt', 't1', 'doc', 'ocu', 'cum', 'ume', 'men', 'ent', 'nt1',
>> 'docu', 'ocum', 'cume', 'umen', 'ment', 'ent1', 'docum', 'ocume', 'cumen',
>> 'ument', 'ment1'], ['a', 'n', 'o', 't', 'h', 'e', 'r', 'd', 'o', 'c', '2',
>> 'an', 'no', 'ot', 'th', 'he', 'er', 'rd', 'do', 'oc', 'c2', 'ano', 'not',
>> 'oth', 'the', 'her', 'erd', 'rdo', 'doc', 'oc2', 'anot', 'noth', 'othe',
>> 'ther', 'herd', 'erdo', 'rdoc', 'doc2', 'anoth', 'nothe', 'other', 'therd',
>> 'herdo', 'erdoc', 'rdoc2'], ['y', 'e', 't', 'a', 'n', 'o', 't', 'h', 'e',
>> 'r', '3', 'ye', 'et', 'ta', 'an', 'no', 'ot', 'th', 'he', 'er', 'r3',
>> 'yet', 'eta', 'tan', 'ano', 'not', 'oth', 'the', 'her', 'er3', 'yeta',
>> 'etan', 'tano', 'anot', 'noth', 'othe', 'ther', 'her3', 'yetan', 'etano',
>> 'tanot', 'anoth', 'nothe', 'other', 'ther3'], ['o', 'n', 'e', 'm', 'o',
>> 'r', 'e', '4', 'on', 'ne', 'em', 'mo', 'or', 're', 'e4', 'one', 'nem',
>> 'emo', 'mor', 'ore', 're4', 'onem', 'nemo', 'emor', 'more', 'ore4',
>> 'onemo', 'nemor', 'emore', 'more4']]
>>
>> (no need to fit!)
>>
>>
>> There is another way:
>> >>> analyser = cv.build_analyzer()
>> >>> map(analyser, documents)
>> [[u'd', u'o', u'c', u'u', u'm', u'e', u'n', u't', u'1', u'do', u'oc',
>> u'cu', u'um', u'me', u'en', u'nt', u't1', u'doc', u'ocu', u'cum', u'ume',
>> u'men', u'ent', u'nt1', u'docu', u'ocum', u'cume', u'umen', u'ment',
>> u'ent1', u'docum', u'ocume', u'cumen', u'ument', u'ment1'], [u'a', u'n',
>> u'o', u't', u'h', u'e', u'r', u'd', u'o', u'c', u'2', u'an', u'no', u'ot',
>> u'th', u'he', u'er', u'rd', u'do', u'oc', u'c2', u'ano', u'not', u'oth',
>> u'the', u'her', u'erd', u'rdo', u'doc', u'oc2', u'anot', u'noth', u'othe',
>> u'ther', u'herd', u'erdo', u'rdoc', u'doc2', u'anoth', u'nothe', u'other',
>> u'therd', u'herdo', u'erdoc', u'rdoc2'], [u'y', u'e', u't', u'a', u'n',
>> u'o', u't', u'h', u'e', u'r', u'3', u'ye', u'et', u'ta', u'an', u'no',
>> u'ot', u'th', u'he', u'er', u'r3', u'yet', u'eta', u'tan', u'ano', u'not',
>> u'oth', u'the', u'her', u'er3', u'yeta', u'etan', u'tano', u'anot',
>> u'noth', u'othe', u'ther', u'her3', u'yetan', u'etano', u'tanot', u'anoth',
>> u'nothe', u'other', u'ther3'], [u'o', u'n', u'e', u'm', u'o', u'r', u'e',
>> u'4', u'on', u'ne', u'em', u'mo', u'or', u're', u'e4', u'one', u'nem',
>> u'emo', u'mor', u'ore', u're4', u'onem', u'nemo', u'emor', u'more',
>> u'ore4', u'onemo', u'nemor', u'emore', u'more4']]
>>
>>
>> This version, which is probably a bit cleaner, can be found here:
>> http://scikit-learn.org/dev/modules/feature_extraction.html (CTRL + F
>> for 'bigram_vectorizer = CountVectorizer')
>>
>> Hope that helps,
>>
>> Robert
>>
>> --
>>
>> Public key at: http://pgp.mit.edu/ Search for this email address and
>> select the key from "2011-08-19" (key id: 54BA8735)
>>
>>
>>
>>
>>
>
>
>
>
It is there, but not highlighted enough:
    build_analyzer()
        Return a callable that handles preprocessing and tokenization
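
A minimal sketch of using it directly, assuming a scikit-learn version
where CountVectorizer has build_analyzer(). The lowercase=False argument
is my own addition here, so that the case of the input is kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

# lowercase=False is an assumption for this sketch: by default the
# analyzer lowercases the text before extracting n-grams.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5), lowercase=False)

# build_analyzer() returns a callable that takes ONE document (a string)
# and returns its list of n-grams. No call to fit() is needed.
analyze = cv.build_analyzer()
ngrams = analyze('Gimme n-grams!')

print(len(ngrams))
print(ngrams[:5])
```

To process several documents, call the analyzer once per document, e.g.
[analyze(doc) for doc in documents].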


fit() is meant for the usual data mining workflow:

1) Split documents into training and testing sets.
2) fit(training_set)
3) transform(testing_set)

fit() builds the vocabulary (the list of tokens) from training_set, and
transform() then counts only those tokens in testing_set (i.e. it ignores
any 'new' tokens that were not seen during fit()).
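
A rough sketch of that workflow, reusing the example documents from
earlier in the thread. The split into two training and two testing
documents, and the shorter ngram_range=(1, 2), are arbitrary choices made
only for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# 1) Split documents into training and testing sets (arbitrary split).
documents = ['document1', 'anotherdoc2', 'yetanother3', 'onemore4']
training_set, testing_set = documents[:2], documents[2:]

cv = CountVectorizer(analyzer='char', ngram_range=(1, 2))

# 2) fit() learns the vocabulary from the training set only.
cv.fit(training_set)
vocab = cv.vocabulary_  # maps each n-gram to its column index

# 3) transform() counts only n-grams already in the vocabulary.
X_test = cv.transform(testing_set)

# 'ot' was seen in training ('anotherdoc2'), so it is counted in
# 'yetanother3'; 'y' and '3' never appeared in training and are ignored.
print(X_test.shape)
print('ot' in vocab, 'y' in vocab, '3' in vocab)
print(X_test[0, vocab['ot']])
```

The number of columns in X_test equals the size of the vocabulary learned
in step 2, regardless of what the testing documents contain.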

-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)

