Re: [Scikit-learn-general] CountVectorizer needs too much input?

Robert Layton Fri, 23 Nov 2012 13:58:15 -0800

On 24 November 2012 07:49, Doug Coleman <[email protected]> wrote:


> Hi,
>
> I just want some n-grams--I don't necessarily want to tell CountVectorizer
> my life story. It's pretty stingy about giving n-grams unless you pass it a
> ton of data or something.
>
> Am I using it wrong? Are there kwargs that I missed that would support
> this kind of use case?
>
> Thanks,
> Doug
>
>
> In [225]: cv = CountVectorizer(analyzer='char', stop_words=None,
> ngram_range=(1,5))
>
> In [226]: cv.fit(['Gimme n-grams!'])
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-226-ccd2238d644b> in <module>()
> ----> 1 cv.fit(['Gimme n-grams!'])
>
> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
>     430         self
>     431         """
> --> 432         self.fit_transform(raw_documents)
>     433         return self
>     434
>
> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
>     518         vocab = dict(((t, i) for i, t in enumerate(sorted(terms))))
>     519         if not vocab:
> --> 520             raise ValueError("empty vocabulary; training set may have"
>     521                              " contained only stop words")
>     522         self.vocabulary_ = vocab
>
> ValueError: empty vocabulary; training set may have contained only stop words
>
>
>
> ------------------------------------------------------------------------------
> Monitor your physical, virtual and cloud infrastructure from a single
> web console. Get in-depth insight into apps, servers, databases, vmware,
> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> Pricing starts from $795 for 25 servers or applications!
> http://p.sf.net/sfu/zoho_dev2dev_nov
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

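For very simple cases like this, you can call the n-gram helper directly
(note that _char_ngrams is a private method, so it may change between releases):
>>> from sklearn.feature_extraction.text import CountVectorizer as CV
>>> cv = CV(analyzer='char', stop_words=None, ngram_range=(1,5))
>>> cv._char_ngrams('Gimme n-grams!')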
['G', 'i', 'm', 'm', 'e', ' ', 'n', '-', 'g', 'r', 'a', 'm', 's', '!',
'Gi', 'im', 'mm', 'me', 'e ', ' n', 'n-', '-g', 'gr', 'ra', 'am', 'ms',
's!', 'Gim', 'imm', 'mme', 'me ', 'e n', ' n-', 'n-g', '-gr', 'gra', 'ram',
'ams', 'ms!', 'Gimm', 'imme', 'mme ', 'me n', 'e n-', ' n-g', 'n-gr',
'-gra', 'gram', 'rams', 'ams!', 'Gimme', 'imme ', 'mme n', 'me n-',
'e n-g', ' n-gr', 'n-gra', '-gram', 'grams', 'rams!']
>>> documents = ['document1', 'anotherdoc2', 'yetanother3', 'onemore4']
>>> list(map(cv._char_ngrams, documents))
[['d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '1', 'do', 'oc', 'cu', 'um',
'me', 'en', 'nt', 't1', 'doc', 'ocu', 'cum', 'ume', 'men', 'ent', 'nt1',
'docu', 'ocum', 'cume', 'umen', 'ment', 'ent1', 'docum', 'ocume', 'cumen',
'ument', 'ment1'], ['a', 'n', 'o', 't', 'h', 'e', 'r', 'd', 'o', 'c', '2',
'an', 'no', 'ot', 'th', 'he', 'er', 'rd', 'do', 'oc', 'c2', 'ano', 'not',
'oth', 'the', 'her', 'erd', 'rdo', 'doc', 'oc2', 'anot', 'noth', 'othe',
'ther', 'herd', 'erdo', 'rdoc', 'doc2', 'anoth', 'nothe', 'other', 'therd',
'herdo', 'erdoc', 'rdoc2'], ['y', 'e', 't', 'a', 'n', 'o', 't', 'h', 'e',
'r', '3', 'ye', 'et', 'ta', 'an', 'no', 'ot', 'th', 'he', 'er', 'r3',
'yet', 'eta', 'tan', 'ano', 'not', 'oth', 'the', 'her', 'er3', 'yeta',
'etan', 'tano', 'anot', 'noth', 'othe', 'ther', 'her3', 'yetan', 'etano',
'tanot', 'anoth', 'nothe', 'other', 'ther3'], ['o', 'n', 'e', 'm', 'o',
'r', 'e', '4', 'on', 'ne', 'em', 'mo', 'or', 're', 'e4', 'one', 'nem',
'emo', 'mor', 'ore', 're4', 'onem', 'nemo', 'emor', 'more', 'ore4',
'onemo', 'nemor', 'emore', 'more4']]

(no need to fit!)
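
For anyone curious what that helper is doing, the same list can be produced
without scikit-learn at all. This is only a minimal sketch, not the library's
implementation: char_ngrams is a hypothetical name, and it skips any
preprocessing or whitespace normalisation that scikit-learn may apply.

```python
def char_ngrams(text, ngram_range=(1, 5)):
    """Return all character n-grams of `text`, shortest first.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    min_n, max_n = ngram_range
    ngrams = []
    # n never exceeds len(text), so short strings simply yield fewer n-grams.
    for n in range(min_n, min(max_n, len(text)) + 1):
        # Slide a window of width n across the string.
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams


grams = char_ngrams('Gimme n-grams!')
print(len(grams))            # 14 + 13 + 12 + 11 + 10 = 60
print(grams[:3], grams[-1])
```

The ordering (all 1-grams, then all 2-grams, and so on) matches the output
shown above.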


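There is another way, using the public build_analyzer() method (the analyzer
takes one document at a time, and also applies preprocessing such as lowercasing):
>>> analyser = cv.build_analyzer()
>>> list(map(analyser, documents))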
[[u'd', u'o', u'c', u'u', u'm', u'e', u'n', u't', u'1', u'do', u'oc',
u'cu', u'um', u'me', u'en', u'nt', u't1', u'doc', u'ocu', u'cum', u'ume',
u'men', u'ent', u'nt1', u'docu', u'ocum', u'cume', u'umen', u'ment',
u'ent1', u'docum', u'ocume', u'cumen', u'ument', u'ment1'], [u'a', u'n',
u'o', u't', u'h', u'e', u'r', u'd', u'o', u'c', u'2', u'an', u'no', u'ot',
u'th', u'he', u'er', u'rd', u'do', u'oc', u'c2', u'ano', u'not', u'oth',
u'the', u'her', u'erd', u'rdo', u'doc', u'oc2', u'anot', u'noth', u'othe',
u'ther', u'herd', u'erdo', u'rdoc', u'doc2', u'anoth', u'nothe', u'other',
u'therd', u'herdo', u'erdoc', u'rdoc2'], [u'y', u'e', u't', u'a', u'n',
u'o', u't', u'h', u'e', u'r', u'3', u'ye', u'et', u'ta', u'an', u'no',
u'ot', u'th', u'he', u'er', u'r3', u'yet', u'eta', u'tan', u'ano', u'not',
u'oth', u'the', u'her', u'er3', u'yeta', u'etan', u'tano', u'anot',
u'noth', u'othe', u'ther', u'her3', u'yetan', u'etano', u'tanot', u'anoth',
u'nothe', u'other', u'ther3'], [u'o', u'n', u'e', u'm', u'o', u'r', u'e',
u'4', u'on', u'ne', u'em', u'mo', u'or', u're', u'e4', u'one', u'nem',
u'emo', u'mor', u'ore', u're4', u'onem', u'nemo', u'emor', u'more',
u'ore4', u'onemo', u'nemor', u'emore', u'more4']]


This version, which is probably a bit cleaner, is described here:
http://scikit-learn.org/dev/modules/feature_extraction.html
(CTRL + F for 'bigram_vectorizer = CountVectorizer')
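
As for the original error: as far as I can tell it comes from
document-frequency filtering rather than from the n-gram extraction itself (I
believe releases from around this time defaulted to min_df=2, which drops every
n-gram when there is only one document, but treat that as an assumption). A
sketch, assuming a scikit-learn release that accepts an integer min_df, of
fitting on a single string by passing min_df=1 explicitly:

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df=1 is passed explicitly so that n-grams seen in only one
# document are kept, whatever the installed release's default is.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=1)

# The analyzer handles one document at a time and needs no fitting.
# Default preprocessing lowercases the text, so the first n-gram is 'g'.
analyzer = cv.build_analyzer()
ngrams = analyzer('Gimme n-grams!')

# Fitting on a single document gives a matrix with one row and one
# column per distinct n-gram.
X = cv.fit_transform(['Gimme n-grams!'])

print(len(ngrams))
print(X.shape, len(cv.vocabulary_))
```

If you only want the n-grams and not the counts, the analyzer alone is enough
and fit_transform can be skipped.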

Hope that helps,

Robert

-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)

