Re: [Scikit-learn-general] CountVectorizer needs too much input?

Ronnie Ghose Fri, 23 Nov 2012 14:32:21 -0800

....ehhh having that in the docs would be really helpful ._.


On 23 November 2012 17:06, Robert Layton <[email protected]> wrote:

> On 24 November 2012 09:02, Ronnie Ghose <[email protected]> wrote:
>
>> Eh... I see nothing about that here
>> http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
>>
>> .. what does fit() do then?
>>
>>
>> On 23 November 2012 16:57, Robert Layton <[email protected]> wrote:
>>
>>> On 24 November 2012 07:49, Doug Coleman <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I just want some n-grams--I don't necessarily want to tell
>>>> CountVectorizer my life story. It's pretty stingy about giving n-grams
>>>> unless you pass it a ton of data or something.
>>>>
>>>> Am I using it wrong? Are there kwargs that I missed that would support
>>>> this kind of use case?
>>>>
>>>> Thanks,
>>>> Doug
>>>>
>>>>
>>>> In [225]: cv = CountVectorizer(analyzer='char', stop_words=None,
>>>> ngram_range=(1,5))
>>>>
>>>> In [226]: cv.fit(['Gimme n-grams!'])
>>>>
>>>> ---------------------------------------------------------------------------
>>>> ValueError                                Traceback (most recent call last)
>>>> <ipython-input-226-ccd2238d644b> in <module>()
>>>> ----> 1 cv.fit(['Gimme n-grams!'])
>>>>
>>>> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
>>>>     430         self
>>>>     431         """
>>>> --> 432         self.fit_transform(raw_documents)
>>>>     433         return self
>>>>     434
>>>>
>>>> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
>>>>     518         vocab = dict(((t, i) for i, t in enumerate(sorted(terms))))
>>>>     519         if not vocab:
>>>> --> 520             raise ValueError("empty vocabulary; training set may have"
>>>>     521                              " contained only stop words")
>>>>     522         self.vocabulary_ = vocab
>>>>
>>>> ValueError: empty vocabulary; training set may have contained only stop words
>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Monitor your physical, virtual and cloud infrastructure from a single
>>>> web console. Get in-depth insight into apps, servers, databases, vmware,
>>>> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
>>>> Pricing starts from $795 for 25 servers or applications!
>>>> http://p.sf.net/sfu/zoho_dev2dev_nov
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> [email protected]
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>> For very simple cases like this, try:
>>> >>> from sklearn.feature_extraction.text import CountVectorizer as CV
>>> >>> cv = CV(analyzer='char', stop_words=None, ngram_range=(1,5))
>>> >>> cv._char_ngrams('Gimme n-grams!')
>>> ['G', 'i', 'm', 'm', 'e', ' ', 'n', '-', 'g', 'r', 'a', 'm', 's', '!',
>>> 'Gi', 'im', 'mm', 'me', 'e ', ' n', 'n-', '-g', 'gr', 'ra', 'am', 'ms',
>>> 's!', 'Gim', 'imm', 'mme', 'me ', 'e n', ' n-', 'n-g', '-gr', 'gra', 'ram',
>>> 'ams', 'ms!', 'Gimm', 'imme', 'mme ', 'me n', 'e n-', ' n-g', 'n-gr',
>>> '-gra', 'gram', 'rams', 'ams!', 'Gimme', 'imme ', 'mme n', 'me n-',
>>> 'e n-g', ' n-gr', 'n-gra', '-gram', 'grams', 'rams!']
>>> >>> documents = ['document1', 'anotherdoc2', 'yetanother3', 'onemore4']
>>> >>> map(cv._char_ngrams, documents)
>>> [['d', 'o', 'c', 'u', 'm', 'e', 'n', 't', '1', 'do', 'oc', 'cu', 'um',
>>> 'me', 'en', 'nt', 't1', 'doc', 'ocu', 'cum', 'ume', 'men', 'ent', 'nt1',
>>> 'docu', 'ocum', 'cume', 'umen', 'ment', 'ent1', 'docum', 'ocume', 'cumen',
>>> 'ument', 'ment1'], ['a', 'n', 'o', 't', 'h', 'e', 'r', 'd', 'o', 'c', '2',
>>> 'an', 'no', 'ot', 'th', 'he', 'er', 'rd', 'do', 'oc', 'c2', 'ano', 'not',
>>> 'oth', 'the', 'her', 'erd', 'rdo', 'doc', 'oc2', 'anot', 'noth', 'othe',
>>> 'ther', 'herd', 'erdo', 'rdoc', 'doc2', 'anoth', 'nothe', 'other', 'therd',
>>> 'herdo', 'erdoc', 'rdoc2'], ['y', 'e', 't', 'a', 'n', 'o', 't', 'h', 'e',
>>> 'r', '3', 'ye', 'et', 'ta', 'an', 'no', 'ot', 'th', 'he', 'er', 'r3',
>>> 'yet', 'eta', 'tan', 'ano', 'not', 'oth', 'the', 'her', 'er3', 'yeta',
>>> 'etan', 'tano', 'anot', 'noth', 'othe', 'ther', 'her3', 'yetan', 'etano',
>>> 'tanot', 'anoth', 'nothe', 'other', 'ther3'], ['o', 'n', 'e', 'm', 'o',
>>> 'r', 'e', '4', 'on', 'ne', 'em', 'mo', 'or', 're', 'e4', 'one', 'nem',
>>> 'emo', 'mor', 'ore', 're4', 'onem', 'nemo', 'emor', 'more', 'ore4',
>>> 'onemo', 'nemor', 'emore', 'more4']]
>>>
>>> (no need to fit!)
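
To make the output quoted above easier to read, here is a minimal plain-Python sketch of character n-gram extraction. It is not scikit-learn's implementation; the function name `char_ngrams` and its parameters are hypothetical, chosen only for illustration. It groups n-grams by length, in the same order as the lists above.

```python
def char_ngrams(text, min_n=1, max_n=5):
    """Return all character n-grams of `text` for min_n <= n <= max_n.

    N-grams are grouped by length: all 1-grams first, then 2-grams, etc.
    (Hypothetical helper for illustration, not part of scikit-learn.)
    """
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text. The range is empty
        # when n is longer than the text, so short inputs are safe.
        for start in range(len(text) - n + 1):
            ngrams.append(text[start:start + n])
    return ngrams


grams = char_ngrams('Gimme n-grams!')
print(len(grams))   # 14 + 13 + 12 + 11 + 10 = 60 n-grams
print(grams[-3:])
```

Nothing here depends on a fitted vocabulary, which is why no fit() call is involved.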
>>>
>>>
>>> There is another way:
>>> >>> analyser = cv.build_analyzer()
>>> >>> map(analyser, documents)
>>> [[u'd', u'o', u'c', u'u', u'm', u'e', u'n', u't', u'1', u'do', u'oc',
>>> u'cu', u'um', u'me', u'en', u'nt', u't1', u'doc', u'ocu', u'cum', u'ume',
>>> u'men', u'ent', u'nt1', u'docu', u'ocum', u'cume', u'umen', u'ment',
>>> u'ent1', u'docum', u'ocume', u'cumen', u'ument', u'ment1'], [u'a', u'n',
>>> u'o', u't', u'h', u'e', u'r', u'd', u'o', u'c', u'2', u'an', u'no', u'ot',
>>> u'th', u'he', u'er', u'rd', u'do', u'oc', u'c2', u'ano', u'not', u'oth',
>>> u'the', u'her', u'erd', u'rdo', u'doc', u'oc2', u'anot', u'noth', u'othe',
>>> u'ther', u'herd', u'erdo', u'rdoc', u'doc2', u'anoth', u'nothe', u'other',
>>> u'therd', u'herdo', u'erdoc', u'rdoc2'], [u'y', u'e', u't', u'a', u'n',
>>> u'o', u't', u'h', u'e', u'r', u'3', u'ye', u'et', u'ta', u'an', u'no',
>>> u'ot', u'th', u'he', u'er', u'r3', u'yet', u'eta', u'tan', u'ano', u'not',
>>> u'oth', u'the', u'her', u'er3', u'yeta', u'etan', u'tano', u'anot',
>>> u'noth', u'othe', u'ther', u'her3', u'yetan', u'etano', u'tanot', u'anoth',
>>> u'nothe', u'other', u'ther3'], [u'o', u'n', u'e', u'm', u'o', u'r', u'e',
>>> u'4', u'on', u'ne', u'em', u'mo', u'or', u're', u'e4', u'one', u'nem',
>>> u'emo', u'mor', u'ore', u're4', u'onem', u'nemo', u'emor', u'more',
>>> u'ore4', u'onemo', u'nemor', u'emore', u'more4']]
>>>
>>>
>>> This version, which is probably a bit cleaner, can be found here:
>>> http://scikit-learn.org/dev/modules/feature_extraction.html (CTRL + F
>>> for 'bigram_vectorizer = CountVectorizer')
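
For anyone on Python 3, the build_analyzer() approach described above might look like the sketch below. It assumes a reasonably recent scikit-learn install and reuses the sample documents from the thread. Note that CountVectorizer lowercases its input by default (lowercase=True), so mixed-case text such as 'Gimme' would come back in lower case unless that option is turned off.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ['document1', 'anotherdoc2', 'yetanother3', 'onemore4']

# Character n-grams of length 1 to 5; no vocabulary is learned here.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5))

# build_analyzer() returns a callable that takes ONE document (a string)
# and returns its list of n-grams. It does not require fit().
analyzer = cv.build_analyzer()

ngrams = analyzer('document1')
print(len(ngrams))   # 9 + 8 + 7 + 6 + 5 = 35 n-grams
print(ngrams[:9])

# Apply it per document; in Python 3, map() is lazy, so a list
# comprehension is used to get the lists directly.
all_ngrams = [analyzer(doc) for doc in documents]
```

The analyzer expects a single string, so passing the whole list of documents in one call is not what you want; loop over the documents instead.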
>>>
>>> Hope that helps,
>>>
>>> Robert
>>>
>>> --
>>>
>>> Public key at: http://pgp.mit.edu/ Search for this email address and
>>> select the key from "2011-08-19" (key id: 54BA8735)
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
> It is there, but not highlighted enough:
>     build_analyzer()
>         Return a callable that handles preprocessing and tokenization
>
>
> fit() is meant for the usual train/test workflow in data mining:
>
> 1) Split documents into training and testing sets.
> 2) fit(training_set)
> 3) transform(testing_set)
>
> fit() builds the list of tokens (the vocabulary) from training_set;
> transform() then counts only those tokens in testing_set, ignoring any
> 'new' tokens it has not seen before.
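
The fit/transform workflow described above can be sketched roughly as follows. This assumes a recent scikit-learn on Python 3 and the default word-level analyzer; the toy sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy data, for illustration only.
training_set = ["the cat sat", "the dog sat"]
testing_set = ["the bird sat down"]

cv = CountVectorizer()   # default word-level analyzer
cv.fit(training_set)     # learn the vocabulary from the training documents only

# vocabulary_ maps each learned token to a column index.
vocabulary = sorted(cv.vocabulary_)
print(vocabulary)

# transform() counts only tokens already in the vocabulary. 'bird' and
# 'down' were never seen during fit(), so they get no column at all.
counts = cv.transform(testing_set).toarray()
print(counts)
```

The counts have one column per token learned during fit(), so the test document is described purely in terms of the training vocabulary.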
>
>
>
>
>
>
>

