Re: [Scikit-learn-general] CountVectorizer needs too much input?

Lars Buitinck Sat, 24 Nov 2012 02:35:15 -0800

2012/11/24 Doug Coleman <[email protected]>:
> I think I need to use fit() for my use case. Imagine that I have a party
> with N attendees with known names. I would like to take a picture of their
> nametag, run an OCR algorithm and get their names back (with errors from the
> OCR software) and then using n-grams, classify their names.
>
> It would seem that I could use a pipeline of CountVectorizer -> Tf-Idf ->
> Classifier. However, there are arbitrary limitations with CountVectorizer
> that I'm running into, even if I tell it not to use stop words.


CountVectorizer and tf-idf are designed for document processing, so it
shouldn't be surprising that you'd have to do some work to use them in
another domain, though I admit the error messages you got could be better.


> In [244]: cv = CountVectorizer(analyzer='char', stop_words=None, ngram_range=(1,5))
>
> In [245]: cv.fit('Nimrod')
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-245-1c717902dd51> in <module>()
> ----> 1 cv.fit('Nimrod')
>
> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
>     430         self
>     431         """
> --> 432         self.fit_transform(raw_documents)
>     433         return self
>     434
>
> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
>     518         vocab = dict(((t, i) for i, t in enumerate(sorted(terms))))
>     519         if not vocab:
> --> 520             raise ValueError("empty vocabulary; training set may have"
>     521                              " contained only stop words")
>     522         self.vocabulary_ = vocab
>
> ValueError: empty vocabulary; training set may have contained only stop words

This is because CountVectorizer, by default, removes all terms (here,
character n-grams) that occur in only one input sample. In text
classification, those are assumed to have no discriminative power.
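
To make the pruning concrete, here is a minimal pure-Python sketch of
document-frequency filtering over character n-grams. It is not the
scikit-learn implementation; the helper names char_ngrams and
build_vocabulary are made up for illustration, and the threshold of 2 is
an assumption that matches "occurs in only one sample gets dropped".

```python
from collections import Counter


def char_ngrams(text, ngram_range=(1, 5)):
    """Return every character n-gram of text, for n in the inclusive range."""
    lo, hi = ngram_range
    return [text[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(text) - n + 1)]


def build_vocabulary(samples, min_df=2, ngram_range=(1, 5)):
    """Map each n-gram found in at least min_df samples to a column index."""
    doc_freq = Counter()
    for sample in samples:
        # set() so an n-gram is counted once per sample (document frequency)
        doc_freq.update(set(char_ngrams(sample, ngram_range)))
    kept = sorted(t for t, count in doc_freq.items() if count >= min_df)
    return {term: index for index, term in enumerate(kept)}


# One sample: every n-gram has document frequency 1, so nothing survives.
single = build_vocabulary(["Nimrod"], min_df=2)
# Several samples: only n-grams shared by at least two names survive.
several = build_vocabulary(["Nimrod", "Nimoy", "Rodney"], min_df=2)

print(single)  # {}
print(sorted(several))
```

With a single sample the vocabulary is empty, which is the situation the
error message above is (somewhat misleadingly) reporting.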

With the option min_df=1, you can fit a CountVectorizer on one sample,
but then it will remember only the n-grams from that sample and it
will give empty output for samples that share no n-grams with that
sample. You really need to feed it a complete training set to get
anything meaningful out of it.
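
As a rough sketch of what that looks like for the nametag case: fit on
the full list of attendee names (a list of strings, one per sample, not
a bare string), then transform the OCR readings. The names and the OCR
strings below are invented for illustration, and min_df=1 is set
explicitly because each name occurs only once.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical attendee list; in practice this is the full set of known names.
names = ["Nimrod", "Doug", "Lars"]

cv = CountVectorizer(analyzer="char", ngram_range=(1, 5), min_df=1)
cv.fit(names)  # an iterable of strings, one entry per sample

# A made-up OCR misreading that still shares n-grams with "Nimrod".
noisy = cv.transform(["Nirnrod"])
# A string sharing no characters with any training name gives an all-zero row.
unrelated = cv.transform(["xyz"])

print(noisy.sum(), unrelated.sum())
```

The second row is empty because the vectorizer only knows the n-grams it
saw during fit; anything else is silently ignored at transform time.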


> 2) A pipeline that outputs sparse arrays throws an error with
> DecisionTreeClassifiers in array2d() but works with SGDClassifier.

That's because decision trees don't support sparse input. I admit that
this is quite unfortunate.
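
One possible workaround, assuming the feature matrix is small enough to
fit in memory, is to convert the sparse output to a dense array before
handing it to the tree. The data below is invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: each name is its own class label.
names = ["Nimrod", "Doug", "Lars"]

cv = CountVectorizer(analyzer="char", ngram_range=(1, 3), min_df=1)
X_sparse = cv.fit_transform(names)  # scipy sparse matrix

# Densify explicitly; this can be very memory-hungry for large vocabularies.
X_dense = X_sparse.toarray()

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_dense, names)

# Inputs must be densified the same way at prediction time.
pred = tree.predict(cv.transform(["Nimrod"]).toarray())
print(pred[0])
```

For large n-gram vocabularies the dense matrix may not fit in memory, in
which case a linear model such as SGDClassifier, which accepts sparse
input directly, is the more practical choice.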


-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
