2012/11/24 Doug Coleman <[email protected]>:
> I think I need to use fit() for my use case. Imagine that I have a party
> with N attendees with known names. I would like to take a picture of their
> nametag, run an OCR algorithm and get their names back (with errors from the
> OCR software) and then using n-grams, classify their names.
>
> It would seem that I could use a pipeline of CountVectorizer -> Tf-Idf ->
> Classifier. However, there are arbitrary limitations with CountVectorizer
> that I'm running into, even if I tell it not to use stop words.
CountVectorizer and tf-idf are meant for document processing, so it
shouldn't be surprising that you have to do some work to use them in
another domain, though I admit the error message you got could be better.
> In [244]: cv = CountVectorizer(analyzer='char', stop_words=None,
> ngram_range=(1,5))
>
> In [245]: cv.fit('Nimrod')
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-245-1c717902dd51> in <module>()
> ----> 1 cv.fit('Nimrod')
>
> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc
> in fit(self, raw_documents, y)
> 430 self
> 431 """
> --> 432 self.fit_transform(raw_documents)
> 433 return self
> 434
>
> /usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc
> in fit_transform(self, raw_documents, y)
> 518 vocab = dict(((t, i) for i, t in enumerate(sorted(terms))))
> 519 if not vocab:
> --> 520 raise ValueError("empty vocabulary; training set may
> have"
> 521 " contained only stop words")
> 522 self.vocabulary_ = vocab
>
> ValueError: empty vocabulary; training set may have contained only stop
> words
This is because CountVectorizer, by default, removes all terms (here,
character n-grams) that occur in only one input sample. In text
classification, those are assumed to have no discriminative power.
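
To make the filtering concrete, here is a minimal pure-Python sketch of document-frequency pruning over character n-grams. It is not scikit-learn's implementation: `char_ngrams` and `build_vocabulary` are hypothetical helpers, and the default threshold of 2 is an assumption chosen to mirror the behaviour described above. It also illustrates a likely contributing factor in the traceback: a bare string is iterated one character at a time, so 'Nimrod' behaves like six one-character samples.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Return all character n-grams of text for n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def build_vocabulary(documents, min_df=2):
    """Keep only n-grams that occur in at least min_df documents."""
    df = Counter()
    for doc in documents:
        # Count each n-gram once per document (document frequency).
        df.update(set(char_ngrams(doc.lower())))
    return sorted(term for term, count in df.items() if count >= min_df)


# A bare string yields one "document" per character, so nothing reaches
# a document frequency of 2 and the vocabulary comes out empty.
vocab_from_string = build_vocabulary("Nimrod")

# A single sample with min_df=1 keeps every n-gram of that sample.
vocab_single = build_vocabulary(["Nimrod"], min_df=1)

# With two samples, only the n-grams they share survive min_df=2.
vocab_shared = build_vocabulary(["Nimrod", "Nimrodel"], min_df=2)
```
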
With the option min_df=1, you can fit a CountVectorizer on one sample,
but then it will remember only the n-grams from that sample and it
will give empty output for samples that share no n-grams with that
sample. You really need to feed it a complete training set to get
anything meaningful out of it.
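
As a rough sketch of what that could look like for the nametag case: fit the vectorizer on the full list of names (a list of strings, one per sample) and compare an OCR result against it. The attendee list and the `closest_name` helper are made up for illustration, and the plain dot-product match is only a stand-in for a real classifier, not the tf-idf pipeline discussed above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical attendee list; each name is one training sample.
names = ["Nimrod", "Doug", "Lars", "Olivier"]

# min_df=1 keeps n-grams that occur in a single name.
cv = CountVectorizer(analyzer="char", ngram_range=(1, 5), min_df=1)
X_names = cv.fit_transform(names)  # sparse matrix, one row per name


def closest_name(ocr_text):
    """Return the known name with the largest n-gram count overlap."""
    x = cv.transform([ocr_text])  # n-grams unseen during fit are ignored
    scores = (X_names @ x.T).toarray().ravel()
    return names[int(scores.argmax())]


guess = closest_name("Nimrcd")  # simulated OCR error in "Nimrod"

# A string sharing no n-grams with any name gives an all-zero row.
empty = cv.transform(["xyz"])
```

Because the vocabulary is fixed at fit time, the quality of the match depends entirely on how well the training names cover the n-grams that OCR output is likely to contain.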
> 2) A pipeline that outputs sparse arrays throws an error with
> DecisionTreeClassifiers in array2d() but works with SGDClassifier.
That's because decision trees don't support sparse input. I admit that
this is quite unfortunate.
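
One workaround, sketched here under the assumption that the feature matrix is small enough to fit in memory, is to convert the sparse output to a dense array before handing it to the tree. The attendee list is again hypothetical, and newer releases may accept sparse input directly, in which case the conversion is unnecessary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical attendee list; each name is its own class label.
names = ["Nimrod", "Doug", "Lars", "Olivier"]

cv = CountVectorizer(analyzer="char", ngram_range=(1, 5), min_df=1)

# toarray() turns the sparse count matrix into a dense NumPy array.
X_dense = cv.fit_transform(names).toarray()

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_dense, names)

# New samples need the same dense conversion before predict().
prediction = tree.predict(cv.transform(["Nimrod"]).toarray())[0]
```

Dense conversion scales with the number of samples times the vocabulary size, so it is only reasonable for small problems like this one.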
--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam