I think I need to use fit() for my use case. Imagine that I have a party
with N attendees with known names. I would like to take a picture of each
attendee's name tag, run an OCR algorithm to get the name back (with errors
from the OCR software), and then classify the names using n-grams.
It would seem that I could use a pipeline of CountVectorizer -> Tf-Idf ->
Classifier. However, I'm running into what look like arbitrary limitations
in CountVectorizer, even when I tell it not to use stop words.
Here are some examples:
1) I have a single name to recognize, an ngram_range of 1 to 5, and stop
words disabled. Everything should probably classify to this one label no
matter what, so trivially I want transform to output that label regardless
of input. Instead, fit raises an error.
In [244]: cv = CountVectorizer(analyzer='char', stop_words=None,
                               ngram_range=(1, 5))
In [245]: cv.fit('Nimrod')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-245-1c717902dd51> in <module>()
----> 1 cv.fit('Nimrod')

/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
    430             self
    431         """
--> 432         self.fit_transform(raw_documents)
    433         return self
    434

/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    518         vocab = dict(((t, i) for i, t in enumerate(sorted(terms))))
    519         if not vocab:
--> 520             raise ValueError("empty vocabulary; training set may have"
    521                              " contained only stop words")
    522         self.vocabulary_ = vocab

ValueError: empty vocabulary; training set may have contained only stop words
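For what it's worth, my guess at part of what's happening: fit() expects an
iterable of documents, so a bare string gets iterated character by character
(six one-character "documents" for 'Nimrod'), and the min_df=2 default then
drops every n-gram that doesn't occur in at least two of them. A sketch of
what I expected to work, wrapping the name in a list and setting min_df=1
explicitly (min_df being the culprit is my assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# One document ('Nimrod'), not six one-character documents, and
# min_df=1 so n-grams occurring in a single document are kept.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=1)
cv.fit(['Nimrod'])

# 'nimrod' (lowercased) yields 6 unigrams, 5 bigrams, ... 2 five-grams.
print(sorted(cv.vocabulary_))
```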
1b) It doesn't like one word, so how about two? Well, that depends on which
words. Why should whether it works depend on how many words I pass, or
which ones?
In [264]: cv.fit(['Doug', 'Lol'])
Out[264]:
CountVectorizer(analyzer='char', binary=False, charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=2, min_n=None, ngram_range=(1, 5), preprocessor=None,
stop_words=None, strip_accents=None,
token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None,
vocabulary=None)
In [263]: cv.fit('Nimrod', 'Able')
---------------------------------------------------------------------------
(same ValueError traceback as in 1 above)
ValueError: empty vocabulary; training set may have contained only stop words
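Also, unless I'm misreading the signature, cv.fit('Nimrod', 'Able') isn't
fitting on two names at all: the second positional argument of
fit(raw_documents, y) is the (ignored) y, and 'Nimrod' is again split into
one-character documents. Passing both names as a list behaves as I'd expect
(a sketch, with min_df=1 as above so the result doesn't depend on the
min_df default):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two documents; the earlier call passed 'Able' into the unused y slot.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=1)
cv.fit(['Nimrod', 'Able'])
print('nimro' in cv.vocabulary_, 'able' in cv.vocabulary_)
```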
1c) N-grams of size 1 are kind of silly, so let's do 2 to 5 instead. Now
the input that worked above no longer works. How much data is enough, and
why should anyone have to guess when stop words are supposedly disabled?
I'd rather handle stop words in the tf-idf step anyway.
In [269]: cv = CountVectorizer(analyzer='char', stop_words=None,
                               ngram_range=(2, 5))
In [272]: cv.fit(['Doug', 'Lol'])
---------------------------------------------------------------------------
(same ValueError traceback as in 1 above)
ValueError: empty vocabulary; training set may have contained only stop words
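My guess at why (2, 5) fails where (1, 5) worked: with min_df=2, the only
term shared by 'doug' and 'lol' is the unigram 'o', and dropping unigrams
leaves nothing that occurs in both names. Setting min_df=1 sidesteps it
(again, a sketch assuming min_df is the cause):

```python
from sklearn.feature_extraction.text import CountVectorizer

# No character bigram is shared between 'doug' and 'lol', so with
# min_df=2 and ngram_range=(2, 5) the vocabulary comes out empty.
cv = CountVectorizer(analyzer='char', ngram_range=(2, 5), min_df=1)
cv.fit(['Doug', 'Lol'])
print(sorted(cv.vocabulary_))
```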
2) A pipeline that outputs sparse matrices raises a TypeError in array2d()
when the final step is a DecisionTreeClassifier, but it works with
SGDClassifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

X = ["hi", "hello", "greetings", "salutations", "sup", "aloha", "hola",
     "bye", "adios", "later", "seeya", "goodbye"]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=None, ngram_range=(1, 3),
                             analyzer='char')),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier()),
])
pipeline.fit(X, y)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-13fd9d2161fb> in <module>()
----> 1 pipeline.fit(X, y)

/usr/local/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    124         """
    125         Xt, fit_params = self._pre_transform(X, y, **fit_params)
--> 126         self.steps[-1][-1].fit(Xt, y, **fit_params)
    127         return self
    128

/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_mask, X_argsorted, check_input)
    221         if getattr(X, "dtype", None) != DTYPE or \
    222                 X.ndim != 2 or not X.flags.fortran:
--> 223             X = array2d(X, dtype=DTYPE, order="F")
    224
    225         n_samples, self.n_features_ = X.shape

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy)
     69     """Returns at least 2-d array with data from X"""
     70     if sparse.issparse(X):
---> 71         raise TypeError('A sparse matrix was passed, but dense data '
     72                         'is required. Use X.todense() to convert to dense.')
     73     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)

TypeError: A sparse matrix was passed, but dense data is required. Use X.todense() to convert to dense.
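The error message's suggestion does seem workable if the densifying happens
as a pipeline step between tf-idf and the tree. One workaround sketch,
assuming a scikit-learn new enough to have preprocessing.FunctionTransformer
(and using .toarray() rather than .todense() so the tree gets an ndarray
instead of a matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

X = ["hi", "hello", "greetings", "salutations", "sup", "aloha", "hola",
     "bye", "adios", "later", "seeya", "goodbye"]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 3), analyzer='char')),
    ('tfidf', TfidfTransformer()),
    # Trees need dense input: convert the sparse tf-idf matrix to an ndarray.
    ('dense', FunctionTransformer(lambda M: M.toarray(), accept_sparse=True)),
    ('clf', DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X, y)
print(pipeline.score(X, y))
```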
Yet this works with SGDClassifier (I haven't tried others):
X = ["hi", "hello", "greetings", "salutations", "sup", "aloha", "hola",
     "bye", "adios", "later", "seeya", "goodbye"]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=None, ngram_range=(1, 3),
                             analyzer='char')),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
pipeline.fit(X, y)
Out[67]:
Pipeline(steps=[('vect', CountVectorizer(analyzer='char', binary=False,
charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=2, min_n=None, ngram_range=(1, 3), preprocessor=None,
stop_w...ower_t=0.5,
random_state=None, rho=None, shuffle=False, verbose=0,
warm_start=False))])
Finally, these are contrived examples, but I think they demonstrate the
issues I've found so far.
Thanks,
Doug
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general