I think I need to use fit() for my use case. Imagine that I have a party
with N attendees with known names. I would like to take a picture of each
attendee's name tag, run an OCR algorithm to get the name back (with errors
from the OCR software), and then classify the names using n-grams.
It would seem that I could use a pipeline of CountVectorizer -> Tf-Idf ->
Classifier. However, I'm running into what look like arbitrary limitations
in CountVectorizer, even when I tell it not to use stop words.
Here are some examples:
1) I have a single name to recognize, an ngram_range of 1 to 5, and stop
words disabled. Everything should probably classify to this one label no
matter what, so trivially I want transform to output that label regardless
of input. Instead, fit raises an error.
In [244]: cv = CountVectorizer(analyzer='char', stop_words=None,
                               ngram_range=(1, 5))
In [245]: cv.fit('Nimrod')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-245-1c717902dd51> in <module>()
----> 1 cv.fit('Nimrod')

/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit(self, raw_documents, y)
    430             self
    431         """
--> 432         self.fit_transform(raw_documents)
    433         return self
    434

/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.pyc in fit_transform(self, raw_documents, y)
    518         vocab = dict(((t, i) for i, t in enumerate(sorted(terms))))
    519         if not vocab:
--> 520             raise ValueError("empty vocabulary; training set may have"
    521                              " contained only stop words")
    522         self.vocabulary_ = vocab

ValueError: empty vocabulary; training set may have contained only stop words
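For what it's worth, my guess at part of what's happening: fit() expects an
iterable of documents, so a bare string gets iterated character by character
(six one-character "documents" for 'Nimrod'), and the min_df=2 default then
drops every n-gram that doesn't occur in at least two of them. A sketch of
what I expected to work, wrapping the name in a list and setting min_df=1
explicitly (min_df being the culprit is my assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# One document ('Nimrod'), not six one-character documents, and
# min_df=1 so n-grams occurring in a single document are kept.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=1)
cv.fit(['Nimrod'])

# 'nimrod' (lowercased) yields 6 unigrams, 5 bigrams, ... 2 five-grams.
print(sorted(cv.vocabulary_))
```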
1b) It doesn't like one word, so how about two? Well, that depends on which
words. Why should whether it works depend on how many words I pass, or
which ones?
In [264]: cv.fit(['Doug', 'Lol'])
Out[264]:
CountVectorizer(analyzer='char', binary=False, charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=2, min_n=None, ngram_range=(1, 5), preprocessor=None,
stop_words=None, strip_accents=None,
token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None,
vocabulary=None)
In [263]: cv.fit('Nimrod', 'Able')
---------------------------------------------------------------------------
(same ValueError traceback as in 1 above)
ValueError: empty vocabulary; training set may have contained only stop words
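Also, unless I'm misreading the signature, cv.fit('Nimrod', 'Able') isn't
fitting on two names at all: the second positional argument of
fit(raw_documents, y) is the (ignored) y, and 'Nimrod' is again split into
one-character documents. Passing both names as a list behaves as I'd expect
(a sketch, with min_df=1 as above so the result doesn't depend on the
min_df default):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two documents; the earlier call passed 'Able' into the unused y slot.
cv = CountVectorizer(analyzer='char', ngram_range=(1, 5), min_df=1)
cv.fit(['Nimrod', 'Able'])
print('nimro' in cv.vocabulary_, 'able' in cv.vocabulary_)
```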
1c) N-grams of size 1 are kind of silly, so let's do 2 to 5 instead. Now
the input that worked above no longer works. How much data is enough, and
why should anyone have to guess when stop words are supposedly disabled?
I'd rather handle stop words in the tf-idf step anyway.
In [269]: cv = CountVectorizer(analyzer='char', stop_words=None,
                               ngram_range=(2, 5))
In [272]: cv.fit(['Doug', 'Lol'])
---------------------------------------------------------------------------
(same ValueError traceback as in 1 above)
ValueError: empty vocabulary; training set may have contained only stop words
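My guess at why (2, 5) fails where (1, 5) worked: with min_df=2, the only
term shared by 'doug' and 'lol' is the unigram 'o', and dropping unigrams
leaves nothing that occurs in both names. Setting min_df=1 sidesteps it
(again, a sketch assuming min_df is the cause):

```python
from sklearn.feature_extraction.text import CountVectorizer

# No character bigram is shared between 'doug' and 'lol', so with
# min_df=2 and ngram_range=(2, 5) the vocabulary comes out empty.
cv = CountVectorizer(analyzer='char', ngram_range=(2, 5), min_df=1)
cv.fit(['Doug', 'Lol'])
print(sorted(cv.vocabulary_))
```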
2) A pipeline that outputs sparse matrices raises a TypeError in array2d()
when the final step is a DecisionTreeClassifier, but it works with
SGDClassifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

X = ["hi", "hello", "greetings", "salutations", "sup", "aloha", "hola",
     "bye", "adios", "later", "seeya", "goodbye"]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=None, ngram_range=(1, 3),
                             analyzer='char')),
    ('tfidf', TfidfTransformer()),
    ('clf', DecisionTreeClassifier()),
])
pipeline.fit(X, y)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-13fd9d2161fb> in <module>()
----> 1 pipeline.fit(X, y)

/usr/local/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    124         """
    125         Xt, fit_params = self._pre_transform(X, y, **fit_params)
--> 126         self.steps[-1][-1].fit(Xt, y, **fit_params)
    127         return self
    128

/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.pyc in fit(self, X, y, sample_mask, X_argsorted, check_input)
    221         if getattr(X, "dtype", None) != DTYPE or \
    222                 X.ndim != 2 or not X.flags.fortran:
--> 223             X = array2d(X, dtype=DTYPE, order="F")
    224
    225         n_samples, self.n_features_ = X.shape

/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.pyc in array2d(X, dtype, order, copy)
     69     """Returns at least 2-d array with data from X"""
     70     if sparse.issparse(X):
---> 71         raise TypeError('A sparse matrix was passed, but dense data '
     72                         'is required. Use X.todense() to convert to dense.')
     73     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)

TypeError: A sparse matrix was passed, but dense data is required. Use X.todense() to convert to dense.
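The error message's suggestion does seem workable if the densifying happens
as a pipeline step between tf-idf and the tree. One workaround sketch,
assuming a scikit-learn new enough to have preprocessing.FunctionTransformer
(and using .toarray() rather than .todense() so the tree gets an ndarray
instead of a matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.tree import DecisionTreeClassifier

X = ["hi", "hello", "greetings", "salutations", "sup", "aloha", "hola",
     "bye", "adios", "later", "seeya", "goodbye"]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 3), analyzer='char')),
    ('tfidf', TfidfTransformer()),
    # Trees need dense input: convert the sparse tf-idf matrix to an ndarray.
    ('dense', FunctionTransformer(lambda M: M.toarray(), accept_sparse=True)),
    ('clf', DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X, y)
print(pipeline.score(X, y))
```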
Yet this works with SGDClassifier (I haven't tried others):
X = ["hi", "hello", "greetings", "salutations", "sup", "aloha", "hola",
     "bye", "adios", "later", "seeya", "goodbye"]
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words=None, ngram_range=(1, 3),
                             analyzer='char')),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
pipeline.fit(X, y)
Out[67]:
Pipeline(steps=[('vect', CountVectorizer(analyzer='char', binary=False,
charset='utf-8',
charset_error='strict', dtype=<type 'long'>, input='content',
lowercase=True, max_df=1.0, max_features=None, max_n=None,
min_df=2, min_n=None, ngram_range=(1, 3), preprocessor=None,
stop_w...ower_t=0.5,
random_state=None, rho=None, shuffle=False, verbose=0,
warm_start=False))])
Finally, these are contrived examples, but I think they demonstrate the
issues I've found so far.
Thanks,
Doug
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general