Term frequency using scikit-learn's CountVectorizer

Abdul Abdul Sun, 16 Oct 2016 18:08:18 -0700

I have the following code snippet, in which I'm trying to list the term 
frequencies of two .tex documents, first_text and second_text:


from sklearn.feature_extraction.text import CountVectorizer
training_documents = (first_text, second_text)  
vectorizer = CountVectorizer()
vectorizer.fit_transform(training_documents)
print "Vocabulary:", vectorizer.vocabulary_

When I run the script, I get the following:

  File "test.py", line 19, in <module>
    vectorizer.fit_transform(training_documents)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "/usr/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 115, in decode
    doc = doc.decode(self.encoding, self.decode_error)
  File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa2 in position 200086: invalid start byte

How can I fix this issue?

Thanks.
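
One possible direction, offered as a sketch rather than a confirmed fix: the traceback shows the vectorizer calling doc.decode(self.encoding, self.decode_error) on raw bytes, and byte 0xa2 is not a valid UTF-8 start byte, so the files may simply not be UTF-8. The helper below, decode_document, is a hypothetical name and not part of scikit-learn. It assumes Latin-1 is an acceptable fallback encoding, which may not hold for the actual files.

```python
def decode_document(raw_bytes, encodings=("utf-8", "latin-1")):
    """Return text decoded with the first encoding that succeeds.

    Hypothetical helper for illustration; the fallback list is an assumption.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No candidate worked: decode with the first encoding and substitute
    # U+FFFD for every byte that cannot be decoded.
    return raw_bytes.decode(encodings[0], "replace")


# A lone 0xa2 byte reproduces the "invalid start byte" situation.
sample = b"price: 5\xa2"
text = decode_document(sample)
```

The decoded strings could then be passed to fit_transform in place of the raw bytes. Alternatively, since the traceback shows the vectorizer using its own encoding and decode_error settings, constructing it as CountVectorizer(encoding='latin-1') or CountVectorizer(decode_error='ignore') might be enough, depending on how the files were actually saved.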
-- 
https://mail.python.org/mailman/listinfo/python-list
