On Thu, Nov 21, 2013 at 3:04 PM, Albert-Jan Roskam <fo...@yahoo.com> wrote:
>
> Today I had a csv file in utf-8 encoding, but part of the accented
> characters were mangled. The data were scraped from a website, and it
> turned out that at least some of the data were mangled on the website
> already. Bits of the text were actually cp1252 (or cp850), I think,
> even though the webpage was in utf-8. Is there any package that helps
> to correct such issues?
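If the mangling follows the classic mojibake pattern -- UTF-8 bytes that were
mistakenly decoded as cp1252 -- you can sometimes repair it by reversing the
wrong decode. A minimal sketch (the helper name and sample string are my own;
the ftfy package on PyPI automates this kind of repair more robustly):

    def fix_mojibake(text, wrong_codec='cp1252'):
        """Re-encode with the codec that was mistakenly used for
        decoding, then decode the recovered bytes as UTF-8."""
        try:
            return text.encode(wrong_codec).decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Text doesn't fit this pattern; leave it unchanged.
            return text

    # u'café' is 0xC3 0xA9 in UTF-8; decoded as cp1252 it becomes u'cafÃ©'.
    mangled = u'caf\u00c3\u00a9'
    print(fix_mojibake(mangled))  # café

Note this only helps when the damage really is a single wrong decode; text
that is genuinely in a different encoding still needs detection first.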
The links in the Wikipedia article may help:
http://en.wikipedia.org/wiki/Charset_detection

International Components for Unicode (ICU) does charset detection:
http://userguide.icu-project.org/conversion/detection

Python wrapper:
http://pypi.python.org/pypi/PyICU
http://packages.debian.org/wheezy/python-pyicu

Example:

    import icu

    russian_text = u'Здесь некий текст на русском языке.'
    encoded_text = russian_text.encode('windows-1251')

    cd = icu.CharsetDetector()
    cd.setText(encoded_text)
    match = cd.detect()
    matches = cd.detectAll()

    >>> match.getName()
    'windows-1251'
    >>> match.getConfidence()
    33
    >>> match.getLanguage()
    'ru'
    >>> [m.getName() for m in matches]
    ['windows-1251', 'ISO-8859-6', 'ISO-8859-8-I', 'ISO-8859-8']
    >>> [m.getConfidence() for m in matches]
    [33, 13, 8, 8]

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor