"[EMAIL PROTECTED]" wrote:

> Question: what is a good strategy for taking an 8bit
> string of unknown encoding and recovering the largest
> amount of reasonable information from it (translated to
> utf8 if needed)? The string might be in any of the
> myriad encodings that predate unicode. Has anyone
> done this in Python already? The output must be clean
> utf8 suitable for arbitrary xml parsers.

some alternatives:

braindead bruteforce:

    try to do strict decoding as utf-8.  if you succeed, you have
    a utf-8 string.  if not, assume iso-8859-1.

slightly smarter bruteforce:

    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/163743

more advanced (but possibly not good enough for very short texts):

    http://chardet.feedparser.org/

</F>
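
for reference, the "braindead bruteforce" variant could look roughly like this. this is only a minimal sketch in Python 3 syntax, assuming the input arrives as a bytes object; the function name to_utf8 is made up for illustration and is not taken from the recipe or from chardet.

```python
def to_utf8(raw: bytes) -> bytes:
    """Re-encode raw as UTF-8, guessing the source encoding (hypothetical helper)."""
    try:
        # strict decoding raises UnicodeDecodeError on any invalid UTF-8 sequence
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every possible byte to a code point,
        # so this fallback cannot fail (but it may guess wrong)
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(to_utf8(b"caf\xc3\xa9"))  # already valid UTF-8
    print(to_utf8(b"caf\xe9"))      # treated as ISO-8859-1
```

since the fallback never raises, a wrong guess shows up as garbled characters rather than as an error. the result can also still contain control characters that XML 1.0 does not allow, so it may need a further filtering step before it goes to an xml parser.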