On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <[EMAIL PROTECTED]> wrote:
> >
> > > Don't use old 8-bit encodings. Use UTF-8.
> >
> > Yes, I'll try. But is a problem when I only want to read, not that I'm
> > trying to write or create the content.
> > To blame I suppose is Microsoft's commercial success. They won't adhere to
> > standars if that doesn't make sense for their business.
> >
> > I'll change the approach trying to filter the contents with htmllib and
> > mapping on my own those troubling characters.
> > Anyway this has been a very instructive dive into unicode for me, I've got
> > things cleared up now.
> >
> > Thanks to everyone for the great help.
> >
>
> There are a number of code points (150 being one of them) that are used
> in cp1252, which are reserved for control characters in ISO-8859-1.
> Those characters will pretty much never be used in ISO-8859-1 documents.
> If you're expecting documents of both types coming in, test for the
> presence of those characters, and assume cp1252 for those documents.
>
> Something like:
>
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
>
> Note that the else matches the for, not the if.
>
> You can figure out the characters to match on by looking at the
> wikipedia pages for the encodings.

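To make the quoted check concrete, here is a minimal sketch, assuming the
input is a byte string in one of those two encodings. The names C1_RANGE
and decode_latin1_or_cp1252 are made up for illustration. It also assumes
that a handful of bytes in the range have no assignment in Python's cp1252
codec, so it falls back to latin-1 if that decode fails.

```python
# Bytes 0x80-0x9F are C1 control characters in ISO-8859-1, but mostly
# printable characters (curly quotes, dashes, the euro sign) in cp1252.
C1_RANGE = frozenset(range(0x80, 0xA0))


def decode_latin1_or_cp1252(encoded_text):
    """Decode as cp1252 if any C1-range byte is present, else as latin-1.

    Hypothetical helper: assumes the input really is one of the two.
    """
    # bytearray yields integers on both Python 2 and Python 3.
    if any(b in C1_RANGE for b in bytearray(encoded_text)):
        try:
            return encoded_text.decode('cp1252')
        except UnicodeDecodeError:
            # Five bytes in the range (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
            # unassigned in Python's cp1252 codec; fall back to latin-1.
            pass
    return encoded_text.decode('latin-1')
```

With this, decode_latin1_or_cp1252(b'a \x96 b') comes back with an en dash
(U+2013) in the middle, since byte 150 is the en dash in cp1252, while
text with no bytes in that range is decoded as latin-1.
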
One warning: This works if you know all your documents are in one of
those two encodings, but you could break other encodings, like UTF-8,
this way. Fortunately UTF-8 is a pretty fragile encoding, so it's easy
to break: non-ASCII bytes from an 8-bit encoding rarely happen to form
valid UTF-8 sequences. You can usually test if a document is decent
UTF-8 just by wrapping the decode in a try/except block:

try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:
    # Decoding raises UnicodeDecodeError, not UnicodeEncodeError.
    # Not valid UTF-8, so do the cp1252/latin-1 stuff above here.

None of these are perfect methods, but then again, if text encoding
detection were a perfect science, Python could just handle it on its
own.

If in doubt, prompt the user for confirmation.

Maybe others can share better "best practices."

Cheers,
Cliff

-- 
http://mail.python.org/mailman/listinfo/python-list