On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <[EMAIL PROTECTED]> wrote:
>
> > Don't use old 8-bit encodings. Use UTF-8.
>
> Yes, I'll try. But it is a problem when I only want to read, not when I'm
> trying to write or create the content.
> I suppose Microsoft's commercial success is to blame. They won't adhere to
> standards if that doesn't make sense for their business.
>
> I'll change the approach: filter the contents with htmllib and map those
> troublesome characters on my own.
> Anyway this has been a very instructive dive into unicode for me, I've got
> things cleared up now.
>
> Thanks to everyone for the great help.
>
There are a number of byte values (150 being one of them) in the range
128-159 (0x80-0x9F) that cp1252 uses for printable characters, but which
are reserved for control characters in ISO-8859-1.
Those bytes will pretty much never be used in ISO-8859-1 documents.
If you're expecting documents of both types coming in, test for the
presence of those characters, and assume cp1252 for those documents.
Something like:
    for c in control_chars:
        if c in encoded_text:
            unicode_text = encoded_text.decode('cp1252')
            break
    else:
        unicode_text = encoded_text.decode('latin-1')
Note that the else matches the for, not the if.
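
In case the for/else pairing is unfamiliar, here is a small sketch of how
it behaves. The function name find_first is made up for illustration; it
is not from any library.

```python
def find_first(needles, haystack):
    """Return the first needle found in haystack, or None if none match."""
    for needle in needles:
        if needle in haystack:
            result = needle
            break
    else:
        # This branch belongs to the for loop: it runs only when the loop
        # finished without hitting break, i.e. no needle was found.
        result = None
    return result
```

So the else branch is the "nothing matched" case, which is why the latin-1
decode lives there in the snippet above.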
You can figure out the characters to match on by looking at the
Wikipedia pages for the encodings.
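
If you would rather not build the list by hand, a rough sketch of the same
test is below. It assumes Python 3 (where iterating over bytes yields
ints) and treats every byte in 0x80-0x9F as a sign of cp1252. The names
C1_RANGE and decode_western are made up for this example. Note that
Python's cp1252 codec leaves five of those bytes undefined (0x81, 0x8D,
0x8F, 0x90, 0x9D), so the sketch decodes with errors='replace' rather
than letting those raise; adjust that to taste.

```python
# Bytes 0x80-0x9F are C1 control codes in ISO-8859-1, but mostly printable
# characters (curly quotes, dashes, the euro sign, ...) in cp1252.
C1_RANGE = range(0x80, 0xA0)


def decode_western(encoded_text):
    """Guess between cp1252 and latin-1 for a bytes object and decode it."""
    if any(byte in C1_RANGE for byte in encoded_text):
        # At least one byte that ISO-8859-1 documents almost never contain,
        # so assume cp1252. Undefined bytes become U+FFFD instead of raising.
        return encoded_text.decode('cp1252', errors='replace')
    # No such bytes: both encodings agree on everything else.
    return encoded_text.decode('latin-1')
```

For example, decode_western(b'a \x96 b') decodes byte 150 as an en dash
(U+2013), while decode_western(b'caf\xe9') goes through the latin-1 branch.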
Cheers,
Cliff
--
http://mail.python.org/mailman/listinfo/python-list