On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <[EMAIL PROTECTED]> wrote:
> >
> > > Don't use old 8-bit encodings. Use UTF-8.
> >
> > Yes, I'll try. But it's a problem when I only want to read the
> > content, not write or create it.
> > I suppose Microsoft's commercial success is to blame. They won't adhere
> > to standards if that doesn't make sense for their business.
> >
> > I'll change my approach: filter the contents with htmllib and map
> > those troublesome characters myself.
> > Anyway this has been a very instructive dive into unicode for me, I've got
> > things cleared up now.
> >
> > Thanks to everyone for the great help.
> >
>
> There are a number of code points (150 being one of them) that are used
> in cp1252, which are reserved for control characters in ISO-8859-1.
> Those characters will pretty much never be used in ISO-8859-1 documents.
> If you're expecting documents of both types coming in, test for the
> presence of those characters, and assume cp1252 for those documents.
>
> Something like:
>
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
>
> Note that the else matches the for, not the if.
>
> You can figure out the characters to match on by looking at the
> wikipedia pages for the encodings.

One warning: this works if you know all your documents are in one of
those two encodings, but it will mangle documents in other encodings,
such as UTF-8. Fortunately UTF-8 is a strict encoding: text in an 8-bit
encoding that contains non-ASCII bytes is rarely valid UTF-8, so a
failed decode is a good hint. You can usually test whether a document
is valid UTF-8 just by wrapping the decode in a try/except block:

try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:  # decoding bytes raises the Decode error, not Encode
    pass  # not valid UTF-8: do the stuff above (the cp1252/latin-1 check) here

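Putting the two checks together, here is a minimal sketch of how the
whole thing might look. It is written for Python 3, where encoded_text
is a bytes object (an assumption; the snippets above are Python 2
style). The names guess_decode and C1_RANGE are made up for
illustration, and the whole 0x80-0x9F byte range stands in for
control_chars. One wrinkle: Python's cp1252 codec leaves a handful of
bytes undefined, so the sketch falls back to latin-1 if that decode
fails.

```python
# Bytes 0x80-0x9F are C1 control codes in ISO-8859-1 but mostly
# printable characters (curly quotes, dashes, ...) in cp1252.
C1_RANGE = frozenset(range(0x80, 0xA0))


def guess_decode(encoded_text):
    """Return (unicode_text, encoding_name) for a bytes object.

    Hypothetical helper: it only knows about UTF-8, cp1252 and latin-1.
    """
    # Strict UTF-8 decoding rejects most text written in an 8-bit encoding.
    try:
        return encoded_text.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass

    # Any byte in the C1 range suggests cp1252 rather than ISO-8859-1.
    if any(byte in C1_RANGE for byte in encoded_text):
        try:
            return encoded_text.decode('cp1252'), 'cp1252'
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252.
            pass

    # latin-1 maps every byte to a code point, so this never fails.
    return encoded_text.decode('latin-1'), 'latin-1'
```

For example, guess_decode(b'a \x96 b') picks cp1252 and turns byte 150
into an en dash, while guess_decode(b'caf\xe9') falls through to
latin-1. It is still only a guess, for the reasons given in this thread.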
None of these are perfect methods, but then again, if text encoding
detection were a perfect science, python could just handle it on its
own.
If in doubt, prompt the user for confirmation.
Maybe others can share better "best practices."
Cheers,
Cliff