Re: Unicode chr(150) en dash

J. Clifford Dyer Fri, 18 Apr 2008 04:38:48 -0700

On Fri, 2008-04-18 at 07:27 -0400, J. Clifford Dyer wrote:
> On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> > On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> > hdante <[EMAIL PROTECTED]> wrote:
> > 
> > >  Don't use old 8-bit encodings. Use UTF-8.
> > 
> > Yes, I'll try. But it is a problem when I only want to read, not that I'm 
> > trying to write or create the content.
> > To blame I suppose is Microsoft's commercial success. They won't adhere to 
> > standards if that doesn't make sense for their business.
> > 
> > I'll change the approach trying to filter the contents with htmllib and 
> > mapping on my own those troubling characters.
> > Anyway this has been a very instructive dive into unicode for me, I've got 
> > things cleared up now.
> > 
> > Thanks to everyone for the great help.
> > 
> 
> There are a number of code points (150 being one of them) that are used
> in cp1252, which are reserved for control characters in ISO-8859-1.
> Those characters will pretty much never be used in ISO-8859-1 documents.
> If you're expecting documents of both types coming in, test for the
> presence of those characters, and assume cp1252 for those documents.  
> 
> Something like:
> 
> for c in control_chars:
>     if c in encoded_text:
>         unicode_text = encoded_text.decode('cp1252')
>         break
> else:
>     unicode_text = encoded_text.decode('latin-1')
> 
> Note that the else matches the for, not the if.
> 
> You can figure out the characters to match on by looking at the
> Wikipedia pages for the encodings.
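
A minimal sketch of the check quoted above, assuming Python 3 bytes
objects (the snippets in this thread are Python 2 style). The contents of
control_chars and the function name decode_cp1252_or_latin1 are
illustrative assumptions, not something given in the thread. The range
0x80-0x9F is where ISO-8859-1 has control characters and cp1252 has
mostly printable ones (0x96, i.e. chr(150), is the en dash).

```python
# Assumption: control_chars holds the single bytes 0x80-0x9F, which are
# C1 control characters in ISO-8859-1 but mostly printable in cp1252.
control_chars = [bytes([b]) for b in range(0x80, 0xA0)]


def decode_cp1252_or_latin1(encoded_text):
    """Guess between cp1252 and latin-1 for a bytes object."""
    for c in control_chars:
        if c in encoded_text:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D
            # undefined, so 'replace' is used here to avoid an exception.
            unicode_text = encoded_text.decode('cp1252', 'replace')
            break
    else:
        # Runs only when the loop finished without hitting break.
        unicode_text = encoded_text.decode('latin-1')
    return unicode_text
```

For example, decode_cp1252_or_latin1(b'a\x96b') should give the string
with an en dash (U+2013) in the middle, while b'caf\xe9' goes down the
latin-1 branch.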


One warning: This works if you know all your documents are in one of
those two encodings, but it will mis-decode documents in other encodings,
such as UTF-8.  Fortunately, UTF-8 is strict about its multi-byte
sequences, so cp1252 or latin-1 text containing non-ASCII bytes will
almost never pass as valid UTF-8.  You can usually test whether a document
is valid UTF-8 just by wrapping the decode in a try/except block:

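try:
    unicode_text = encoded_text.decode('utf-8')
except UnicodeDecodeError:  # decoding failures raise UnicodeDecodeError
    # not valid UTF-8: do the cp1252 / latin-1 check above
    if any(c in encoded_text for c in control_chars):
        unicode_text = encoded_text.decode('cp1252')
    else:
        unicode_text = encoded_text.decode('latin-1')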
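
Putting the two steps together, here is a hedged sketch for Python 3
bytes objects. The function name guess_decode and the choice to return
the guessed encoding alongside the text are assumptions made for
illustration. It tries strict UTF-8 first, then cp1252 if any byte in
0x80-0x9F is present, and falls back to latin-1, which accepts every
byte value.

```python
def guess_decode(encoded_text):
    """Return (text, encoding_name), guessing among utf-8, cp1252, latin-1."""
    try:
        # Strict UTF-8: malformed multi-byte sequences raise UnicodeDecodeError.
        return encoded_text.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError:
        pass

    # Bytes 0x80-0x9F suggest cp1252 rather than ISO-8859-1.
    if any(0x80 <= b <= 0x9F for b in bytearray(encoded_text)):
        try:
            return encoded_text.decode('cp1252'), 'cp1252'
        except UnicodeDecodeError:
            # Hit one of the bytes Python's cp1252 codec leaves undefined.
            pass

    # latin-1 maps every byte to a code point, so this cannot fail.
    return encoded_text.decode('latin-1'), 'latin-1'
```

Returning the encoding name makes it easier to show the guess to the
user when a confirmation prompt is wanted.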

None of these methods is perfect, but then again, if text encoding
detection were a perfect science, Python could just handle it on its
own.

If in doubt, prompt the user for confirmation.

Maybe others can share better "best practices."

Cheers,
Cliff

-- 
http://mail.python.org/mailman/listinfo/python-list
