Re: Help with character encodings

Gary Herron Tue, 20 May 2008 09:03:35 -0700

Gary Herron wrote:

A_H wrote:

Help!


I've scraped a PDF file for text and all the minus signs come back as
u'\xad'.

Is there any easy way I can change them all to plain old ASCII '-' ???

str.replace complained about a missing codec.



Hints?


Encoding it into a 'latin1' encoded string seems to work:

 >>> print u'\xad'.encode('latin1')
 -

That might be what you want, but really, it was not a very well thought-out answer. Here's a better answer:




Using the unicodedata module, I see that the character you have, u'\xad', is

   SOFT HYPHEN (codepoint 173=0xad)
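For reference, the lookup can be reproduced roughly like this. It is only a minimal sketch, assuming an interpreter that accepts the u'' literal (Python 2, or Python 3.3 and later); the variable names are just for illustration.

```python
import unicodedata

ch = u'\xad'

# Ask the Unicode database for the character's official name.
name = unicodedata.name(ch)
codepoint = ord(ch)

print(name)            # SOFT HYPHEN
print(codepoint)       # 173
print(hex(codepoint))  # 0xad
```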

If you want to replace that with the more familiar HYPHEN-MINUS (codepoint 45), you can use the string replace method, but stick with all-unicode values so you don't provoke a conversion to an ASCII-encoded string:


   >>> print u'ABC\xadDEF'.replace(u'\xad','-')
   ABC-DEF

But does this really solve your problem? If there is the possibility of other unicode characters in your data, this is heading down the wrong track, and the question (which I can't answer) becomes: What are you going to do with the string?
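One way to find out whether other unicode characters are lurking in the scraped text is to list every distinct non-ASCII character along with its name. The following is just a sketch under assumptions: non_ascii_report is a hypothetical helper name, and the sample string is made up for illustration, not taken from your PDF.

```python
import unicodedata

def non_ascii_report(text):
    """Return sorted (codepoint, name) pairs for each distinct non-ASCII character."""
    found = set(ch for ch in text if ord(ch) > 127)
    # The second argument to unicodedata.name() is a fallback for unnamed characters.
    return sorted((ord(ch), unicodedata.name(ch, 'UNKNOWN')) for ch in found)

# Made-up sample: a soft hyphen, an en dash, and an accented letter.
sample = u'ABC\xadDEF \u2013 caf\xe9'

for codepoint, name in non_ascii_report(sample):
    print('U+%04X %s' % (codepoint, name))
# U+00AD SOFT HYPHEN
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+2013 EN DASH
```

If the report shows only the soft hyphen, the simple replace above is probably enough; if it shows more, you need to decide what to do with each of them.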

If you are going to display it via a GUI that understands UTF-8, then encode the string as utf8 and display it -- no need to convert the hyphens.

If you are trying to display it somewhere that is not unicode (or UTF-8) aware, then you'll have to convert it. In that case, encoding it as latin1 is probably a good choice, but beware: That does not convert the u'\xad' to a chr(45) (the usual HYPHEN-MINUS), but instead to chr(173), which (on latin1 aware applications) will display as the usual hyphen. In any case, it won't be ascii (in the strict sense that ascii is chr(0) through chr(127)). If you *really* *really* wanted straight strict ascii, replace chr(173) with chr(45).
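To make the three options concrete, here is a small sketch of what each conversion produces for the sample string used earlier. It assumes the soft hyphen is the only non-ASCII character in the text (otherwise the strict ascii encode raises UnicodeEncodeError), and the variable names are illustrative only.

```python
text = u'ABC\xadDEF'

# UTF-8: the soft hyphen becomes the two bytes 0xC2 0xAD.
utf8_bytes = text.encode('utf-8')

# latin1: the soft hyphen becomes the single byte 0xAD (173), still not ascii.
latin1_bytes = text.encode('latin-1')

# Strict ascii: swap the soft hyphen for HYPHEN-MINUS (45) first, then encode.
ascii_bytes = text.replace(u'\xad', u'-').encode('ascii')

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(repr(ascii_bytes))
```

Doing the replace on the unicode string, before encoding, keeps everything unicode until the last step.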


Gary Herron




--
http://mail.python.org/mailman/listinfo/python-list
