Jeremy Reichman wrote:
I have some characters in line strings in a file I'm processing that appear
to be Unicode. (When I print them to the shell from my script, they are
Asian characters for files like fonts in the Mac OS X filesystem.)

When I run a.split() on the affected line strings, they split on what I'm
guessing is considered a Unicode whitespace character. Specifically, the
culprit seems to be '\xe1':

$ python -c 'print "\xe1"'
?

Actually, u'\xe1' is a lowercase accented a: á (if the Unicode comes through email OK), so I doubt that Python is splitting on that.
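
A quick way to check that claim, as a minimal sketch (assuming an interpreter that accepts u"" literals, i.e. Python 2 or Python 3.3+; the sample string "caf\xe1 bar" is made up for illustration):

```python
import unicodedata

ch = u"\xe1"

# Look up the official Unicode name of the code point.
print(unicodedata.name(ch))    # LATIN SMALL LETTER A WITH ACUTE

# Unicode-aware split() only breaks on characters for which
# isspace() is true, and this one is not whitespace.
print(ch.isspace())            # False

# Splitting a unicode string that contains the character
# breaks only at the real space.
parts = u"caf\xe1 bar".split()
print(len(parts))              # 2
```

If the split happens on a unicode object like this one, the á stays inside its word.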

Also, when you do the above, you're creating a regular string, not a unicode object. If you do:

$ python -c 'print u"\xe1"'
á

You may get the right thing, if your terminal is set up to display Unicode.
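
To make the difference between a byte string and a unicode object concrete, here is a small sketch (assuming Python 2.6+ or 3.3+, and assuming UTF-8 as the example encoding): the same character is one code point as text, but two bytes once encoded.

```python
text = u"\xe1"                # one Unicode code point, U+00E1
data = text.encode("utf8")    # the same character as UTF-8 bytes

print(len(text))              # 1
print(len(data))              # 2
print(data == b"\xc3\xa1")    # True
```

So a lone '\xe1' byte in a regular string is not the same thing as the character u'\xe1'.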

I suspect your problem is that you aren't decoding the input file correctly. The whole problem with Unicode (and indeed, any non-ASCII encoding) is that you need to know what encoding your data is in before you can use it. If it looks mostly OK when interpreted as ASCII, then it MIGHT be UTF-8, so try reading in your file and decoding it this way:

contents = myfile.read().decode('utf8')

Then do your splitting. If it's not utf8, then you'll need to figure out what it is.
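
Roughly, the decode-then-split approach could look like the sketch below. It uses a made-up byte string in place of your file contents, and decode_utf8_or_none is a hypothetical helper name, not something from the standard library. It assumes Python 2.6+ or 3.3+.

```python
def decode_utf8_or_none(raw):
    """Return raw decoded as UTF-8, or None if it is not valid UTF-8."""
    try:
        return raw.decode("utf8")
    except UnicodeDecodeError:
        return None

# Made-up sample standing in for myfile.read(): UTF-8 bytes for
# an accented word, two katakana characters, and a file name.
raw = b"caf\xc3\xa1 \xe3\x83\x92\xe3\x83\xa9 font.otf\n"

text = decode_utf8_or_none(raw)
fields = text.split()          # split only after decoding
print(len(fields))             # 3

# A lone \xe1 byte followed by a space is not valid UTF-8,
# so the helper reports failure instead of guessing.
print(decode_utf8_or_none(b"\xe1 abc") is None)   # True
```

If the helper returns None for your data, that is the hint that the file is in some other encoding.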

First, read this:
http://www.joelonsoftware.com/articles/Unicode.html

Then take a look at some of the Python Unicode tutorials; this is only one of them:

http://www.reportlab.com/i18n/python_unicode_tutorial.html

There are other good ones.

-Chris



--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

[EMAIL PROTECTED]

_______________________________________________
Pythonmac-SIG maillist  -  Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig
