Re: string processing question

Scott David Daniels Fri, 01 May 2009 09:05:58 -0700

Kurt Mueller wrote:

Scott David Daniels wrote:

To discover what is happening, try something like:
    python -c 'for a in "ä", unicode("ä"): print len(a), a'


I suspect that in your encoding, "ä" is two bytes long, and in
unicode it is converted to a single character.


:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

Yes it is. That is one of the two problems I see.
The solution for this is to unicode(<string>, <coding>) each string.


I'd like to have my python programs unicode enabled.




:> python -c 'for a in "ä", unicode("ä"): print len(a), a'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

It seems that the default encoding is "ascii", so unicode() cannot cope
with "ä".
If I specify "utf8" for the encoding, unicode() works.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä

:>
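
The 2-versus-1 length difference above comes from counting bytes versus counting characters. The following is only a minimal sketch, assuming the terminal sends UTF-8 (so "ä" arrives as the two bytes 0xC3 0xA4). The names `raw`, `text` and `ascii_ok` are made up for illustration, and byte escapes are used so the source stays pure ASCII and should run on Python 2.6+ as well as 3.x:

```python
# "ä" as it arrives from a UTF-8 terminal: two bytes.
raw = b"\xc3\xa4"

# Decoding with an explicit codec yields one character of text.
text = raw.decode("utf-8")

# Decoding as ASCII fails, because 0xC3 is outside range(128).
# This is what unicode("ä") attempts in 2.x when no encoding is given.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Here `len(raw)` is 2 and `len(text)` is 1, matching the output shown above.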


But the print statement yields a UnicodeEncodeError
if I pipe the output to a program or a file.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
2 ä
1 :>


So it seems to me that piping the output changes the behavior of the
print statement:

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a, len(a), type(a)'
ä 2 <type 'str'>
ä 1 <type 'unicode'>

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a, len(a), type(a)'  | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
ä 2 <type 'str'>
:>
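
The difference between the two runs above is consistent with how Python 2 picks an output encoding: when stdout is a terminal, `sys.stdout.encoding` is taken from the locale, but when stdout is a pipe or a file it is `None`, and printing a unicode object falls back to ASCII. One way around this is to encode explicitly before writing. This is only a sketch: `write_utf8` is a hypothetical helper name, and `io.BytesIO` stands in for the real pipe so the example is self-contained:

```python
import io

def write_utf8(stream, text):
    """Encode text as UTF-8 and write the resulting bytes to a binary stream."""
    stream.write(text.encode("utf-8"))

# io.BytesIO stands in for a pipe or a file opened in binary mode.
sink = io.BytesIO()
write_utf8(sink, u"\xe4\n")   # u"\xe4" is "ä", spelled with an ASCII escape
written = sink.getvalue()     # the two UTF-8 bytes for "ä", then a newline
```

In a real 2.x program the stream would be `sys.stdout` (in 3.x, `sys.stdout.buffer`). Alternatively, setting the `PYTHONIOENCODING` environment variable to `utf-8` (Python 2.6 and later) should make print encode to UTF-8 even when the output is piped.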




How can I make my Python programs unicode enabled:
- Input strings can have different encodings (mostly ascii, latin_1 or utf8)
- My python programs should always output "utf8".

Is that a good idea??


OK, the issue here is your use of -c, rather than an actual source file.
I don't know how to make -c take the magic initial encoding line.
If you rely on ascii source, you are safe, but have to write things like
     ms = u'That would be na\u00EFve.'
 or  ms = u'That would be na\xEFve.'
 or  ms = u'That would be na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve.'
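
As a quick check that these escape forms really are interchangeable, here is a small sketch (the variable names are arbitrary, and the source is pure ASCII, so no encoding line is needed):

```python
# Three ASCII-only spellings of the same unicode string.
a = u'That would be na\u00EFve.'
b = u'That would be na\xEFve.'
c = u'That would be na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve.'

same = (a == b == c)

# Encoding to UTF-8 turns the single character U+00EF into two bytes.
utf8_bytes = a.encode("utf-8")
```

All three compare equal, and the UTF-8 form is one byte longer than the character count.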

If you do put an encoding line in your source (first or second line):
     # -*- coding: utf-8 -*-
 or  # -*- coding: iso-8859-1 -*-
 or  # -*- coding: latin-1 -*-

you can (later in that file) simply use:
     ms = u'That would be naïve.'

That is, I would avoid non-ascii source for plain strings in 2.X unless
you have a _very_ good reason; use it, instead, for unicode strings.
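
On the question of accepting input in ASCII, Latin-1 or UTF-8 and always writing UTF-8: a common approach is to decode to unicode as soon as data comes in, work with unicode internally, and encode once on the way out. What follows is a rough sketch of that idea, not a tested recipe. `to_text` and `to_utf8` are hypothetical names, and trying UTF-8 first with Latin-1 as a fallback is an assumption about the data, not something Python can determine for you:

```python
def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode a byte string by trying each codec in order.

    ASCII is a subset of UTF-8, so it needs no separate attempt.
    Latin-1 can decode any byte sequence, so it must come last.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            pass
    raise ValueError("none of %r could decode the input" % (encodings,))

def to_utf8(raw):
    """Normalise input bytes in one of the tried encodings to UTF-8 bytes."""
    return to_text(raw).encode("utf-8")
```

With this, `to_utf8(b"\xe4")` (Latin-1 "ä") and `to_utf8(b"\xc3\xa4")` (UTF-8 "ä") both give the same UTF-8 bytes. The guess can be wrong: Latin-1 text whose bytes happen to form valid UTF-8 will be read as UTF-8, so if the encoding of an input is actually known, passing it explicitly is safer.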

--Scott David Daniels
