Piet van Oostrum wrote:
Kurt Mueller <m...@problemlos.ch> (KM) wrote:
KM> But from the command line python interprets the code
KM> as 'latin_1' I presume. That is why I have to convert
KM> the "ä" with unicode().
KM> Am I right?
There are a couple of stages:
1. Your terminal emulator interprets your keystrokes, encodes them in a
sequence of bytes and passes them to the shell. How the characters
are encoded depends on the encoding used in the terminal emulator. So
for example when the terminal is set to utf-8, your "ä" is converted
to two bytes: \xc3 and \xa4.
2. The shell passes these bytes to the python command.
3. The python interpreter must interpret these bytes with some decoding.
If you use them in a bytes string they are copied as such, so in the
example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
If your terminal encoding would have been iso-8859-1, the string
would have had a single byte '\xe4'. If you use it in a unicode
string the Python parser has to convert it to unicode. If there is an
encoding declaration in the source then that is used. Of course it
should be the same as the actual encoding used by the shell (or the
editor when you have a script saved in a file) otherwise you have a
problem. If there is no encoding declaration in the source Python has
to guess. It appears that in Python 2.x the default for -c is iso-8859-1
(a source file without a declaration must be pure ASCII), but in Python
3.x it will be utf-8. You should avoid making any assumptions about this
default.
4. During runtime unicode characters that have to be printed, written to
a file, passed as file names or arguments to other processes etc.
have to be encoded again to a sequence of bytes. In this case Python
refuses to guess. Also you can't use the same encoding as in step 3,
because the program can run on a completely different system than
where it was compiled to byte code. So if the (unicode) string isn't
ASCII and no encoding is given you get an error. The encoding can be
given explicitly, or depending on the context, by sys.stdout.encoding,
sys.getdefaultencoding() or PYTHONIOENCODING (from 2.6 on).
Unfortunately there is no equivalent to PYTHONIOENCODING for the
interpretation of the source text; it only works at run time.
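The stages above can be replayed in a few lines. This is only a sketch, written in Python 3 syntax (the thread itself is about 2.x), and the terminal's encoding step is simulated with an explicit .encode() call. The variable names are made up for illustration.

```python
# Stage 1: a terminal set to utf-8 turns the keystroke into two bytes,
# a terminal set to iso-8859-1 turns it into one byte.
typed = "\u00e4"  # the character a-umlaut
utf8_bytes = typed.encode("utf-8")
latin1_bytes = typed.encode("iso-8859-1")

# Stage 3: the same two utf-8 bytes, decoded under two different assumptions.
as_utf8 = utf8_bytes.decode("utf-8")         # the right guess: one character
as_latin1 = utf8_bytes.decode("iso-8859-1")  # the wrong guess: two characters

print(repr(utf8_bytes), repr(latin1_bytes))
print(len(as_utf8), len(as_latin1))
```

The second decode never fails, because every byte value is a valid iso-8859-1 character. That is why a wrong guess shows up as a wrong length or garbled output rather than as an error.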
Example:
python -c 'print len(u"ä")'
prints 2 on my system, because my terminal is utf-8 so the ä is passed
as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
iso-8859-1 bytes.
If I do
python -c 'print u"ä"'
in my terminal I therefore get two characters: Ã¤
but if I do this in Emacs I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)
because my Emacs doesn't pass the encoding of its terminal emulation.
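The Emacs failure can be reproduced without Emacs. The sketch below assumes that "no known output encoding" amounts to encoding with ascii, which is how Python 2 behaves when sys.stdout.encoding is not set. It uses Python 3 syntax and hypothetical variable names.

```python
# The two characters that resulted from reading the utf-8 bytes as iso-8859-1.
text = "\u00c3\u00a4"

message = None
try:
    # What the output step does when it has nothing better than ascii.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

# With an explicit encoding the same string encodes without complaint.
encoded = text.encode("utf-8")
print(repr(encoded))
```

Note that the explicit encoding makes the error go away but does not repair the earlier mis-decoding: the result is four bytes, not the original two.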
However:
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.
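The effect of the coding line can also be checked from inside a program. This is a sketch in Python 3 syntax, where compile() accepts source as bytes and honours a coding declaration in it. literal_length is a hypothetical helper, not something from the thread.

```python
def literal_length(declared_encoding):
    """Compile a tiny source text under the given coding declaration and
    return the length of the string literal it contains."""
    # Source as raw bytes: a coding line, then an assignment whose
    # string literal holds the two utf-8 bytes for a-umlaut.
    source = (
        b"# -*- coding: " + declared_encoding.encode("ascii") + b" -*-\n"
        b'value = "\xc3\xa4"\n'
    )
    namespace = {}
    exec(compile(source, "<example>", "exec"), namespace)
    return len(namespace["value"])


print(literal_length("utf-8"))       # declaration matches the bytes
print(literal_length("iso-8859-1"))  # declaration does not match
```

The bytes are identical in both calls; only the declaration differs, and with it the number of characters the parser sees.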
===============================
Thank you. I knew there had to be something simpler than brute force.
I have missed seeing the explanations for:
python -c '# -*- coding:utf-8 -*-
in the 2.5 docs. Where can I find these? (the python -c is for config,
I presume?)
By the way - the however: python...\nprint... snippet bombs in 2.5.2
1st bomb: looking for closing ' #so I add one and remove one below
2nd bomb: bad syntax # I play awhile and join EMACS
3rd bomb: Non-ASCII character '\xe4' in file....no encoding declared..
Python flatly states it's not ASCII and quits. Python print refuses to
handle high bit set bytes in 2.5.2....
The thank you is for pointing out how it works. I can use sed to fix for
file listing purposes. (Python won't like them, but a second pass thru
sed can give me something python can use and the two names can go on a
line on the cheat sheet.)
Barry, Kurt - do you understand using sed to change the incoming names?
Put the python in a box and use the Linux mc, ls, sed and echo routines
to get the names into a form python can use while making the cheat sheet
at the same time. Substitutions like a for ä will generally be
acceptable. Yes or No? The cheat sheet can show the ä in the original
name because the OS functions allow it. I have no doubt there will be
some exceptions. :(
Once the names are "ASCII" you can get the python out & put it to work.
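For comparison, the a-for-ä substitution can also be done in Python instead of sed. This is a different technique from the sed pass described above: a minimal sketch using Unicode decomposition, in Python 3 syntax, with to_ascii as a made-up name. It assumes the names have already been decoded correctly.

```python
import unicodedata


def to_ascii(name):
    """Return an ASCII-only approximation of name."""
    # Split accented letters into base letter + combining mark, then
    # drop every character that is not plain ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


original = "M\u00e4rchen.txt"
print(ascii(original), "->", to_ascii(original))
```

Characters with no decomposition are dropped rather than replaced (ß simply disappears), so the exceptions mentioned above still need a hand-made table on the cheat sheet.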
Just to head off the comments that it's not .... whatever
ls -1 | cheater.scr | python_program.py IS PURE UNIX
Unix is designed for this. Files from different parts of the world? If
you can see the name as something besides ????? make a cheater for each
'Page'.
mc /path/to/dir/of/choice
ls -1 >dummy
highlight dummy
F3
F4 and read the hex
takes me longer to type it in here than to do it. (leading spaces) :)
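The same hex reading can be had without mc. A rough sketch in Python 3 syntax follows; hex_bytes is a hypothetical helper, and the encoding argument stands in for whatever the file system or terminal actually used.

```python
def hex_bytes(name, encoding="utf-8"):
    """Return the bytes of name in the given encoding as spaced hex pairs."""
    return " ".join("%02x" % byte for byte in name.encode(encoding))


print(hex_bytes("\u00e4"))                # the utf-8 form
print(hex_bytes("\u00e4", "iso-8859-1"))  # the iso-8859-1 form
```

Two hex pairs for one visible character point to utf-8, a single pair above 7f to a one-byte encoding such as iso-8859-1.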
Today: 20090430
Steve
ps. Piet - thanks for including the version specifics. It makes a huge
difference in expectations and allowances.
--
http://mail.python.org/mailman/listinfo/python-list