Re: string processing question

norseman Fri, 01 May 2009 16:59:55 -0700

Piet van Oostrum wrote:

Kurt Mueller <[email protected]> (KM) wrote:

KM> But from the command line python interprets the code
KM> as 'latin_1' I presume. That is why I have to convert
KM> the "ä" with unicode().
KM> Am I right?


There are a couple of stages:
1. Your terminal emulator interprets your keystrokes, encodes them in a
   sequence of bytes and passes them to the shell. How the characters
   are encoded depends on the encoding used in the terminal emulator. So
   for example when the terminal is set to utf-8, your "ä" is converted
   to two bytes: \xc3 and \xa4.

2. The shell passes these bytes to the python command.
3. The python interpreter must interpret these bytes with some decoding.

   If you use them in a bytes string they are copied as such, so in the
   example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
   If your terminal encoding would have been iso-8859-1, the string
   would have had a single byte '\xe4'. If you use it in a unicode
   string the Python parser has to convert it to unicode. If there is an
   encoding declaration in the source then that is used. Of course it
   should be the same as the actual encoding used by the shell (or the
   editor when you have a script saved in a file) otherwise you have a
   problem. If there is no encoding declaration in the source Python has
   to guess. It appears that in Python 2.x the default is iso-8859-1 but
   in Python 3.x it will be utf-8. You should avoid making any
   assumptions about this default.
4. During runtime unicode characters that have to be printed, written to
   a file, passed as file names or arguments to other processes etc.
   have to be encoded again to a sequence of bytes. In this case Python
   refuses to guess. Also you can't use the same encoding as in step 3,
   because the program can run on a completely different system than
   where it was compiled to byte code. So if the (unicode) string isn't
   ASCII and no encoding is given you get an error. The encoding can be
   given explicitly, or depending on the context, by sys.stdout.encoding,
   sys.getdefaultencoding or PYTHONIOENCODING (from 2.6 on).
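
The stages above can be sketched roughly as follows. This is only an illustration and assumes Python 3 (the thread itself is about Python 2.x, where the literals and print syntax differ); the variable names are made up for the example.

```python
# Python 3 sketch (assumption: the thread discusses Python 2.x).
# The character is written as an escape so this file's own encoding is irrelevant.
char = "\u00e4"  # the letter a with diaeresis

# Stage 1: the bytes the terminal sends depend on the terminal's encoding.
utf8_bytes = char.encode("utf-8")         # two bytes
latin1_bytes = char.encode("iso-8859-1")  # one byte

# Stage 3: decoding UTF-8 bytes with the wrong codec yields two characters.
misread = utf8_bytes.decode("iso-8859-1")

# Stage 4: encoding a non-ASCII string with the ascii codec is refused.
try:
    char.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(utf8_bytes, latin1_bytes)
print(len(misread), ascii(misread))
print(type(ascii_error).__name__)
```

The misread value is the same two-character result that shows up when UTF-8 bytes are treated as iso-8859-1.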

Unfortunately there is no equivalent to PYTHONIOENCODING for the
interpretation of the source text; it only works at run time.
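
To see the run-time effect of PYTHONIOENCODING, something like the following could be tried. It is a sketch that assumes Python 3 and the standard subprocess module; the helper name run_with_io_encoding is hypothetical and not from the thread.

```python
import os
import subprocess
import sys

# Child program: print the letter a with diaeresis (escaped, so this source stays ASCII).
child_code = 'print("\\u00e4")'

def run_with_io_encoding(encoding):
    """Run the child with PYTHONIOENCODING set and return the raw bytes it wrote."""
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip()  # drop the trailing newline

# The same character leaves the process as different byte sequences.
print(run_with_io_encoding("utf-8"))
print(run_with_io_encoding("iso-8859-1"))
```

Only the output encoding changes here; the way the child's source text is interpreted is not affected by the variable.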

Example:
python -c 'print len(u"ä")'
prints 2 on my system, because my terminal is utf-8 so the ä is passed
as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
iso-8859-1 bytes.

If I do
python -c 'print u"ä"'
in my terminal I therefore get two characters: Ã¤

but if I do this in Emacs I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)
because my Emacs doesn't pass the encoding of its terminal emulation.

However:
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.
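
The effect of the coding line can also be reproduced without a terminal. The sketch below assumes Python 3, where compile() accepts source as bytes and honours a coding declaration on the first line; the helper name literal_length is hypothetical.

```python
def literal_length(declared_encoding):
    """Compile a tiny source given as raw bytes and return len() of its string literal.

    The literal always contains the two bytes 0xC3 0xA4 (UTF-8 for a with
    diaeresis); only the declared source encoding changes.
    """
    header = b"# -*- coding: " + declared_encoding.encode("ascii") + b" -*-\n"
    body = b'text = "\xc3\xa4"\n'
    namespace = {}
    exec(compile(header + body, "<demo>", "exec"), namespace)
    return len(namespace["text"])

print(literal_length("utf-8"))       # bytes read as one character
print(literal_length("iso-8859-1"))  # same bytes read as two characters
```

The bytes of the literal are identical in both runs, so the difference in length comes only from the declaration.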

===============================

Thank you. I knew there had to be something simpler than brute force.

I have missed seeing the explanations for:
    python -c '# -*- coding:utf-8 -*-

in the 2.5 docs. Where can I find these? (the python -c is for config, I presume?)


By the way - the however: python...\nprint... snippet bombs in 2.5.2
1st bomb:  looking for closing '    #so I add one and remove one below
2nd bomb:  bad syntax               # I play awhile and join EMACS
3rd bomb:   Non-ASCII character '\xe4' in file....no encoding declared..

Python flatly states it's not ASCII and quits. Python print refuses to handle high bit set bytes in 2.5.2....

The thank you is for pointing out how it works. I can use sed to fix for file listing purposes. (Python won't like them, but a second pass thru sed can give me something python can use and the two names can go on a line on the cheat sheet.)


Barry, Kurt - do you understand using sed to change the incoming names?

Put the python in a box and use the Linux mc, ls, sed and echo routines to get the names into a form python can use while making the cheat sheet at the same time. Substitutions like a for ä will generally be acceptable. Yes or No? The cheat sheet can show the ä in the original name because the OS functions allow it. I have no doubt there will be some exceptions. :(

Once the names are "ASCII" you can get the python out & put it to work.
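
For comparison, the a-for-ä substitution and the cheat sheet line could be sketched in Python itself. This is not the sed approach described above but a swapped-in technique: it assumes Python 3 and uses Unicode decomposition from the standard unicodedata module. The names to_ascii and cheat_sheet_line are hypothetical.

```python
import unicodedata

def to_ascii(name):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def cheat_sheet_line(name):
    """Pair the ASCII form with the original name, tab separated."""
    return to_ascii(name) + "\t" + name

# Example names; the first contains a with diaeresis, written as an escape.
names = ["B\u00e4r.txt", "plain.txt"]
for name in names:
    print(ascii(cheat_sheet_line(name)))
```

Characters that have no ASCII base letter after decomposition are simply dropped by this sketch, so those would be among the exceptions and need a manual entry on the cheat sheet.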

Just to head off the comments that it's not .... whatever

ls -1 | cheater.scr | python_program.py    IS PURE UNIX

Unix is designed for this. Files from different parts of the world? If you can see the name as something besides ????? make a cheater for each 'Page'.
           mc /path/to/dir/of/choice

           ls -1 >dummy
           highlight dummy
           F3
           F4    and read the hex
takes me longer to type it in here than to do it. (leading spaces)  :)
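
The "read the hex" step can be approximated in Python as well. A minimal sketch, assuming Python 3; the function name hex_view is hypothetical, and the encoding argument stands in for whatever encoding the names actually arrive in.

```python
def hex_view(name, encoding="utf-8"):
    """Return the bytes of a name as space-separated two-digit hex values."""
    raw = name.encode(encoding)
    return " ".join("%02x" % byte for byte in raw)

# The same character, shown as it would look under two terminal encodings.
print(hex_view("\u00e4"))
print(hex_view("\u00e4", "iso-8859-1"))
```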

Today: 20090430


Steve

ps. Piet - thanks for including the version specifics. It makes a huge difference in expectations and allowances.

--
http://mail.python.org/mailman/listinfo/python-list
