Mike Dee wrote:
If I have this in the beginning of my Python script in Linux:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

should I - or should I not - be able to use non-ASCII characters in strings, in Tk GUI button labels and window titles, and in raw_input data, without Python returning the wrong case in manipulated strings and/or garbled characters in the Tk GUI title?

If you use byte strings, you should expect mojibake. The coding declaration primarily affects Unicode literals and has little effect on byte string literals, so try putting a "u" in front of each such string.
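The difference can be sketched like this (the string u"äöü" is just illustrative; any non-ASCII text behaves the same way):

```python
# -*- coding: utf-8 -*-
label = u"äöü"               # Unicode string: three characters
raw = label.encode('utf-8')  # the six bytes a plain byte-string literal would hold
assert len(label) == 3       # Python counts characters...
assert len(raw) == 6         # ...not the bytes of their UTF-8 encoding
```

With the "u" prefix, Python (and Tk) deal in characters; without it, they deal in encoding-dependent bytes.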

With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the German / Swedish / Finnish / etc "umlauted" letter A (= a diaresis; that is an 'A' with two dots above it, or an O with two dots above.)

You explicitly requested that these characters are *not* ISO-8859-1, by saying that you want them as UTF-8. The LATIN CAPITAL LETTER A WITH DIAERESIS can be encoded in many different character sets, e.g. ISO-8859-15, windows-1252, UTF-8, UTF-16, euc-jp, T.101, ...
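A quick sketch of that point: one character, several different byte sequences (the encodings below are chosen for illustration):

```python
ch = u'\xc4'  # LATIN CAPITAL LETTER A WITH DIAERESIS, U+00C4
assert ch.encode('iso-8859-15') == b'\xc4'    # one byte
assert ch.encode('utf-8') == b'\xc3\x84'      # two bytes
assert ch.encode('utf-16-be') == b'\x00\xc4'  # two different bytes
```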

In different encodings, different byte sequences are used to represent
the same character. If you pass a byte string to Tk, it does not know
which encoding you meant to use (this is known in the Python source,
but lost on the way to Tk). So it guesses ISO-8859-1; this guess is
wrong because it really is UTF-8 in your case.
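That mis-guess can be reproduced directly, a sketch assuming UTF-8 bytes read back as ISO-8859-1:

```python
utf8 = u'\xe4'.encode('utf-8')          # b'\xc3\xa4', the UTF-8 bytes for 'ä'
guess = utf8.decode('iso-8859-1')       # the wrong ISO-8859-1 guess
assert guess == u'\xc3\xa4'             # two characters, 'Ã' and '¤': mojibake
assert utf8.decode('utf-8') == u'\xe4'  # the intended single character 'ä'
```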

OTOH, if you use a Unicode string, it is very clear what internal
representation each character has.

How would you go about making a script where a) the user types in any text (which might or might not include umlauted characters), b) that text then gets uppercased, lowercased or "titled", and c) printed?

Use Unicode.
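A minimal sketch of that pipeline, with a fixed sample string standing in for raw_input, and assuming the user's terminal speaks UTF-8:

```python
# -*- coding: utf-8 -*-
encoding = 'utf-8'  # assumption: the terminal's encoding

# stands in for the bytes raw_input() would return on such a terminal
raw = u'heinz m\xfcller'.encode(encoding)

text = raw.decode(encoding)  # bytes -> Unicode, once, on the way in
assert text.title() == u'Heinz M\xfcller'
assert text.upper() == u'HEINZ M\xdcLLER'

# Unicode -> bytes, once, on the way out; a print statement on a
# correctly configured terminal does this encoding implicitly
assert text.title().encode(encoding) == b'Heinz M\xc3\xbcller'
```

Decode once at the boundary coming in, work on Unicode strings in between, encode once going out.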

Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning, and then just get the user's raw_input, mangle it about with .title() or some such tool, and then just spit it out with a print statement?

No.
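Byte-string methods only know about ASCII; the non-ASCII bytes pass through untouched, which is exactly the "wrong case" effect. A sketch (illustrative string):

```python
b = u'\xe4bc'.encode('utf-8')          # b'\xc3\xa4bc', the UTF-8 bytes for 'äbc'
assert b.title() == b'\xc3\xa4Bc'      # 'ä' untouched, 'b' wrongly capitalized
assert u'\xe4bc'.title() == u'\xc4bc'  # the Unicode string gets it right: 'Äbc'
```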

One can hardly expect the users to type characters like unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc', or u"äöü".encode('utf-8') or whatnot, and encode & decode to and fro till the cows come home just to get a letter or two in their name to show up correctly.

This is not necessary.

Am I beyond hope?

Perhaps not. You should, however, familiarize yourself with the notion of character encodings: how the same character can have different byte representations, and how the same byte representation can have different interpretations as a character. If libraries disagree on how to interpret bytes as characters, you get mojibake ("ghost characters", a Japanese term for the problem; Japanese users have been familiar with it for a long time).

The Python Unicode type solves these problems for good, but you
need to use it correctly.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list
