Mike Dee <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> A very very basic UTF-8 question that's driving me nuts:
>
> If I have this in the beginning of my Python script in Linux:
>
> #!/usr/bin/env python
> # -*- coding: UTF-8 -*-
>
> should I - or should I not - be able to use non-ASCII characters
> in strings

For string literals, with the "coding" declaration, Python will accept that the bytes sitting in your source file inside the string literal didn't get there by accident - i.e. that you meant to put the bytes [0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5] into a string when you entered "æøå" in a UTF-8-enabled text editor. (Apologies if you don't see the three Scandinavian characters properly there in your preferred newsreader.)

For Unicode literals (e.g. u"æøå" in that same UTF-8-enabled text editor), Python will not only accept the above but also use the "coding" declaration to produce a Unicode object which unambiguously represents the sequence of characters - i.e. something that can be used/processed to expose the intended characters in your program at run-time without any confusion about which characters are being represented.

> and in Tk GUI button labels and GUI window titles and in
> raw_input data without Python returning wrong case in manipulated
> strings and/or gibberished characters in Tk GUI title?

This is the challenging part. Having just experimented with using both string literals and Unicode literals with Tkinter on a Fedora Core 3 system, with a program edited in a UTF-8 environment and with a ubiquitous UTF-8-based locale, I found that it's actually possible to write non-ASCII characters into those literals and to get Tkinter to display them as I intended, but I've struck lucky with that particular combination - a not entirely unintended consequence of the Red Hat people going all out for UTF-8 everywhere (see below for more on that).

Consider this snippet (with that "coding" declaration at the top):

    button1["text"] = "æøå"
    button2["text"] = u"æøå"

In an environment with UTF-8-enabled editors, my program running in a UTF-8 locale, and with Tk supporting treating things as UTF-8 (I would imagine), I see what I intended. But what if I choose to edit my program in an editor employing a different encoding?
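
As a hedged illustration of the byte/character distinction above, here is a minimal sketch. It is written for Python 3, an assumption made purely so that it runs on a current interpreter: there, bytes stands in for the plain string of the Python discussed here, and str for the Unicode object. The variable names are made up for the example, and the characters are written as escapes so that the snippet itself doesn't depend on the encoding of the file it is saved in.

```python
# The three characters from the example (ae ligature, o with stroke,
# a with ring), written as escapes so the file's own encoding is irrelevant.
characters = "\u00e6\u00f8\u00e5"

# What a UTF-8 editor would store in the source file for those characters.
stored_bytes = characters.encode("utf-8")

# Show the individual byte values and the two different lengths.
print([hex(b) for b in stored_bytes])
print(len(characters), "characters,", len(stored_bytes), "bytes")

# Decoding with the correct encoding recovers the characters.
assert stored_bytes.decode("utf-8") == characters

# The same characters saved in another encoding (ISO-8859-1 here, chosen
# only for illustration) are a different byte sequence altogether.
other_bytes = characters.encode("iso-8859-1")
assert other_bytes != stored_bytes
```

The bytes only mean "æøå" once you say which encoding they are in, and that is the question a Unicode object has already settled.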
Let's say I enter the program in an editor employing the mac-iceland encoding, even declaring it in the "coding" declaration at the top of the file. Running the program now yields a very strange label for the first button, but a correct label for the second one.

What happens is that with a non-UTF-8 source file, running in a UTF-8 locale with the same Tk as before, the text for the first button consists of a sequence of bytes that Tk then interprets incorrectly (probably as ISO-8859-1 as a sort of failsafe mode when it doesn't think the text is encoded using UTF-8), whereas the text for the second button is encoded from the unambiguous Unicode representation and is not misunderstood by Tk.

Now, one might argue that I should change the locale to suit the encoding of the text file, but it soon becomes very impractical to take this approach. Besides, I don't think mac-iceland (an admittedly bizarre example) forms part of a supported locale on the system I have access to.

> With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the
> German / Swedish / Finnish / etc "umlauted" letter A (= a diaresis;
> that is an 'A' with two dots above it, or an O with two dots above.)
>
> In Linux in the Tk(?) GUI of my 'program' I get an uppercase "A"
> with a tilde above - followed by a general currency symbol ['spider'].
> That is, two wrong characters where a small umlauted letter "a"
> should be.

That suggests that the bytes used to represent your character were produced by a UTF-8 encoding of that character. Sadly, Tk then chooses to interpret them as ISO-8859-1, I guess. One thing to verify is whether Tk is aware of anything other than ISO-8859-1 on your system; another thing is to use Unicode objects and literals to at least avoid the guessing games.

> But in Windows XP exactly the *same* code (The initiating "#!/usr/bin
> /env python" and all..) works just fine in the Tk GUI - non-ascii
> characters showing just as they should.
> (The code in both cases is
> without any u' prefixes in strings.)

It's pretty much a rule with internationalised applications in Python that Unicode is the way to go, even if it seems hard at first. This means that you should use Unicode literals in your programs should the need arise - I can't say it does very often in my case.

> I have UTF-8 set as the encoding of my Suse 9.2 / KDE localization, I
> have saved my 'source code' in UTF-8 format and I have tried to read
> *a lot* of information about Unicode and I have heard it said many
> times that Python handles unicode very well -- so why can it be so
> bl**dy difficult to get an umlauted (two-dotted) letter a to be
> properly handled by Python 2.3? In Windows I have Python 2.4 - but the
> following case-insanity applies for Windows-Python as well:

Python does handle Unicode pretty well, but it's getting the data in and out of Python, combined with other components and their means of presenting/representing that data, which is usually the problem.

> For example, if I do this in my Linux konsole (no difference whether it
> be in KDE Konsole window or the non-gui one via CTRL-ALT-F2):
>
> >>>aoumlautxyz="12xyz" # number 1 = umlauted a, number 2 = uml o
> >>>print aoumlautxyz.(upper)
>
> then the resulting string is NOT all upper case - it is a lowercase
> umlauted a, then a lowercase umlauted o then uppercase XYZ

In this case, you're using normal strings, which are effectively locale-dependent byte strings. I guess that what happens is that the byte sequence gets passed to the system's string processing routines, which then fail to convert the non-ASCII characters to upper case according to their understanding of what the bytes actually mean in terms of being characters.

I'll accept that consoles can be pretty nasty in exposing the right encoding to Python (search for "setlocale" in the comp.lang.python archives) and that without some trickery you won't get decent results.
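
Both symptoms described so far can be reproduced without Tk or a console. The following is only a sketch, written for Python 3 as an assumption for the sake of having something runnable today; its bytes type is not locale-aware at all, so it is an analogy for the plain strings discussed here and not a reproduction of the exact Python 2.3 behaviour. The variable names are invented for the example.

```python
# "a with diaeresis", "o with diaeresis", then "xyz", written as escapes
# so that the snippet does not depend on the encoding of its own file.
text = "\u00e4\u00f6xyz"
utf8_bytes = text.encode("utf-8")

# Symptom 1: UTF-8 bytes read as ISO-8859-1. Each umlauted letter becomes
# two characters, the first being A with tilde (U+00C3).
misread = utf8_bytes.decode("iso-8859-1")
print(ascii(misread[:2]))  # what appears in place of the single letter

# Symptom 2: upper() applied to the bytes only touches the ASCII letters,
# so the bytes of the two umlauted letters pass through unchanged.
print(utf8_bytes.upper())

# Decoding first, with the right encoding, lets upper() see characters.
upper_text = utf8_bytes.decode("utf-8").upper()
print(ascii(upper_text))
```

The catch is in the last step: decoding only works if you know which encoding the bytes are in, and that is exactly what the console tends not to make obvious.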
However, by employing Unicode objects and explicitly identifying the encoding of the input, you should get the results you are looking for:

    aoumlautxyz = unicode("äöxyz", "utf-8")

This assumes that UTF-8 really is the encoding of the input.

[...]

> One can hardly expect the users to type characters like unicode('\xc3\
> xa4\xc3\xb6\xc3\xbc', 'utf-8')u'\xe4\xf6\xfc' u"äöü".encode('utf-8') or
> whatnot, and encode & decode to and fro till the cows come home just to
> get a letter or two in their name to show up correctly.

With a "coding" declaration and Unicode literals, you won't even need to use the unicode constructor. Again, I must confess that much of the data I work with doesn't originate in the program code itself, so I rarely need to even think of this issue.

> It's a shame that the Linux Cookbook, Learning Python 2nd ed, Absolute
> beginners guide to Python, Running Linux, Linux in a Nutshell, Suse 9.2
> Pro manuals and the online documentation I have bumped into with Google
> (like in unicode.org or python.org or even the Python Programming Faq
> 1.3.9 / Unicode error) do not contain enough - or simple enough -
> information for a Python/Linux newbie to get 'it'.

One side-effect of the "big push" to UTF-8 amongst the Linux distribution vendors/maintainers is the evasion of issues such as filesystem encodings and "real" Unicode at the system level. In Python, when you have a Unicode object, you are dealing with idealised sequences of characters, whereas in many system and library APIs out there you either get back a sequence of anonymous bytes or a sequence of UTF-8 bytes that people are pretending is Unicode, right up until the point where someone recompiles the software to use UTF-16 instead, thus causing havoc.

Anyone who has needed to expose filesystems created by Linux distributions before the UTF-8 "big push" to later distributions can attest to the fact that the "see no evil" brass monkey is wearing a T-shirt with "UTF-8" written on it.

Paul
--
http://mail.python.org/mailman/listinfo/python-list