Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-23 Thread Paul Boddie
Serge Orlov [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...
> Paul Boddie wrote:
>> Anyone who has needed to expose filesystems
>> created by Linux distributions before the UTF-8 big push to later
>> distributions can attest to the fact that the "see no evil" brass
>> monkey is wearing a T-shirt with "UTF-8" written on it.
>
> Unfortunately the monkey is painted in the air with a stick, so
> not everyone can see it. Python can't. Given a random linux system
> how can you tell if the monkey has pushed it already or not?

That's a good question. See this article for an example of the
frustration caused:

http://groups.google.no/groups?selm=b1npav%24cci%241%40slb6.atl.mindspring.net&output=gplain

Paul
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Serge Orlov
Mike Dee wrote:
> [snip wrestling with byte strings]

In addition to Martin's reply I just want to add two notes:
1. The interactive console in Python 2.3 has a bug that was fixed
in 2.4, so in 2.3 you can't enter unicode strings at the prompt:

C:\Python24>python.exe
>>> a=u'абв'
>>> a
u'\u0430\u0431\u0432'

C:\Python23>python.exe
>>> a=u'абв'
>>> a
u'\xa0\xa1\xa2'

In 2.3 you need to use the decode method to get unicode strings:
>>> import sys
>>> a2='абв'.decode(sys.stdin.encoding)
>>> a2
u'\u0430\u0431\u0432'

2. Suse ships a buggy build of Python, so title() doesn't work
properly; see the discussion at http://tinyurl.com/4k3au

>>> print aoumlautxyz.title()
äöXyz

You will need to call setlocale to help you:

>>> import locale
>>> locale.setlocale(locale.LC_ALL,'')
'en_US.utf-8'
>>> print aoumlautxyz.title()
Äöxyz

  Serge.



Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Mariano Draghi
A long time ago in a galaxy far, far away, Mike Dee wrote:
> A very very basic UTF-8 question that's driving me nuts:
> If I have this in the beginning of my Python script in Linux:
> #!/usr/bin/env python
> # -*- coding: UTF-8 -*-
> should I - or should I not - be able to use non-ASCII characters
> in strings and in Tk GUI button labels and GUI window titles and in
> raw_input data without Python returning wrong case in manipulated
> strings and/or gibberished characters in Tk GUI title?
...

I'd recommend reading "The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)", by Joel Spolsky:
- http://www.joelonsoftware.com/articles/Unicode.html

It's not specific to Python (or to any other language), but I find
it *very* useful.

Regards,
--
Mariano


Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Fuzzyman

Mike Dee wrote:
> A very very basic UTF-8 question that's driving me nuts:
>
> If I have this in the beginning of my Python script in Linux:
>
> #!/usr/bin/env python
> # -*- coding: UTF-8 -*-
>
> should I - or should I not - be able to use non-ASCII characters
> in strings and in Tk GUI button labels and GUI window titles and in
> raw_input data without Python returning wrong case in manipulated
> strings and/or gibberished characters in Tk GUI title?



[snip..]

Yet another reply... :-)

My understanding is that the encoding declaration (as above) only
applies to the source code - and will not make your string literals
into unicode objects, nor set the default encoding for the interpreter.


This means string literals in your source code will be encoded as
UTF-8 - if you handle them with normal string operations you might get
funny results.
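
The "funny results" can be reproduced in a small sketch. It is written for Python 3, which is an assumption on my part since this thread concerns Python 2.3/2.4: Python 3's bytes type plays the role of the old byte string and str plays the role of unicode. The variable names are only illustrative.

```python
# Python 3 sketch: bytes stands in for the Python 2 byte string,
# str stands in for the Python 2 unicode type.
text = "\u00e4\u00f6xyz"          # a-umlaut, o-umlaut, then "xyz"
raw = text.encode("utf-8")        # what a UTF-8 source file stores in a plain literal

# Byte-level case operations only know about ASCII letters.
raw_upper = raw.upper().decode("utf-8")
raw_title = raw.title().decode("utf-8")

# Character-level operations treat the umlauts as letters.
text_upper = text.upper()
text_title = text.title()

print(len(raw), len(text))        # 7 bytes versus 5 characters
print(raw_upper, text_upper)
print(raw_title, text_title)
```

The byte-level results leave the umlauts lowercase and capitalise the "x" in title(), which matches the behaviour described in the original question.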


Regards,

Fuzzy
http://www.voidspace.org.uk/python/index.shtml



Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Max M
Fuzzyman wrote:
> Mike Dee wrote:
>
>> #!/usr/bin/env python
>> # -*- coding: UTF-8 -*-
>
> This will mean string literals in your source code will be encoded as
> UTF8 - if you handle them with normal string operations you might get
> funny results.

It means that you don't have to explicitly set the encoding on strings.

If your coding isn't set you must write:

ust = 'æøå'.decode('utf-8')

If it is set, you can just write:

ust = u'æøå'

And this string will automatically be utf-8 encoded:

st = 'æøå'

So you should be able to convert it to unicode without giving an encoding:

ust = unicode(st)
--
hilsen/regards Max M, Denmark
http://www.mxm.dk/
IT's Mad Science


Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Duncan Booth
Max M wrote:

> And this string will automatically be utf-8 encoded:
>
> st = 'æøå'
>
> So you should be able to convert it to unicode without giving an
> encoding:
>
> ust = unicode(st)
 
No.

Strings have no knowledge of their encoding. As you describe, the string
will be UTF-8 encoded, but you still have to tell Python that when you
decode it.
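
A minimal sketch of this point, assuming Python 3 (where bytes corresponds to the Python 2 byte string discussed here): the bytes carry no encoding label, so the decoder has to be told which one to use, and falling back to ASCII fails.

```python
# Python 3 sketch: a bytes object carries no record of its encoding.
raw = b"\xc3\xa6\xc3\xb8\xc3\xa5"     # UTF-8 bytes for ae-ligature, o-slash, a-ring

# Naming the right encoding explicitly recovers three characters.
text = raw.decode("utf-8")

# Decoding with an ASCII default, roughly what unicode(st) attempted
# in Python 2, fails on the first non-ASCII byte.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(len(text), ascii_failed)
```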


Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Fuzzyman

Max M wrote:
> Fuzzyman wrote:
>> Mike Dee wrote:
>>
>>> #!/usr/bin/env python
>>> # -*- coding: UTF-8 -*-
>>
>> This will mean string literals in your source code will be encoded as
>> UTF8 - if you handle them with normal string operations you might get
>> funny results.
>
> It means that you don't have to explicitly set the encoding on strings.
>
> If your coding isn't set you must write:
>
> ust = 'æøå'.decode('utf-8')


Which is now deprecated, isn't it? (That is, including encoded string
literals in source without declaring an encoding.)

> If it is set, you can just write:
>
> ust = u'æøå'
>
> And this string will automatically be utf-8 encoded:
>
> st = 'æøå'
>
> So you should be able to convert it to unicode without giving an
> encoding:
>
> ust = unicode(st)


So all your non unicode string literals will be utf-8 encoded. Normal
string operations will handle them with the default encoding, which is
likely to be something else. A likely source of confusion, unless you
handle everything as unicode.

But then I suppose if you have any non-ascii characters in your source
code you *must* be explicit about what encoding they are in, or you are
asking for trouble.

Regards,


Fuzzy
http://www.voidspace.org.uk/python/index.shtml

> --
>
> hilsen/regards Max M, Denmark
>
> http://www.mxm.dk/
> IT's Mad Science



Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Paul Boddie
Mike Dee [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...
> A very very basic UTF-8 question that's driving me nuts:
>
> If I have this in the beginning of my Python script in Linux:
>
> #!/usr/bin/env python
> # -*- coding: UTF-8 -*-
>
> should I - or should I not - be able to use non-ASCII characters
> in strings

For string literals, with the coding declaration, Python will accept
that the bytes sitting in your source file inside the string literal
didn't get there by accident - ie. that you meant to put the bytes
[0xC3, 0xA6, 0xC3, 0xB8, 0xC3, 0xA5] into a string when you entered
æøå in a UTF-8-enabled text editor. (Apologies if you don't see the
three Scandinavian characters properly there in your preferred
newsreader.)

For Unicode literals (eg. u"æøå" in that same UTF-8-enabled text
editor), Python will not only accept the above but also use the
coding declaration to produce a Unicode object which unambiguously
represents the sequence of characters - ie. something that can be
used/processed to expose the intended characters in your program at
run-time without any confusion about which characters are being
represented.
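
The two views of the same text can be inspected directly. This is a sketch in Python 3 (an assumption, since the thread is about Python 2), where a str is the unambiguous character sequence and encoding it yields the bytes a UTF-8 editor writes to disk.

```python
# Python 3 sketch: characters versus their UTF-8 bytes.
text = "\u00e6\u00f8\u00e5"                  # the three Scandinavian characters
utf8_bytes = list(text.encode("utf-8"))      # what sits in the source file
code_points = [ord(ch) for ch in text]       # what the Unicode object holds

print([hex(b) for b in utf8_bytes])          # six byte values
print([hex(cp) for cp in code_points])       # three code points
```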

> and in Tk GUI button labels and GUI window titles and in
> raw_input data without Python returning wrong case in manipulated
> strings and/or gibberished characters in Tk GUI title?

This is the challenging part. Having just experimented with using both
string literals and Unicode literals with Tkinter on a Fedora Core 3
system, with a program edited in a UTF-8 environment and with a
ubiquitous UTF-8-based locale, it's actually possible to write
non-ASCII characters into those literals and to get Tkinter to display
them as I intended, but I've struck lucky with that particular
combination - a not entirely unintended consequence of the Red Hat
people going all out for UTF-8 everywhere (see below for more on
that).

Consider this snippet (with that coding declaration at the top):

  button1["text"] = "æøå"
  button2["text"] = u"æøå"

In an environment with UTF-8-enabled editors, my program running in a
UTF-8 locale, and with Tk able to treat text as UTF-8 (I would
imagine), I see what I intended. But what if I choose to edit my
program in an editor employing a different encoding? Let's say I enter
the program in an editor employing the mac-iceland encoding, even
declaring it in the coding declaration at the top of the file.
Running the program now yields a very strange label for the first
button, but a correct label for the second one.

What happens is that with a non-UTF-8 source file, running in a UTF-8
locale with the same Tk as before, the text for the first button
consists of a sequence of bytes that Tk then interprets incorrectly
(probably as ISO-8859-1 as a sort of failsafe mode when it doesn't
think the text is encoded using UTF-8), whereas the text for the
second button is encoded from the unambiguous Unicode representation
and is not misunderstood by Tk.
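
The difference between the two buttons can be sketched without Tk at all. The following assumes Python 3 and uses the standard mac_iceland codec; the ISO-8859-1 fallback is only the guess made above about what Tk does, not something this sketch verifies.

```python
# Python 3 sketch: why the byte-string label breaks and the Unicode one does not.
text = "\u00e6\u00f8\u00e5"

# A plain literal saved by a mac-iceland editor is just these bytes.
source_bytes = text.encode("mac_iceland")

# A consumer that assumes ISO-8859-1 maps every byte to some character,
# so it never fails, but the result is not the intended text.
misread = source_bytes.decode("iso-8859-1")

# A Unicode literal is decoded once, using the declared source encoding,
# and can then be re-encoded correctly for whatever the consumer expects.
as_unicode = source_bytes.decode("mac_iceland")
for_utf8_consumer = as_unicode.encode("utf-8")

print(ascii(misread))
print(for_utf8_consumer.decode("utf-8") == text)
```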

Now, one might argue that I should change the locale to suit the
encoding of the text file, but it soon becomes very impractical to
take this approach. Besides, I don't think mac-iceland (an admittedly
bizarre example) forms part of a supported locale on the system I have
access to.

> With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the
> German / Swedish / Finnish / etc umlauted letter A  (= a diaeresis;
> that is an 'A' with two dots above it, or an O with two dots above.)
>
> In Linux in the Tk(?) GUI of my 'program' I get an uppercase A
> with a tilde above - followed by a general currency symbol ['spider'].
> That is, two wrong characters where a small umlauted letter a
> should be.

That sort of demonstrates that the bytes used to represent your
character are produced by a UTF-8 encoding of that character. Sadly,
Tk then chooses to interpret them as ISO-8859-1, I guess. One thing to
verify is whether Tk is aware of anything other than ISO-8859-1 on
your system; another thing is to use Unicode objects and literals to at
least avoid the guessing games.
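
The reported symptom is exactly what this diagnosis predicts. A short Python 3 sketch (an assumption; the thread itself uses Python 2) encodes a small a-umlaut as UTF-8 and then reads the two bytes back as ISO-8859-1:

```python
import unicodedata

# Python 3 sketch: UTF-8 bytes misread as ISO-8859-1.
intended = "\u00e4"                          # small a with diaeresis
utf8_bytes = intended.encode("utf-8")        # two bytes: 0xC3 0xA4
misread = utf8_bytes.decode("iso-8859-1")    # one character per byte

names = [unicodedata.name(ch) for ch in misread]
print(names)
```

The two resulting characters are an uppercase A with tilde and the general currency sign, the pair described in the quoted report.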

> But in Windows XP exactly the *same* code (The initiating #!/usr/bin
> /env python and all..) works just fine in the Tk GUI - non-ascii
> characters showing just as they should. (The code in both cases is
> without any u' prefixes in strings.)

It's pretty much a rule with internationalised applications in Python
that Unicode is the way to go, even if it seems hard at first. This
means that you should use Unicode literals in your programs should the
need arise - I can't say it does very often in my case.

> I have UTF-8 set as the encoding of my Suse 9.2 / KDE localization, I
> have saved my 'source code' in UTF-8 format and I have tried to read
> *a lot* of information about Unicode and I have heard it said many
> times that Python handles unicode very well -- so why can it be so
> bl**dy difficult to get an umlauted (two-dotted) letter a to be
> properly handled by Python 2.3? In Windows I have Python 2.4 - but the
> following case-insanity

Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-22 Thread Serge Orlov
Paul Boddie wrote:
> One side-effect of the big push to UTF-8 amongst the Linux
> distribution vendors/maintainers is the evasion of issues such as
> filesystem encodings and real Unicode at the system level. In
> Python, when you have a Unicode object, you are dealing with idealised
> sequences of characters, whereas in many system and library APIs out
> there you either get back a sequence of anonymous bytes or a sequence
> of UTF-8 bytes that people are pretending is Unicode, right up until
> the point where someone recompiles the software to use UTF-16 instead,
> thus causing havoc. Anyone who has needed to expose filesystems
> created by Linux distributions before the UTF-8 big push to later
> distributions can attest to the fact that the "see no evil" brass
> monkey is wearing a T-shirt with "UTF-8" written on it.

Unfortunately the monkey is painted in the air with a stick, so
not everyone can see it. Python can't. Given a random linux system
how can you tell if the monkey has pushed it already or not?

  Serge.



UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-21 Thread Mike Dee

   A very very basic UTF-8 question that's driving me nuts:

If I have this in the beginning of my Python script in Linux:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

should I - or should I not - be able to use non-ASCII characters 
in strings and in Tk GUI button labels and GUI window titles and in 
raw_input data without Python returning wrong case in manipulated 
strings and/or gibberished characters in Tk GUI title? 



With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the 
German / Swedish / Finnish / etc umlauted letter A  (= a diaeresis; 
that is an 'A' with two dots above it, or an O with two dots above.)

In Linux in the Tk(?) GUI of my 'program' I get an uppercase A 
with a tilde above - followed by a general currency symbol ['spider']. 
That is, two wrong characters where a small umlauted letter a 
should be. 

But in Windows XP exactly the *same* code (The initiating #!/usr/bin
/env python and all..) works just fine in the Tk GUI - non-ascii 
characters showing just as they should. (The code in both cases is 
without any u' prefixes in strings.)



I have UTF-8 set as the encoding of my Suse 9.2 / KDE localization, I 
have saved my 'source code' in UTF-8 format and I have tried to read
*a lot* of information about Unicode and I have heard it said many 
times that Python handles unicode very well -- so why can it be so 
bl**dy difficult to get an umlauted (two-dotted) letter a to be 
properly handled by Python 2.3? In Windows I have Python 2.4 - but the 
following case-insanity applies for Windows-Python as well:

For example, if I do this in my Linux konsole (no difference whether it 
be in KDE Konsole window or the non-gui one via CTRL-ALT-F2):

aoumlautxyz="12xyz"   # number 1 = umlauted a, number 2 = uml o 
print aoumlautxyz.upper()

then the resulting string is NOT all upper case - it is a lowercase 
umlauted a, then a lowercase umlauted o then uppercase XYZ

And command:

 print aoumlautxyz.title()

..results in a string where a-umlaut, o-umlaut and yz are lowercase and 
only the X in the middle is uppercase.  

this print aoumlautxyz.lower()  

.. prints o.k. 


Am I missing something very basic here? Earlier there was a difference in 
my results between running the scripts in the CTRL ALT F2-konsole and the 
KDE-one, but I think running unicode_start & installing a unicode console
font at some point of time ironed that one out.

If this is due to some strange library, could someone please give me a 
push to a spot where to read about fixing it? Or am I just too stupid, 
and that's it. (I bet that really is what it boils down to..)


<rant>

I cannot be the only (non-pro) person in Europe who might need to use non-
ASCII characters in GUI titles / button labels, in strings provided by the 
users of the software with raw_input (like person's name that begins with 
an umlauted letter or includes one or several of them) ..in comments, and 
so on.

How would you go about making a script where a) the user types in any text 
(that might or might not include umlauted characters) and b) that text then 
gets uppercased, lowercased or titled and c) printed? 

Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning,
and then just like get the user's raw_input, mangle it about with .title() 
or some such tool, and then just spit it out with a print statement?

One can hardly expect the users to type characters like
unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc',
u"äöü".encode('utf-8') or whatnot, and encode & decode to and fro till
the cows come home just to get a letter or two in their name to show up
correctly. 

It's a shame that the Linux Cookbook, Learning Python 2nd ed, Absolute 
beginners guide to Python, Running Linux, Linux in a Nutshell, Suse 9.2 
Pro manuals and the online documentation I have bumped into with Google
(like in unicode.org or python.org or even the Python Programming Faq 
1.3.9 / Unicode error) do not contain enough - or simple enough - 
information for a Python/Linux newbie to get 'it'.

For what it's worth, in Kmail my encoding is ISO-8859-1. I tried that 
coding in my KDE and my Python scripts earlier too, but it was 
no better; actually that was why I started this Unicode sh..  ..thing. 


Am I beyond hope?

</rant>


Mike d 




Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-21 Thread Martin v. Löwis
Mike Dee wrote:
> If I have this in the beginning of my Python script in Linux:
>
> #!/usr/bin/env python
> # -*- coding: UTF-8 -*-
>
> should I - or should I not - be able to use non-ASCII characters
> in strings and in Tk GUI button labels and GUI window titles and in
> raw_input data without Python returning wrong case in manipulated
> strings and/or gibberished characters in Tk GUI title?

If you use byte strings, you should expect moji-bake. The coding
declaration primarily affects Unicode literals, and has little
effect on byte string literals. So try putting a "u" in front
of each such string.

> With non-ASCII characters I mean ( ISO-8859-1 ??) stuff like the
> German / Swedish / Finnish / etc umlauted letter A  (= a diaeresis;
> that is an 'A' with two dots above it, or an O with two dots above.)

You explicitly requested that these characters are *not* ISO-8859-1,
by saying that you want them as UTF-8. The LATIN CAPITAL LETTER A WITH
DIAERESIS can be encoded in many different character sets, e.g.
ISO-8859-15, windows1252, UTF-8, UTF-16, euc-jp, T.101, ...

In different encodings, different byte sequences are used to represent
the same character. If you pass a byte string to Tk, it does not know
which encoding you meant to use (this is known in the Python source,
but lost on the way to Tk). So it guesses ISO-8859-1; this guess is
wrong because it really is UTF-8 in your case.

OTOH, if you use a Unicode string, it is very clear what internal
representation each character has.
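
As an illustration of "same character, different bytes", here is a minimal sketch assuming Python 3 (the thread itself concerns Python 2). It encodes the one character with four of the encodings listed above, and then reads one of the resulting byte sequences back in two different ways.

```python
# Python 3 sketch: one character, several byte representations.
char = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
encodings = ["iso-8859-15", "cp1252", "utf-8", "utf-16-le"]
encoded = {name: char.encode(name) for name in encodings}

for name, data in encoded.items():
    print(name, data.hex())

# The reverse also holds: one byte sequence, two readings.
utf8_bytes = encoded["utf-8"]
two_readings = (utf8_bytes.decode("utf-8"), utf8_bytes.decode("cp1252"))
print(len(two_readings[0]), len(two_readings[1]))
```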

> How would you go about making a script where a) the user types in any text
> (that might or might not include umlauted characters) and b) that text then
> gets uppercased, lowercased or titled and c) printed?

Use Unicode.
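
A minimal sketch of what "use Unicode" means for such a script, assuming Python 3, where input already arrives as Unicode text. The helper names are hypothetical, and the bytes literal merely simulates what a UTF-8 terminal would deliver; in Python 2.x the equivalent step would be decoding the raw_input() result with the terminal's encoding (for example sys.stdin.encoding) before changing case.

```python
# Python 3 sketch: decode first, then change case on characters.
def case_variants(text):
    """Return the upper, lower and title-cased forms of a Unicode string."""
    return text.upper(), text.lower(), text.title()


def case_variants_from_bytes(raw, encoding):
    """Decode raw input bytes, then change case on the decoded characters."""
    return case_variants(raw.decode(encoding))


# Simulated user input: a-umlaut, o-umlaut, "xyz" as UTF-8 bytes.
typed = b"\xc3\xa4\xc3\xb6xyz"
upper, lower, title = case_variants_from_bytes(typed, "utf-8")
print(upper, lower, title)
```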

> Isn't it enough to have that 'UTF-8 encoding declaration' in the beginning,
> and then just like get the user's raw_input, mangle it about with .title()
> or some such tool, and then just spit it out with a print statement?

No.

> One can hardly expect the users to type characters like
> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8'), u'\xe4\xf6\xfc',
> u"äöü".encode('utf-8') or whatnot, and encode & decode to and fro till
> the cows come home just to get a letter or two in their name to show up
> correctly.

This is not necessary.

> Am I beyond hope?

Perhaps not. You should, however, familiarize yourself with the notion
of character encodings, and how the same character can have different
byte representations, and the same byte representation can have different
interpretations as a character. If libraries disagree on how to
interpret bytes as characters, you get moji-bake (ghost characters;
a Japanese term for the problem, as Japanese users have been familiar
with the problem for a long time).

The Python Unicode type solves these problems for good, but you
need to use it correctly.

Regards,
Martin


Re: UTF-8 / German, Scandinavian letters - is it really this difficult?? Linux Windows XP

2005-02-21 Thread Do Re Mi chel La Si Do
Hi !

To test your system, please read these pages:
http://sivanataraja.free.fr/config/test.htm  (French)
http://www.alanwood.net/unicode/  (English)

And for plenty of explanation: http://en.wikipedia.org/wiki/Unicode


And, to reassure you: I have had no problems using Unicode characters
with Tk (I tested French, German, Cyrillic (Russian) and Greek).


@-salutations
-- 
Michel Claveau




