Re: unicode by default
Terry Reedy wrote: Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? Easy, practical use of unicode is still a work in progress.

Apparently... the good news for me is that SBL provides their unicode font here: http://www.sbl-site.org/educational/biblicalfonts.aspx

I'm getting much closer here, but now the problem is typing. The pain with unicode fonts is that the glyph is tied to the code point for the represented character, and not tied to any code point that matches any keyboard scan code for typing. :-} So, I can now see the ancient text with accents and apparatus in all of my editors, but I still cannot type any ancient Greek with my keyboard... because I have to make up a keymap first. sigh

I don't find that SBL (nor Logos Software) has provided keymaps as yet... rats. I can read the text with Python though... yes!

m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Fri, 13 May 2011 14:53:50 -0500, harrismh777 wrote: The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great !) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware? Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Using the original meaning of font (US) or fount (commonwealth), you can't have a single font cover the whole of Unicode. A font isn't a random set of glyphs, but a set of glyphs in a common style, which can only practically be achieved for a specific alphabet. You can bundle multiple fonts covering multiple repertoires into a single TTF (etc) file, but there's not much point. In software, the term font is commonly used to refer to some ad-hoc mapping between codepoints and glyphs. This typically works by either associating each specific font with a specific repertoire (set of codepoints), or by simply trying each font in order until one is found with the correct glyph. This is a sufficiently common problem that the FontConfig library exists to simplify a large part of it. Is this a Python issue at all? No. -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 14 mai, 09:41, harrismh777 harrismh...@charter.net wrote: ... I'm getting much closer here, ...

You should really understand that Unicode is a domain per se. It is independent of any OS, programming language or application. It is up to these tools to be unicode compliant. Working in a full unicode mode (at least for text) is today practically a solved problem. But you have to ensure the whole toolchain is unicode compliant (editors, fonts (OpenType technology), rendering devices, ...).

Tip: this list is certainly not the best place to get information. I suggest you start by reading about XeTeX. XeTeX is the new TeX engine working only in unicode mode. From this starting point, you will find plenty of web sites discussing the unicode world, tools, fonts, ... A variant is to visit sites about *typography*.

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 5/14/2011 3:41 AM, harrismh777 wrote: Terry Reedy wrote: Easy, practical use of unicode is still a work in progress. Apparently... the good news for me is that SBL provides their unicode font here: http://www.sbl-site.org/educational/biblicalfonts.aspx I'm getting much closer here, but now the problem is typing. The pain with unicode fonts is that the glyph is tied to the code point for the represented character, and not tied to any code point that matches any keyboard scan code for typing. :-} So, I can now see the ancient text with accents and apparatus in all of my editors, but I still cannot type any ancient Greek with my keyboard... because I have to make up a keymap first. sigh I don't find that SBL (nor Logos Software) has provided keymaps as yet... rats.

You need what is called, at least with Windows, an IME -- Input Method Editor. These are part of (or associated with) the OS, so they can be used with *any* application that will accept unicode chars (in whatever encoding) rather than just ascii chars. Windows has about a hundred or so, including Greek. I do not know if that includes classical Greek with the extra marks.

I can read the text with Python though... yes!

-- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
Terry Reedy tjre...@udel.edu writes: You need what is called, at least with Windows, an IME -- Input Method Editor. For a GNOME or KDE environment you want an input method framework; I recommend IBus URL:http://code.google.com/p/ibus/ which comes with the major GNU+Linux operating systems URL:http://oswatershed.org/pkg/ibus URL:http://packages.debian.org/squeeze/ibus . Then you have a wide range of input methods available. Many of them are specific to local writing systems. For writing special characters in English text, I use either ‘rfc1345’ or ‘latex’ within IBus. That allows special characters to be typed into any program which communicates with the desktop environment's input routines. Yay, unified input of special characters! Except Emacs :-( which fortunately has ‘ibus-el’ available to work with IBus URL:http://www.emacswiki.org/emacs/IBusMode :-). -- \ 己所不欲、勿施于人。| `\(What is undesirable to you, do not do to others.) | _o__) —孔夫子 Confucius, 551 BCE – 479 BCE | Ben Finney -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 12 mai, 18:17, Ian Kelly ian.g.ke...@gmail.com wrote: ... to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters A small but important correction/clarification: In Unicode, unicode does not encode a *character*. It encodes a *code point*, a number, the integer associated to the character. jmf -- http://mail.python.org/mailman/listinfo/python-list
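[A minimal Python 3 sketch of the code-point/character distinction made above: `ord()` maps a character to its code point (an integer), `chr()` maps back, and `unicodedata` gives the character's standard name.]

```python
import unicodedata

# A code point is just the integer associated with a character.
ch = "é"
cp = ord(ch)                  # character -> code point (an int)
name = unicodedata.name(ch)   # the character's standard Unicode name
roundtrip = chr(cp)           # code point -> character again
```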
Re: unicode by default
jmfauth wrote: to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters A small but important correction/clarification: In Unicode, unicode does not encode a *character*. It encodes a *code point*, a number, the integer associated to the character.

That is a huge code-point... pun intended. ... and there is another point that I continue to be somewhat puzzled about, and that is the issue of fonts. One of my hobbies at the moment is ancient Greek (biblical studies, Septuaginta LXX, and Greek New Testament). I have these texts on my computer in a folder in several formats... pdf, unicode 'plaintext', osis.xml, and XML. These texts may be found at http://sblgnt.com I am interested for the moment only in the 'plaintext' stream, because it is unicode. (First, in unicode, according to all the docs, there is no such thing as 'plaintext', so keep that in mind.)

When I open the text stream in one of my unicode editors I can see 'most' of the characters in a rudimentary Greek font with accents; however, I also see many tiny square blocks indicating (I think) that the code points do *not* have a corresponding glyph in my unicode font for that Greek symbol (whatever it is supposed to be). The point, or question, is: how does one go about making sure that there is a corresponding font glyph to match a specific unicode code point for display in a particular terminal (editor, browser, whatever)?

The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great!) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware of? Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? 
kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 5/13/11 2:53 PM, harrismh777 wrote: The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great !) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware? There are some well-known fonts that try to cover a large section of the Unicode standard. http://en.wikipedia.org/wiki/Unicode_typeface Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? Not really. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 5/13/2011 3:53 PM, harrismh777 wrote: The unicode consortium is very careful to make sure that thousands of symbols have a unique code point (that's great !) but how do these thousands of symbols actually get displayed if there is no font consortium? Are there collections of 'standard' fonts for unicode that I am not aware? Is there a unix linux package that can be installed that drops at least 'one' default standard font that will be able to render all or 'most' (whatever I mean by that) code points in unicode? Is this a Python issue at all? Easy, practical use of unicode is still a work in progress. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
John Machin wrote: On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote: If the file you're writing to doesn't specify an encoding, Python will default to locale.getdefaultencoding(), No such attribute. Perhaps you mean locale.getpreferredencoding()

>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'

Yes! :) -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
Ben Finney wrote: I'd phrase that as:

* Text is a sequence of characters. Most inputs to the program, including files, sockets, etc., contain a sequence of bytes.
* Always know whether you're dealing with text or with bytes. No object can be both.
* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is the type for text.
* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a sequence of bytes.

That is very helpful... thanks MRAB, Steve, John, Terry, Ben F, Ben K, Ian... thank you guys so much, I think I've got a better picture now of what is going on... this is also one place where I don't think the books are as clear as they need to be, at least for me... (Lutz, Summerfield).

So, UTF-16 / UTF-32 is INTERNAL only, for Python... and text in/out is based on locale... in my case UTF-8... that is enormously helpful for me... understanding locale on this system is as mystifying as unicode is in the first place. Well, after reading about unicode tonight (about four hours) I realize that it's not really that hard... there are just a lot of details that have to come together. Straightening out that whole tower-of-babel thing is sure a pain in the butt. I also was not aware that UTF-8 chars could be up to six (6) bytes long from left to right. I see now that the little-endianness I was ascribing to python is just a function of hexdump... and I was a little disappointed to find that hexdump does not support UTF-8, just ascii... doh.

Anyway, thanks again... I've got enough now to play around a bit...

PS thanks Steve for that link, informative and entertaining too... Joe says, If you are a programmer . . . and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will. :)

kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
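[Ben's type split can be sketched in a few lines of Python 3 (an illustrative sketch, not from the original post):]

```python
text = "héllo"                    # str: a sequence of code points (text)
data = text.encode("utf-8")       # bytes: a sequence of small integers

# Round-tripping always names the encoding explicitly in Python 3.
back = data.decode("utf-8")
```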
Re: unicode by default
Terry Reedy wrote: It does not matter how Python stored the unicode internally. Does this help? Your intent is signalled by how you open the file. Very much, actually, thanks. I was missing the 'internal' piece, and did not realize that if I didn't specify the encoding on the open that python would pull the default encoding from locale... kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 4:31 pm, harrismh777 wrote: So, the UTF-16 UTF-32 is INTERNAL only, for Python

NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8, are encodings for the EXTERNAL representation of Unicode characters in byte streams.

I also was not aware that UTF-8 chars could be up to six (6) bytes long from left to right.

It could be, once upon a time in ISO faerieland, when it was thought that Unicode could grow to 2**32 codepoints. However ISO and the Unicode consortium have agreed that 17 planes is the utter max, and accordingly a valid UTF-8 byte sequence can be no longer than 4 bytes ... see below

>>> chr(17 * 65536)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)
>>> chr(17 * 65536 - 1)
'\U0010ffff'
>>> _.encode('utf8')
b'\xf4\x8f\xbf\xbf'
>>> b'\xf5\x8f\xbf\xbf'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python32\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf5 in position 0: invalid start byte

-- http://mail.python.org/mailman/listinfo/python-list
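[The 4-byte ceiling follows from the 17-plane limit (maximum code point U+10FFFF). A small sketch of UTF-8 lengths by code point range:]

```python
# UTF-8 encoded length depends on the code point range:
#   U+0000..U+007F    -> 1 byte
#   U+0080..U+07FF    -> 2 bytes
#   U+0800..U+FFFF    -> 3 bytes
#   U+10000..U+10FFFF -> 4 bytes (the maximum)
samples = {0x41: 1, 0x404: 2, 0x20AC: 3, 0x10FFFF: 4}
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}
```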
Re: unicode by default
John Machin wrote: On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote: If the file you're writing to doesn't specify an encoding, Python will default to locale.getdefaultencoding(), No such attribute. Perhaps you mean locale.getpreferredencoding()

What about sys.getfilesystemencoding()? And if distributing a program, how can one guess which encoding the user will have?

-- goto /dev/null -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 at 1:58 AM, John Machin sjmac...@lexicon.net wrote: On Thu, May 12, 2011 4:31 pm, harrismh777 wrote: So, the UTF-16 UTF-32 is INTERNAL only, for Python NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are encodings for the EXTERNAL representation of Unicode characters in byte streams. Right. *Under the hood* Python uses UCS-2 (which is not exactly the same thing as UTF-16, by the way) to represent Unicode strings. However, this is entirely transparent. To the Python programmer, a unicode string is just an abstraction of a sequence of code-points. You don't need to think about UCS-2 at all. The only times you need to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters, or opening a stream in text mode; and in those cases the only encoding that matters is the external one. -- http://mail.python.org/mailman/listinfo/python-list
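[The "boundaries only" rule above, sketched concretely: decode once on the way in, work entirely on str, encode once on the way out.]

```python
# Bytes as they might arrive from a file or socket (input boundary).
raw = b"caf\xc3\xa9"

text = raw.decode("utf-8")     # -> an abstract sequence of code points
result = text.upper()          # all processing happens on str
out = result.encode("utf-8")   # -> bytes again, only at the output boundary
```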
Re: unicode by default
On 5/12/2011 12:17 PM, Ian Kelly wrote: On Thu, May 12, 2011 at 1:58 AM, John Machinsjmac...@lexicon.net wrote: On Thu, May 12, 2011 4:31 pm, harrismh777 wrote: So, the UTF-16 UTF-32 is INTERNAL only, for Python NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are encodings for the EXTERNAL representation of Unicode characters in byte streams. Right. *Under the hood* Python uses UCS-2 (which is not exactly the same thing as UTF-16, by the way) to represent Unicode strings. I know some people say that, but according to the definitions of the unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The standard considers 'UCS-2' obsolete long ago. See https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2 or http://www.unicode.org/faq/basic_q.html#14 The latter says: Q: What is the difference between UCS-2 and UTF-16? A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided. It goes on: Sometimes in the past an implementation has been labeled UCS-2 to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters. I know that 16-bit Python *does* use surrogate pairs for supplementary chars and at least some properties work for them. I am not sure exactly what the rest means. However, this is entirely transparent. To the Python programmer, a unicode string is just an abstraction of a sequence of code-points. You don't need to think about UCS-2 at all. 
The only times you need to worry about encodings are when you're encoding unicode characters to byte strings, or decoding bytes to unicode characters, or opening a stream in text mode; and in those cases the only encoding that matters is the external one. If one uses unicode chars in the Supplementary Planes above the BMP (the first 2**16), which require surrogate pairs for 16 bit unicode (UTF-16), then the abstraction leaks. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
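[The surrogate-pair mechanics Terry describes can be checked directly. On a modern Python 3 build (wide or flexible string representation), a supplementary-plane character is one code point but two UTF-16 code units:]

```python
ch = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, beyond the BMP
cp = ord(ch)

# UTF-16 represents it as a surrogate pair: two 16-bit code units.
units = len(ch.encode("utf-16-le")) // 2

# The pair can be computed by hand from the code point.
hi = 0xD800 + ((cp - 0x10000) >> 10)
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
```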
Re: unicode by default
On Thu, May 12, 2011 at 2:42 PM, Terry Reedy tjre...@udel.edu wrote: On 5/12/2011 12:17 PM, Ian Kelly wrote: Right. *Under the hood* Python uses UCS-2 (which is not exactly the same thing as UTF-16, by the way) to represent Unicode strings. I know some people say that, but according to the definitions of the unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The standard considers 'UCS-2' obsolete long ago. See https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2 or http://www.unicode.org/faq/basic_q.html#14 At the first link, in the section _Use in major operating systems and environments_ it states, The Python language environment officially only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to Unicode produces correct UTF-16. Python can be compiled to use UCS-4 (UTF-32) but this is commonly only done on Unix systems. PEP 100 says: The internal format for Unicode objects should use a Python specific fixed format PythonUnicode implemented as 'unsigned short' (or another unsigned numeric type having 16 bits). Byte order is platform dependent. This format will hold UTF-16 encodings of the corresponding Unicode ordinals. The Python Unicode implementation will address these values as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all currently defined Unicode character points. UTF-16 without surrogates provides access to about 64k characters and covers all characters in the Basic Multilingual Plane (BMP) of Unicode. It is the Codec's responsibility to ensure that the data they pass to the Unicode object constructor respects this assumption. The constructor does not check the data for Unicode compliance or use of surrogates. 
I'm getting out of my depth here, but that implies to me that while Python stores UTF-16 and can correctly encode/decode it to UTF-8, other codecs might only work correctly with UCS-2, and the unicode class itself ignores surrogate pairs. Although I'm not sure how much this might have changed since the original implementation, especially for Python 3. -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Wed, May 11, 2011 at 3:37 PM, harrismh777 harrismh...@charter.net wrote: hi folks, I am puzzled by unicode generally, and within the context of python specifically. For one thing, what do we mean that unicode is used in python 3.x by default. (I know what default means, I mean, what changed?) The `unicode' class was renamed to `str', and a stripped-down version of the 2.X `str' class was renamed to `bytes'. I think part of my problem is that I'm spoiled (American, ascii heritage) and have been either stuck in ascii knowingly, or UTF-8 without knowing (just because the code points lined up). I am confused by the implications for using 3.x, because I am reading that there are significant things to be aware of... what? Mainly Python 3 no longer does explicit conversion between bytes and unicode, requiring the programmer to be explicit about such conversions. If you have Python 2 code that is sloppy about this, you may get some Unicode encode/decode errors when trying to run the same code in Python 3. The 2to3 tool can help somewhat with this, but it can't prevent all problems. On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the default compile option for 2.7 3.2 (I didn't change anything) is set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly? I think that UCS-2 has always been the default unicode width for CPython, although the exact representation used internally is an implementation detail. The books say that the .py sources are UTF-8 by default... and that 3.x is either UCS-2 or UCS-4. If I use the file handling capabilities of Python in 3.x (by default) what encoding will be used, and how will that affect the output? If you open a file in binary mode, the result is a non-decoded byte stream. 
If you open a file in text mode and do not specify an encoding, then the result of locale.getpreferredencoding() is used for decoding, and the result is a unicode stream. If I do not specify any code points above ascii 0xFF does any of this matter anyway? You mean 0x7F, and probably, due to the need to explicitly encode and decode. -- http://mail.python.org/mailman/listinfo/python-list
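[A sketch of the text-mode default versus an explicit encoding (the file here is a throwaway temp file, an assumption for the example):]

```python
import locale
import os
import tempfile

# With no encoding argument, text mode falls back to the locale's
# preferred encoding on this interpreter.
default = locale.getpreferredencoding(False)

# Being explicit removes any dependence on the user's locale.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00A3")            # POUND SIGN
with open(path, "rb") as f:      # binary mode: the raw, undecoded bytes
    raw = f.read()
os.remove(path)
```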
Re: unicode by default
On Wed, May 11, 2011 at 2:37 PM, harrismh777 harrismh...@charter.net wrote: hi folks, I am puzzled by unicode generally, and within the context of python specifically. For one thing, what do we mean that unicode is used in python 3.x by default. (I know what default means, I mean, what changed?) I think part of my problem is that I'm spoiled (American, ascii heritage) and have been either stuck in ascii knowingly, or UTF-8 without knowing (just because the code points lined up). I am confused by the implications for using 3.x, because I am reading that there are significant things to be aware of... what? On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7 and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the default compile option for 2.7 3.2 (I didn't change anything) is set for UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly? Not really sure about that, but it doesn't matter anyway. Because even though internally the string is stored as either a UCS-2 or a UCS-4 string, you never see that. You just see this string as a sequence of characters. If you want to turn it into a sequence of bytes, you have to use an encoding. The books say that the .py sources are UTF-8 by default... and that 3.x is either UCS-2 or UCS-4. If I use the file handling capabilities of Python in 3.x (by default) what encoding will be used, and how will that affect the output? If I do not specify any code points above ascii 0xFF does any of this matter anyway? ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then there is a difference for anything over that range. A byte string is a sequence of bytes. A unicode string is a sequence of these mythical abstractions called characters. So a unicode string u'\u00a0' will have a length of 1. 
Encode that to UTF-8 and you'll find it has a length of 2 (because UTF-8 uses two or more bytes to encode everything above 0x7F; the top bit is used to signal that more bytes follow for this character). If you want the history behind the whole encoding mess, Joel Spolsky wrote a rather amusing article explaining how this all came about: http://www.joelonsoftware.com/articles/Unicode.html And the biggest reason to use Unicode is so that you don't have to worry about your program messing up because someone hands you input in a different encoding than you used. -- http://mail.python.org/mailman/listinfo/python-list
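[That length difference is easy to verify:]

```python
s = "\u00a0"               # NO-BREAK SPACE: one character, one code point
b = s.encode("utf-8")      # ...but two bytes once encoded as UTF-8
```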
Re: unicode by default
Ian Kelly wrote: Ian, Benjamin, thanks much. The `unicode' class was renamed to `str', and a stripped-down version of the 2.X `str' class was renamed to `bytes'. ... thank you, this is very helpful. If I do not specify any code points above ascii 0xFF does any of this matter anyway? You mean 0x7F, and probably, due to the need to explicitly encode and decode. Yes, actually, I did... and from Benjamin's reply it seems that this matters only if I am working with bytes. Is it true that if I am working without using bytes sequences that I will not need to care about the encoding anyway, unless of course I need to specify a unicode code point? Thanks again. kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 8:51 am, harrismh777 wrote: Is it true that if I am working without using bytes sequences that I will not need to care about the encoding anyway, unless of course I need to specify a unicode code point? Quite the contrary. (1) You cannot work without using bytes sequences. Files are byte sequences. Web communication is in bytes. You need to (know / assume / be able to extract / guess) the input encoding. You need to encode your output using an encoding that is expected by the consumer (or use an output method that will do it for you). (2) You don't need to use bytes to specify a Unicode code point. Just use an escape sequence e.g. \u0404 is a Cyrillic character. -- http://mail.python.org/mailman/listinfo/python-list
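[Point (2) in action: the escape names the code point directly, and no bytes appear until you explicitly encode:]

```python
import unicodedata

ch = "\u0404"                   # a Cyrillic character, named by code point
cp = ord(ch)
name = unicodedata.name(ch)     # its standard Unicode name
encoded = ch.encode("utf-8")    # bytes only appear on request
```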
Re: unicode by default
John Machin wrote: (1) You cannot work without using bytes sequences. Files are byte sequences. Web communication is in bytes. You need to (know / assume / be able to extract / guess) the input encoding. You need to encode your output using an encoding that is expected by the consumer (or use an output method that will do it for you). (2) You don't need to use bytes to specify a Unicode code point. Just use an escape sequence e.g. \u0404 is a Cyrillic character. Thanks John. In reverse order, I understand point (2). I'm less clear on point (1). If I generate a string of characters that I presume to be ascii/utf-8 (no \u0404 type characters) and write them to a file (stdout) how does default encoding affect that file... by default? I'm not seeing that there is anything unusual going on... If I open the file with vi? If I open the file with gedit? emacs? Another question... in mail I'm receiving many small blocks that look like sprites with four small hex codes, scattered about the mail... mostly punctuation, maybe? ... guessing, are these unicode code points, and if so what is the best way to 'guess' the encoding? ... is it coded in the stream somewhere... protocol? thanks -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On 12/05/2011 02:22, harrismh777 wrote: John Machin wrote: (1) You cannot work without using bytes sequences. Files are byte sequences. Web communication is in bytes. You need to (know / assume / be able to extract / guess) the input encoding. You need to encode your output using an encoding that is expected by the consumer (or use an output method that will do it for you). (2) You don't need to use bytes to specify a Unicode code point. Just use an escape sequence e.g. \u0404 is a Cyrillic character. Thanks John. In reverse order, I understand point (2). I'm less clear on point (1). If I generate a string of characters that I presume to be ascii/utf-8 (no \u0404 type characters) and write them to a file (stdout) how does default encoding affect that file... by default? I'm not seeing that there is anything unusual going on... If I open the file with vi? If I open the file with gedit? emacs? Another question... in mail I'm receiving many small blocks that look like sprites with four small hex codes, scattered about the mail... mostly punctuation, maybe? ... guessing, are these unicode code points, and if so what is the best way to 'guess' the encoding? ... is it coded in the stream somewhere... protocol?

You need to understand the difference between characters and bytes. A string contains characters, a file contains bytes. The encoding specifies how a character is represented as bytes. For example:

In the Latin-1 encoding, the character £ is represented by the byte 0xA3.
In the UTF-8 encoding, the character £ is represented by the byte sequence 0xC2 0xA3.
In the ASCII encoding, the character £ can't be represented at all.

The advantage of UTF-8 is that it can represent _all_ Unicode characters (codepoints, actually) as byte sequences, and all those in the ASCII range are represented by the same single bytes which the original ASCII system used. Use the UTF-8 encoding unless you have to use a different one. A file contains only bytes, a socket handles only bytes. 
Which encoding you should use for characters is down to protocol. A system such as email, which can handle different encodings, should have a way of specifying the encoding, and perhaps also a default encoding. -- http://mail.python.org/mailman/listinfo/python-list
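[The £ example above, spelled out in Python 3: one byte in Latin-1, two bytes in UTF-8, and not representable at all in ASCII.]

```python
pound = "\u00A3"                      # £, POUND SIGN

latin1 = pound.encode("latin-1")      # one byte: 0xA3
utf8 = pound.encode("utf-8")          # two bytes: 0xC2 0xA3

try:
    pound.encode("ascii")             # ASCII simply can't represent it
    representable = True
except UnicodeEncodeError:
    representable = False
```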
Re: unicode by default
On Thu, 12 May 2011 03:31:18 +0100, MRAB wrote: Another question... in mail I'm receiving many small blocks that look like sprites with four small hex codes, scattered about the mail... mostly punctuation, maybe? ... guessing, are these unicode code points, and if so what is the best way to 'guess' the encoding? ... is it coded in the stream somewhere...protocol? You need to understand the difference between characters and bytes. http://www.joelonsoftware.com/articles/Unicode.html is also a good resource. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
Steven D'Aprano wrote: You need to understand the difference between characters and bytes. http://www.joelonsoftware.com/articles/Unicode.html is also a good resource.

Thanks for being patient guys, here's what I've done:

>>> astr = "pound sign "
>>> asym = "\u00A3"
>>> afile = open("myfile", mode='w')
>>> afile.write(astr + asym)
12
>>> afile.close()

When I edit myfile with vi I see the 'characters': pound sign £ ... same with emacs, same with gedit ... When I hexdump myfile I see this:

0000000 6f70 6e75 2064 6973 6e67 c220 00a3

This is *not* what I expected... well it is (little-endian) right up to the 'c2' and that is what is confusing me. I did not open the file with an encoding of UTF-8... so I'm assuming UTF-16 by default (python3), so I was expecting a '00A3' little-endian as 'A300', but what I got instead was UTF-8 little-endian 'c2a3'. See my problem? ... when I open the file with emacs I see the character pound sign... same with gedit... they're all using UTF-8 by default. By default it looks like Python3 is writing output with UTF-8 as default... and I thought that by default Python3 was using either UTF-16 or UTF-32. So, I'm confused here... also, I used the character sequence \u00A3 which I thought was UTF-16... but Python3 changed my intent to 'c2a3' which is the normal UTF-8...

Thanks again for your patience... I really do hate to be dense about this... but this is another area where I'm just beginning to dabble and I'd like to know soon what I'm doing...

Thanks for the link Steve... I'm headed there now...

kind regards, m harris -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 11:22 am, harrismh777 wrote:

> John Machin wrote:
>> (1) You cannot work without using byte sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be able to extract / guess) the input encoding. You need to encode
>> your output using an encoding that is expected by the consumer (or use
>> an output method that will do it for you).
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use an escape sequence, e.g. \u0404 is a Cyrillic character.
>
> Thanks John. In reverse order, I understand point (2). I'm less clear
> on point (1). If I generate a string of characters that I presume to
> be ascii/utf-8 (no \u0404 type characters) and write them to a file
> (stdout), how does the default encoding affect that file, by default?
> I'm not seeing that there is anything unusual going on...

About "characters that I presume to be ascii/utf-8 (no \u0404 type characters)": all Unicode characters (including U+0404) are encodable in bytes using UTF-8.

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends mostly on sys.stdout.encoding. This is likely to be UTF-8 on a Linux/OSX platform. On a typical American / Western European / [former] colonies Windows box, this is likely to be cp850 in a Command Prompt window, and cp1252 in IDLE.

* UTF-8: all Unicode characters are encodable in UTF-8. The only problem arises if the terminal can't render the character -- you'll get spaces or blobs or boxes with hex digits in them, or nothing.
* Windows (Command Prompt window): only a small subset of characters can be encoded in e.g. cp850; anything else causes an exception.
* Windows (IDLE): ignores sys.stdout.encoding and renders the characters itself. Same outcome as *x/UTF-8 above.

If you write directly (or sys.stdout is redirected) to a FILE, the default encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless the machine's site.py has been fiddled with to make it UTF-8 or something else.
> If I open the file with vi? If I open the file with gedit? emacs?

Any editor will have a default encoding; if that doesn't match the file encoding, you have a (hopefully obvious) problem if the editor doesn't detect the mismatch. Consult your editor's docs or HTFF1K.

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code points,

Yes.

> and if so what is the best way to 'guess' the encoding?

google(chardet), or rummage through the mail headers (but 4 hex digits in a box are a symptom of inability to render, not necessarily caused by an incorrect decoding).

> ... is it coded in the stream somewhere... protocol?

Should be.

-- http://mail.python.org/mailman/listinfo/python-list
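The two defaults John describes can be inspected directly. The sketch below is in Python 3 terms (where the default for text-mode `open()` comes from the locale rather than `sys.getdefaultencoding()`); the file name is arbitrary:

```python
import locale
import os
import sys
import tempfile

# Encoding Python would use for this terminal (may be None when
# stdout is redirected to a file or pipe):
print(sys.stdout.encoding)

# Encoding used by text-mode open() when you pass no encoding= argument:
print(locale.getpreferredencoding(False))

# Passing encoding= explicitly removes all the guesswork:
path = os.path.join(tempfile.mkdtemp(), "pound.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00A3")          # POUND SIGN, one character

# Re-reading the file in binary mode shows what actually hit the disk:
with open(path, "rb") as f:
    raw = f.read()
print(raw)                     # b'\xc2\xa3' -- UTF-8 spends two bytes on U+00A3
```

Being explicit about `encoding=` in every `open()` call sidesteps the whole "which default applies here?" question.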
Re: unicode by default
MRAB pyt...@mrabarnett.plus.com writes:

> You need to understand the difference between characters and bytes.

Yep. Those who don't need to join us in the third millennium, and the resources pointed out in this thread are good to help that.

> A string contains characters, a file contains bytes.

That's not true for Python 2. I'd phrase that as:

* Text is a sequence of characters. Most inputs to the program, including files, sockets, etc., contain a sequence of bytes.
* Always know whether you're dealing with text or with bytes. No object can be both.
* In Python 2, ‘str’ is the type for a sequence of bytes. ‘unicode’ is the type for text.
* In Python 3, ‘str’ is the type for text. ‘bytes’ is the type for a sequence of bytes.

-- \ "I went to a garage sale. 'How much for the garage?' 'It's not for sale.'" --Steven Wright
_o__) Ben Finney
-- http://mail.python.org/mailman/listinfo/python-list
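The last two bullets can be checked directly in Python 3; the sample string below is arbitrary, chosen to include John Machin's \u0404 example character:

```python
text = "Ukrainian \u0404"        # str: a sequence of characters (code points)
data = text.encode("utf-8")      # bytes: a sequence of byte values

assert isinstance(text, str) and isinstance(data, bytes)

# Lengths differ because the units differ: characters vs. bytes.
assert len(text) == 11           # 10 ASCII characters plus U+0404
assert len(data) == 12           # U+0404 occupies two bytes in UTF-8

# decode() is the inverse of encode() for a matching codec:
assert data.decode("utf-8") == text
```

In Python 3 no object is both: `text == data` is simply False, so mixing them up fails fast instead of silently.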
Re: unicode by default
On 5/11/2011 11:44 PM, harrismh777 wrote:

> [... same example as above: afile.write(astr + asym) in text mode
> produced the UTF-8 bytes 'c2a3', not the expected UTF-16 ...]

If you open a file as binary (bytes), you must write bytes, and they are stored without transformation. If you open in text mode, you must write text (strings are unicode in 3.2) and Python will encode to bytes using either some default or the encoding you specified in the open statement. It does not matter how Python stored the unicode internally.

Does this help? Your intent is signalled by how you open the file.

-- Terry Jan Reedy
-- http://mail.python.org/mailman/listinfo/python-list
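Terry's text-mode/binary-mode distinction can be demonstrated without touching the filesystem, using in-memory streams (a sketch; `io.TextIOWrapper` is what text-mode `open()` layers over the raw byte stream in Python 3):

```python
import io

# Text mode: you write str, and the wrapper encodes on the way out.
buf = io.BytesIO()
text_stream = io.TextIOWrapper(buf, encoding="utf-8")
text_stream.write("\u00A3")
text_stream.flush()
assert buf.getvalue() == b"\xc2\xa3"       # the encoding chose these bytes

# Binary mode: no transformation, so you must hand over bytes yourself.
raw = io.BytesIO()
raw.write("\u00A3".encode("utf-16-le"))    # the 'A3 00' the poster expected
assert raw.getvalue() == b"\xa3\x00"
```

The same character becomes different bytes depending solely on the encoding applied at the boundary, which is exactly why "how Python stored the unicode internally" never shows up in the file.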
Re: unicode by default
On Thu, May 12, 2011 1:44 pm, harrismh777 wrote:

> By default it looks like Python3 is writing output with UTF-8 as
> default... and I thought that by default Python3 was using either
> UTF-16 or UTF-32. So, I'm confused here... also, I used the character
> sequence \u00A3 which I thought was UTF-16... but Python3 changed my
> intent to 'c2a3' which is the normal UTF-8...

Python uses either a 16-bit or a 32-bit INTERNAL representation of Unicode code points. Those NN bits have nothing to do with the UTF-NN encodings, which can be used to encode the codepoints as byte sequences for EXTERNAL purposes. In your case, UTF-8 has been used as it is the default encoding on your platform.

-- http://mail.python.org/mailman/listinfo/python-list
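The internal/external split is easy to see: one code point, three different external byte sequences depending only on which codec you ask for (Python 3 syntax):

```python
ch = "\u00A3"   # POUND SIGN -- a single code point, regardless of storage

# The same character serialized under three UTF-NN encodings:
assert ch.encode("utf-8") == b"\xc2\xa3"              # what the poster got
assert ch.encode("utf-16-le") == b"\xa3\x00"          # what he expected
assert ch.encode("utf-32-le") == b"\xa3\x00\x00\x00"

# Each round-trips back to the identical string:
assert b"\xc2\xa3".decode("utf-8") == ch
```

`\u00A3` in source code names the code point; it carries no commitment to any of these byte layouts.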
Re: unicode by default
On Wed, May 11, 2011 at 8:44 PM, harrismh777 harrismh...@charter.net wrote:

> [... same example snipped ...] so I was expecting a '00A3'
> little-endian as 'A300', but what I got instead was UTF-8
> little-endian 'c2a3'

Quick note here: UTF-8 doesn't have an endianness. It's always read from left to right, with the high bit telling you whether you need to continue or not.

> See my problem? ... when I open the file with emacs I see the pound
> sign character... same with gedit... they're all using UTF-8 by
> default. By default it looks like Python3 is writing output with UTF-8
> as default... and I thought that by default Python3 was using either
> UTF-16 or UTF-32.

The fact that CPython uses UCS-2 or UCS-4 internally is an implementation detail and isn't actually part of the Python specification. As far as a Python program is concerned, a Unicode string is a list of character objects, not bytes. Much like any other object, a unicode character needs to be serialized before it can be written to a file. An encoding is a serialization function for characters.
If the file you're writing to doesn't specify an encoding, Python will default to locale.getdefaultencoding(), which tries to get your system's preferred encoding from environment variables (in other words, the same source that emacs and gedit will use to get the default encoding). -- http://mail.python.org/mailman/listinfo/python-list
Re: unicode by default
On Thu, May 12, 2011 2:14 pm, Benjamin Kaplan wrote:

> If the file you're writing to doesn't specify an encoding, Python will
> default to locale.getdefaultencoding(),

No such attribute. Perhaps you mean locale.getpreferredencoding().

-- http://mail.python.org/mailman/listinfo/python-list
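A quick sanity check of the corrected name (the returned value is platform-dependent, so the sketch only verifies that the function exists and names a real codec):

```python
import codecs
import locale

# The function Benjamin meant: the locale-derived default used by
# text-mode open() when no encoding= is given.
enc = locale.getpreferredencoding(False)   # False: don't re-query the locale
print(enc)                                 # e.g. 'UTF-8' on a typical Linux box

# Whatever it returns must be a codec Python can actually look up:
codecs.lookup(enc)

# locale has no getdefaultencoding; that name lives (in Python 2/3) on sys:
assert not hasattr(locale, "getdefaultencoding")
```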
Re: Unicode again ... default codec ...
En Fri, 30 Oct 2009 13:40:14 -0300, zooko zoo...@gmail.com escribió:

> On Oct 20, 9:50 pm, Gabriel Genellina gagsl-...@yahoo.com.ar wrote:
>> DON'T do that. Really. Changing the default encoding is a horrible,
>> horrible hack and causes a lot of problems.
>
> I'm not convinced. I've read all of the posts and web pages and blog
> entries decrying this practice over the last several years, but as far
> as I can tell the actual harm that can result is limited (as long as
> you set it to utf-8) and the practical benefits are substantial. This
> is a pattern that I have no problem using:
>
>     import sys
>     reload(sys)
>     sys.setdefaultencoding("utf-8")
>
> The reason this doesn't cause too much harm is that anything that
> would have worked with the original default encoding ('ascii') will
> also work with the new utf-8 default encoding.

Wrong. Dictionaries may start behaving incorrectly, for example. Normally, two keys that compare equal cannot coexist in the same dictionary:

    >>> 1 == 1.0
    True
    >>> d = {}
    >>> d[1] = '*'
    >>> d[1.0]
    '*'
    >>> d[1.0] = '$'
    >>> d
    {1: '$'}

1 and 1.0 are the same key, as far as the dictionary is concerned. For this to work, both keys must have the same hash:

    >>> hash(1) == hash(1.0)
    True

Now, let's set the default encoding to utf-8:

    >>> import sys
    >>> reload(sys)
    <module 'sys' (built-in)>
    >>> sys.setdefaultencoding('utf-8')
    >>> x = u'á'
    >>> y = u'á'.encode('utf-8')
    >>> x
    u'\xe1'
    >>> y
    '\xc3\xa1'

(same as y = 'á' if the source encoding is set to utf-8, but I don't want to depend on that). Just to be sure we're dealing with the right character:

    >>> import unicodedata
    >>> unicodedata.name(x)
    'LATIN SMALL LETTER A WITH ACUTE'
    >>> unicodedata.name(y.decode('utf-8'))
    'LATIN SMALL LETTER A WITH ACUTE'

Now, we can see that both x and y are equal:

    >>> x == y
    True

x is an accented a, y is the same thing encoded using the default encoding, both are equal. Fine. Now create a dictionary:

    >>> d = {}
    >>> d[x] = '*'
    >>> d[x]
    '*'
    >>> x in d
    True
    >>> y in d
    False    # ???
    >>> d[y] = 2
    >>> d
    {u'\xe1': '*', '\xc3\xa1': 2}

Since x == y, one should expect a single entry in the dictionary -- but we got two. That's because:

    >>> x == y
    True
    >>> hash(x) == hash(y)
    False

and this must *not* happen according to http://docs.python.org/reference/datamodel.html#object.__hash__ : "The only required property is that objects which compare equal have the same hash value."

Considering that dictionaries in Python are used almost everywhere, breaking this basic assumption is a really bad problem. Of course, all of this applies to Python 2.x; in Python 3.0 the problem was solved differently: strings are unicode by default, and the default encoding IS utf-8.

> As far as I've seen from the aforementioned mailing list threads and
> blog posts and so on, the worst thing that has ever happened as a
> result of this technique is that something works for you but fails for
> someone else who doesn't have this stanza.
> (http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/)
> That's bad, but probably just including this stanza at the top of the
> file that you are sharing with that other person instead of doing it
> in a sitecustomize.py file will avoid that problem.

And then you break all other libraries that the program is using, including the Python standard library, because the default encoding is a global setting. What if another library decides to use latin-1 as the default encoding, using the same trick? Latest one wins...

You said "the practical benefits are substantial" but I, for myself, cannot see any benefit. Perhaps if you post your real problems, someone can find the solution. The right way is to fix your program to do the right thing, not to hide the bugs under the rug.

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
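Gabriel's closing remark that Python 3 "solved the problem differently" can be verified: text and bytes never compare equal there, so the equal-but-differently-hashed key pair from his Python 2 session cannot arise (a sketch using the same á character):

```python
x = "\xe1"                      # 'á' as text (str)
y = "\xe1".encode("utf-8")      # b'\xc3\xa1' as bytes

# In Python 3 str and bytes are never equal, so no implicit decoding
# can make two differently-hashed objects compare equal:
assert x != y

# They therefore coexist in a dict as two honest, distinct keys:
d = {x: "text", y: "bytes"}
assert len(d) == 2
assert d[x] == "text" and d[y] == "bytes"
```

The hash invariant (equal objects must hash equal) is preserved trivially, because the cross-type equality that broke it is gone.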
Re: Unicode again ... default codec ...
Gabriel Genellina gagsl-...@yahoo.com.ar writes:

> En Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax l...@metapensiero.it escribió:
>
> nosetest should do nothing special. You should configure the
> environment so Python *knows* that your console understands utf-8.
> Once Python is aware of the *real* encoding your console is using,
> sys.stdout.encoding will be utf-8 automatically and your problem is
> solved. I don't know how to do that within virtualenv, but the answer
> certainly does NOT involve sys.setdefaultencoding()
>
> On Windows, a normal console window on my system uses cp850:
>
>     D:\USERDATA\Gabriel> chcp
>     Tabla de códigos activa: 850
>
>     D:\USERDATA\Gabriel> python
>     Python 2.6.3 (r263rc1:75186, Oct 2 2009, 20:40:30) [MSC v.1500 32 bit (Intel)] on win32
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> import sys
>     >>> sys.getdefaultencoding()
>     'ascii'
>     >>> sys.stdout.encoding
>     'cp850'
>     >>> u = u"áñç"
>     >>> print u
>     áñç

This is the same on my virtualenv:

    $ python -c "import sys; print sys.getdefaultencoding(), sys.stdout.encoding"
    ascii UTF-8
    $ python -c "print u'\xe1\xf1\xe7'"
    áñç

But look at this:

    $ cat test.py
    # -*- coding: utf-8 -*-
    class TestAccents(object):
        u'\xe1\xf1\xe7'

        def test_simple(self):
            u'cioè'
            pass

    $ nosetests test.py
    .
    ----------------------------------------------------------------------
    Ran 1 test in 0.002s

    OK

    $ nosetests -v test.py
    ERROR
    ======================================================================
    Traceback (most recent call last):
      File "/tmp/env/bin/nosetests", line 8, in <module>
        load_entry_point('nose==0.11.1', 'console_scripts', 'nosetests')()
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 113, in __init__
        argv=argv, testRunner=testRunner, testLoader=testLoader)
      File "/usr/lib/python2.6/unittest.py", line 817, in __init__
        self.runTests()
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 192, in runTests
        result = self.testRunner.run(self.test)
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/core.py", line 63, in run
        result.printErrors()
      File "/tmp/env/lib/python2.6/site-packages/nose-0.11.1-py2.6.egg/nose/result.py", line 81, in printErrors
        _TextTestResult.printErrors(self)
      File "/usr/lib/python2.6/unittest.py", line 724, in printErrors
        self.printErrorList('ERROR', self.errors)
      File "/usr/lib/python2.6/unittest.py", line 730, in printErrorList
        self.stream.writeln("%s: %s" % (flavour, self.getDescription(test)))
      File "/usr/lib/python2.6/unittest.py", line 665, in writeln
        if arg: self.write(arg)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 10: ordinal not in range(128)

Who is the culprit here? The fact is, encodings are the real Y2k problem, and they are here to stay for a while!

thank you, ciao, lele.
-- nickname: Lele Gaifax | Quando vivrò di quello che ho pensato ieri
real: Emanuele Gaifas | comincerò ad aver paura di chi mi copia.
l...@nautilus.homeip.net | -- Fortunato Depero, 1929.
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
En Thu, 22 Oct 2009 05:25:16 -0300, Lele Gaifax l...@metapensiero.it escribió:

> Gabriel Genellina gagsl-...@yahoo.com.ar writes:
>> nosetest should do nothing special. You should configure the
>> environment so Python *knows* that your console understands utf-8.
>
> This is the same on my virtualenv:
>
>     $ python -c "import sys; print sys.getdefaultencoding(), sys.stdout.encoding"
>     ascii UTF-8
>     $ python -c "print u'\xe1\xf1\xe7'"
>     áñç

Good, so stdout's encoding isn't really the problem.

> But look at this:
>
>     File "/usr/lib/python2.6/unittest.py", line 730, in printErrorList
>       self.stream.writeln("%s: %s" % (flavour, self.getDescription(test)))
>     File "/usr/lib/python2.6/unittest.py", line 665, in writeln
>       if arg: self.write(arg)
>     UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 10: ordinal not in range(128)
>
> Who is the culprit here?

unittest, or ultimately, this bug: http://bugs.python.org/issue4947

This is not specific to nosetest; unittest in verbose mode fails in the same way. Fix: add this method to the _WritelnDecorator class in unittest.py (near line 664):

    def write(self, arg):
        if isinstance(arg, unicode):
            arg = arg.encode(self.stream.encoding, "replace")
        self.stream.write(arg)

> The fact is, encodings are the real Y2k problem, and they are here to
> stay for a while!

Ok, but the idea is to solve the problem (or not let it happen in the first place!), not hide it under the rug :)

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
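Gabriel's patch is Python 2 specific, but the idea it relies on, encoding with errors="replace" so un-encodable characters degrade instead of raising, carries straight over to Python 3. A sketch, simulating an ASCII-only terminal with an in-memory stream:

```python
import io

# Simulate a terminal whose stream only understands ASCII.
buf = io.BytesIO()
stream = io.TextIOWrapper(buf, encoding="ascii", errors="replace")

# 'cioè' -- under the default errors='strict' this write would raise
# UnicodeEncodeError, exactly like the nosetests traceback above.
stream.write("cio\xe8")
stream.flush()

print(buf.getvalue())   # b'cio?' -- the è degrades to '?' instead of crashing
```

The trade-off is the same one Gabriel hints at: errors="replace" hides the information loss from the user, which is acceptable for diagnostic output like test names but not for data files.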
Re: Unicode again ... default codec ...
Gabriel Genellina gagsl-...@yahoo.com.ar writes: En Thu, 22 Oct 2009 05:25:16 -0300, Lele Gaifax l...@metapensiero.it escribió: Who is the culprit here? unittest, or ultimately, this bug: http://bugs.python.org/issue4947 Thank you. In particular I found http://bugs.python.org/issue4947#msg87637 as the best fit, I think that may be what's happening here. fix: add this method to the _WritelnDecorator class in unittest.py (near line 664): def write(self, arg): if isinstance(arg, unicode): arg = arg.encode(self.stream.encoding, replace) self.stream.write(arg) Uhm, that's almost as dirty as my reload(), you must admit! :-) bye, lele. -- nickname: Lele Gaifax| Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas| comincerò ad aver paura di chi mi copia. l...@nautilus.homeip.net | -- Fortunato Depero, 1929. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
On Thu, Oct 22, 2009 at 13:59 +0200, Lele Gaifax wrote: Gabriel Genellina gagsl-...@yahoo.com.ar writes: unittest, or ultimately, this bug: http://bugs.python.org/issue4947 http://bugs.python.org/issue4947#msg87637 as the best fit, I think You might also want to have a look at: http://bugs.python.org/issue1293741 I hope this helps and that these bugs will be solved soon. Wolodja signature.asc Description: Digital signature -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
Gabriel Genellina gagsl-...@yahoo.com.ar writes: DON'T do that. Really. Changing the default encoding is a horrible, horrible hack and causes a lot of problems. ... More reasons: http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/ See also this recent thread in python-dev: http://comments.gmane.org/gmane.comp.python.devel/106134 This is a problem that appears quite often, against which I have yet to see a general workaround, or even a safe pattern. I must confess that most often I just give up and change the if 0: line in sitecustomize.py to enable a reasonable default... A week ago I met another incarnation of the problem that I finally solved by reloading the sys module, a very ugly way, don't tell me, and I really would like to know a better way of doing it. The case is simple enough: a unit test started failing miserably, with a really strange traceback, and a quick pdb session revealed that the culprit was nosetest, when it prints out the name of the test, using some variant of print testfunc.__doc__: since the latter happened to be a unicode string containing some accented letters, that piece of nosetest's code raised an encoding error, that went untrapped... I tried to understand the issue, until I found that I was inside a fresh new virtualenv with python 2.6 and the sitecustomize wasn't even there. So, even if my shell environ was UTF-8 (the system being a Ubuntu Jaunty), within that virtualenv Python's stdout encoding was 'ascii'. Rightly so, nosetest failed to encode the accented letters to that. I could just rephrase the test __doc__, or remove it, but to avoid future noise I decided to go with the deprecated reload(sys) trick, done as early as possible... damn, it's just a test suite after all! Is there a correct way of dealing with this? What should nosetest eventually do to initialize it's sys.output.encoding reflecting the system's settings? 
And on the user side, how could I otherwise fix it (I mean, without resorting to the reload())? Thank you, ciao, lele. -- nickname: Lele Gaifax| Quando vivrò di quello che ho pensato ieri real: Emanuele Gaifas| comincerò ad aver paura di chi mi copia. l...@nautilus.homeip.net | -- Fortunato Depero, 1929. -- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
En Wed, 21 Oct 2009 06:24:55 -0300, Lele Gaifax l...@metapensiero.it escribió:

> [... full message quoted above, snipped: a unit test failed inside a
> fresh virtualenv because nosetest printed a unicode __doc__ while
> sys.stdout's encoding was 'ascii' ...]
>
> So, even if my shell environ was UTF-8 (the system being a Ubuntu
> Jaunty), within that virtualenv Python's stdout encoding was 'ascii'.
> Rightly so, nosetest failed to encode the accented letters to that.

That seems to imply that in your normal environment you altered the default encoding to utf-8 -- if so: don't do that!

> I could just rephrase the test __doc__, or remove it, but to avoid
> future noise I decided to go with the deprecated reload(sys) trick,
> done as early as possible... damn, it's just a test suite after all!
> Is there a correct way of dealing with this? What should nosetest
> eventually do to initialize its sys.stdout encoding, reflecting the
> system's settings? And on the user side, how could I otherwise fix it
> (I mean, without resorting to the reload())?

nosetest should do nothing special. You should configure the environment so Python *knows* that your console understands utf-8. Once Python is aware of the *real* encoding your console is using, sys.stdout.encoding will be utf-8 automatically and your problem is solved. I don't know how to do that within virtualenv, but the answer certainly does NOT involve sys.setdefaultencoding().

On Windows, a normal console window on my system uses cp850:

    D:\USERDATA\Gabriel> chcp
    Tabla de códigos activa: 850

    D:\USERDATA\Gabriel> python
    Python 2.6.3 (r263rc1:75186, Oct 2 2009, 20:40:30) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.stdout.encoding
    'cp850'
    >>> u = u"áñç"
    >>> print u
    áñç
    >>> u
    u'\xe1\xf1\xe7'
    >>> u.encode("cp850")
    '\xa0\xa4\x87'
    >>> import unicodedata
    >>> unicodedata.name(u[0])
    'LATIN SMALL LETTER A WITH ACUTE'

I opened another console, changed the code page to 1252 (the one used in Windows applications; `chcp 1252`) and invoked Python again:

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> sys.stdout.encoding
    'cp1252'
    >>> u = u"áñç"
    >>> print u
    áñç
    >>> u
    u'\xe1\xf1\xe7'
    >>> u.encode("cp1252")
    '\xe1\xf1\xe7'
    >>> import unicodedata
    >>> unicodedata.name(u[0])
    'LATIN SMALL LETTER A WITH ACUTE'

As you can see, everything works fine without any need to change the default encoding... Just make sure Python *knows* which encoding is being used in the console on which it runs. On Ubuntu you may need to set the LANG environment variable.

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
Re: Unicode again ... default codec ...
En Tue, 20 Oct 2009 17:13:52 -0300, Stef Mientki stef.mien...@gmail.com escribió:

> From the thread "how to write a unicode string to a file?" and my
> specific situation:
> - reading data from Excel, Delphi and other Windows programs and unicode Python
> - using wxPython, which forces unicode
> - writing to Excel and other Windows programs
> almost all answers directed to the following solution:
> - in the python program, turn every string as soon as possible into unicode
> - in Python all processing is done in unicode
> - at the end, translate unicode into the windows specific character set (if necessary)

Yes. That's the way to go; if you follow the above guidelines when working with character data, you should not encounter big unicode problems.

> The above approach seems to work nicely, but manipulating heavily with
> string-like objects it's a crime. It's impossible to change all my
> modules from strings to unicode at once, and it's very tempting to do
> just the opposite: convert everything into strings!

Wide is the road to hell...

> # adding unicode strings and windows strings results in an error:
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = my_s + my_u

(I guess you meant my_w + my_u). Formally:

    x = my_w.decode('windows-1252') + my_u    # [1]

but why are you using a byte string in the first place? Why not:

    my_w = u'my_w' + u'ö'

so you can compute my_w + my_u directly?

> # to correctly handle the above (in my situation), I need to write the
> # following code (which makes my code quite unreadable):
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = unicode(my_s, 'windows-1252') + my_u
>
> # converting to strings gives much more readable code:
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = my_s + str(my_u)

But it's not the same thing, i.e., in the former case x is a unicode object, in the latter x is a byte string. Also, str(my_u) only works if it contains just ascii characters.
The counterpart of my code [1] above would be:

    x = my_w + my_u.encode('windows-1252')

That is, you use some_unicode_object.encode(desired-encoding) to do the unicode-to-bytestring conversion, and some_string_object.decode(known-encoding) to convert in the opposite sense.

> until I found this website:
> http://diveintopython.org/xml_processing/unicode.html
> By setting the default encoding, I now can go to unicode much more
> elegantly and almost fully automatically (and I guess the writing to a
> file problem is also solved):
>
> # now the manipulations of strings and unicode work OK:
>     my_u = u'my_u'
>     my_w = 'my_w' + chr(246)
>     x = my_s + my_u
>
> The only disadvantage is that you've to put a specially named file into
> the Python directory!! So if someone knows a more elegant way to set
> the default codec, I would be much obliged.

DON'T do that. Really. Changing the default encoding is a horrible, horrible hack and causes a lot of problems. 'Dive into Python' is a great book, but suggesting to alter the default character encoding is very, very bad advice:

- site.py and sitecustomize.py contain *global* settings, affecting *all* users and *all* scripts running on that machine. Other users may get very angry at you when their own programs break or give incorrect results when run with a different encoding.
- you must have administrative rights to alter those files.
- you won't be able to distribute your code, since almost everyone else in the world won't be using *your* default encoding.
- what if another library/package/application wants to set a different default encoding?
- the default encoding for Python >= 3.0 is now 'utf-8' instead of 'ascii'.

More reasons: http://tarekziade.wordpress.com/2008/01/08/syssetdefaultencoding-is-evil/
See also this recent thread in python-dev: http://comments.gmane.org/gmane.comp.python.devel/106134

-- Gabriel Genellina
-- http://mail.python.org/mailman/listinfo/python-list
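Gabriel's encode/decode rule translates directly to Python 3, where the str/bytes split makes the boundary explicit. A sketch reusing Stef's names (my_w, my_u are his illustrative variables; chr(246) as a byte value is 0xF6, which windows-1252 maps to 'ö'):

```python
my_w = b"my_w" + bytes([246])   # a windows-1252 byte string containing 0xF6
my_u = "my_u"                   # text

# bytes -> text: decode with the encoding the bytes are KNOWN to use
x = my_w.decode("windows-1252") + my_u
assert x == "my_w\xf6my_u"      # the 0xF6 byte became the character 'ö'

# text -> bytes: encode with the encoding the consumer EXPECTS
assert x.encode("windows-1252") == b"my_w\xf6my_u"

# In Python 3 the shortcut my_w + my_u simply raises TypeError,
# so the implicit-ascii-decode trap cannot occur.
```

Decode at input, work in text, encode at output: the same "unicode sandwich" the thread recommends, with the language now enforcing the bread.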