On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh...@charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
>   On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
>
Not really sure about that, but it mostly doesn't matter anyway. Even
though the string is stored internally as either UCS-2 or UCS-4, you
rarely see that. You just see the string as a sequence of characters. If
you want to turn it into a sequence of bytes, you have to use an encoding.

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 byte strings, there
is a difference for anything above that range.

A byte string is a sequence of bytes. A unicode string is a sequence of
these mythical abstractions called characters. So the unicode string
u'\u00a0' has a length of 1. Encode it to UTF-8 and you'll find it has a
length of 2, because UTF-8 needs two or more bytes for every code point
from 128 (0x80) upward: the top bit of the first byte signals that more
bytes follow for this character.

If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to worry
about your program messing up because someone hands you input in a
different encoding than the one you used.

-- 
http://mail.python.org/mailman/listinfo/python-list
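
To make the character/byte distinction above concrete, here is a small sketch, assuming Python 3 (where a plain string literal is already a unicode string). The names text, encoded, utf8_len, samples and misread are made up for illustration; only str.encode and bytes.decode are real library calls.

```python
# A unicode string counts characters (code points); bytes count bytes.
text = "\u00a0"                    # NO-BREAK SPACE, a single code point
encoded = text.encode("utf-8")     # the UTF-8 encoding is b'\xc2\xa0'

print(len(text))                   # 1 character
print(len(encoded))                # 2 bytes


def utf8_len(s):
    """Return how many bytes s occupies once encoded as UTF-8."""
    return len(s.encode("utf-8"))


# UTF-8 is variable width: 1 byte for ASCII, up to 4 for higher code points.
samples = ["A", "\u00a0", "\u20ac", "\U0001d11e"]
print([utf8_len(s) for s in samples])

# Decoding with the same encoding gives back the original string.
assert encoded.decode("utf-8") == text

# Decoding with a different encoding does not fail, it silently gives
# the wrong characters: here two of them instead of one.
misread = encoded.decode("latin-1")
print(len(misread))
```

The last step is the "someone hands you input in a different encoding" problem in miniature: the bytes are unchanged, but reading them as Latin-1 turns one character into two.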