On Mon, Oct 8, 2012 at 10:35 PM, boB Stepp <robertvst...@gmail.com> wrote:
>
> I am not up (yet) on the details of Unicode that Python 3 defaults to
> for strings, but I believe I comprehend the general concept. Looking
> at the string escape table of chapter 2 it appears that Unicode
> characters can be either 16-bit or 32-bit. That must be a lot of
> potential characters!

There are 1114112 possible codes (65536 codes/plane * 17 planes), but
some are reserved, and only about 10% are assigned. Here's a list by
category:

http://www.fileformat.info/info/unicode/category/index.htm

Python 3 lets you use any Unicode letter in an identifier, including
modifier letters ("Lm") and letter numbers ("Nl"). For example:

    >>> aꘌꘌb = True
    >>> aꘌꘌb
    True

    >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ = range(1, 6)
    >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ
    (1, 2, 3, 4, 5)

A potential gotcha in Unicode is the design choice to have both
[C]omposed and [D]ecomposed forms of characters. For example:

    >>> from unicodedata import name, normalize

    >>> s1 = "ü"
    >>> name(s1)
    'LATIN SMALL LETTER U WITH DIAERESIS'

    >>> s2 = normalize("NFD", s1)
    >>> list(map(name, s2))
    ['LATIN SMALL LETTER U', 'COMBINING DIAERESIS']

These combine as one glyph when printed:

    >>> print(s2)
    ü

Different forms of the 'same' character won't compare as equal unless
you first normalize them to the same form:

    >>> s1 == s2
    False
    >>> normalize("NFC", s1) == normalize("NFC", s2)
    True

> I don't see a mention of byte strings mentioned in the index of my
> text. Are these just the ASCII character set?

A bytes object (and its mutable cousin bytearray) is a sequence of
numbers, each in the range of a byte (0-255). bytes literals start
with b, such as b'spam', and can only contain ASCII characters; the
repr of a bytes object is likewise limited to ASCII. Slicing returns
a new bytes object, but indexing or iterating returns integer values:

    >>> b'spam'[:3]
    b'spa'
    >>> b'spam'[0]
    115
    >>> list(b'spam')
    [115, 112, 97, 109]

bytes objects have string methods as a convenience, such as find,
split, and partition. They also have the method decode(), which uses
a specified encoding such as "utf-8" to create a string from an
encoded bytes sequence.
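
To make the decode() step concrete, here is a minimal sketch of a round trip between str and bytes. The sample text "spüm" and the variable names are made up for illustration, and str.encode() is used as the counterpart of decode(); treat it as one way to explore the behavior, not a recipe.

```python
# Minimal sketch: round trip between str and bytes with an explicit encoding.
# The sample text and all variable names here are illustrative assumptions.
text = "sp\u00fcm"                  # "spüm", with the composed form of ü (U+00FC)

data = text.encode("utf-8")         # str -> bytes
# ü needs two bytes in UTF-8, so the byte count exceeds the character count.
lengths = (len(text), len(data))

restored = data.decode("utf-8")     # bytes -> str, using the same encoding

# Decoding with a different codec may not raise at all; latin-1 maps every
# byte to a character, so the result is silently wrong text.
wrong = data.decode("latin-1")

# Bytes that are not valid in the chosen encoding raise UnicodeDecodeError.
try:
    b"\xfc".decode("utf-8")         # a lone 0xFC byte is not valid UTF-8
    failed = False
except UnicodeDecodeError:
    failed = True

print(data, lengths, restored == text, failed)
```

The point of the latin-1 line is that a wrong encoding is not always detected: you have to know which encoding produced the bytes, since the bytes object itself does not record it.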