On Wed, Oct 10, 2012 at 9:23 PM, boB Stepp <robertvst...@gmail.com> wrote:
>
>> >>> aꘌꘌb = True
>> >>> aꘌꘌb
>> True
>>
>> >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ = range(1, 6)
>> >>> Ⅰ, Ⅱ, Ⅲ, Ⅳ, Ⅴ
>> (1, 2, 3, 4, 5)
>
> Is doing this considered good programming practice?

The examples were meant to highlight the absurdity of using letter
modifiers and number letters in identifiers. I should have clearly
stated that I think these names are bad.

>> bytes have string methods as a convenience, such as find, split, and
>> partition. They also have the method decode(), which uses a specified
>> encoding such as "utf-8" to create a string from an encoded bytes
>> sequence.
>
> What is the intended use of byte types?

bytes objects are important for low-level data processing, such as
file and socket I/O. The fundamental addressable value in a computer
is a byte (at least for all common, modern computers). When you write
a string to a file or socket, it has to be encoded as a sequence of
bytes.

For example, consider the character "𝟡" (MATHEMATICAL DOUBLE-STRUCK
DIGIT NINE) with decimal code 120801 (0x1d7e1 in hexadecimal):

    >>> s = "𝟡"
    >>> ord(s)
    120801

Three common ways to encode this character are as UTF-32, UTF-16, and
UTF-8. UTF-32 corresponds to the UCS4 format that a "wide" build uses
for strings in main memory (Python 3.3 instead uses a more compact
scheme that stores 1, 2, or 4 bytes per character as required).

    >>> s.encode("utf-32")
    b'\xff\xfe\x00\x00\xe1\xd7\x01\x00'

The "utf-32" string encoder also includes a byte order mark (BOM) in
the first 4 bytes of the encoded sequence (0xfffe0000). The byte order
of the BOM indicates that this is a little-endian, 4-byte encoding.

http://en.wikipedia.org/wiki/Endianness

You can use int.from_bytes() to verify that b'\xe1\xd7\x01\x00' is the
number 120801 stored as 4 bytes in little-endian order:

    >>> int.from_bytes(b'\xe1\xd7\x01\x00', 'little')
    120801

or crunch the numbers in a generator expression:

    >>> sum(x * 256**i for i,x in enumerate(b'\xe1\xd7\x01\x00'))
    120801

UTF-32 is an inefficient way to represent Unicode. Characters in the
Basic Multilingual Plane (BMP), which are by far the most common,
require at most 2 bytes.
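
To put rough numbers on that inefficiency, here is a small sketch that
is not part of the session above. The helper name encoded_sizes and
the sample characters are my own choices for illustration. It uses the
explicit "-le" codecs so that no BOM is counted, and it assumes
Python 3.

```python
# Hypothetical helper (not from the original post): byte counts for one
# character in UTF-32, UTF-16 and UTF-8, with no BOM included.
def encoded_sizes(ch):
    codecs_to_try = ("utf-32-le", "utf-16-le", "utf-8")
    return tuple(len(ch.encode(name)) for name in codecs_to_try)

samples = [
    "A",           # ASCII
    "\u00e9",      # LATIN SMALL LETTER E WITH ACUTE (BMP)
    "\u20ac",      # EURO SIGN (BMP)
    "\U0001d7e1",  # MATHEMATICAL DOUBLE-STRUCK DIGIT NINE (supplementary)
]

for ch in samples:
    # Print the code point rather than the character itself, so the
    # output does not depend on the terminal's encoding.
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

Each tuple is (UTF-32, UTF-16, UTF-8). Only the supplementary-plane
character needs all 4 bytes in every encoding; the BMP samples fit in
2 bytes of UTF-16 and 1 to 3 bytes of UTF-8.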

UTF-16 uses 2 bytes for BMP codes, like the original UCS2, and a
4-byte surrogate-pair encoding for characters in the supplementary
planes. Here's the character "𝟡" encoded as UTF-16:

    >>> list(map(hex, s.encode('utf-16')))
    ['0xff', '0xfe', '0x35', '0xd8', '0xe1', '0xdf']

Again there's a BOM, 0xfffe, which describes the order and number of
bytes per code (i.e. 2 bytes, little endian). The character itself is
stored as the surrogate pair [0xd835, 0xdfe1]. You can read more about
surrogate pair encoding in the UTF-16 Wikipedia article:

http://en.wikipedia.org/wiki/UTF-16

A "narrow" build of Python uses UCS2 + surrogates. It's not quite
UTF-16 since it doesn't treat a surrogate pair as a single character
for iteration, string length, and indexing. Python 3.3 eliminates
narrow builds.

Another common encoding is UTF-8. This maps each code to 1-4 bytes,
without requiring a BOM (though the 3-byte BOM 0xefbbbf can be used
when saving to a file). Since ASCII is so common, and since on many
systems backward compatibility with ASCII is required, UTF-8 includes
ASCII as a subset. In other words, codes below 128 are stored
unmodified as a single byte. Non-ASCII codes are encoded as 2-4 bytes.
See the UTF-8 Wikipedia article for the details:

http://en.wikipedia.org/wiki/UTF-8#Description

The character "𝟡" requires 4 bytes in UTF-8:

    >>> s = "𝟡"
    >>> sb = s.encode("utf-8")
    >>> sb
    b'\xf0\x9d\x9f\xa1'
    >>> list(sb)
    [240, 157, 159, 161]

If you iterate over the encoded bytestring, the numbers 240, 157, 159,
and 161 -- taken separately -- have no special significance. Neither
does the length of 4 tell you how many characters are in the
bytestring. With a decoded string, in contrast, you know how many
characters it has (assuming you've normalized to "NFC" format) and can
iterate through the characters in a simple for loop.
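
Two of the points above can be checked in a few lines: the arithmetic
behind the surrogate pair [0xd835, 0xdfe1], and the difference between
counting bytes and counting characters. This is only a sketch. The
function name surrogate_pair is made up for illustration, and the
len() results assume Python 3.3 or later (a narrow build would report
2 for the decoded string).

```python
import unicodedata

# Hypothetical helper (not from the original post): split a code point
# from the supplementary planes into its UTF-16 surrogate pair.
def surrogate_pair(code):
    if not 0x10000 <= code <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code - 0x10000           # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

s = "\U0001d7e1"  # MATHEMATICAL DOUBLE-STRUCK DIGIT NINE
print([hex(x) for x in surrogate_pair(ord(s))])

# Bytes versus characters: 4 encoded bytes, 1 decoded character.
sb = s.encode("utf-8")
print(len(sb), len(sb.decode("utf-8")))

# Why NFC matters for counting: "e" followed by COMBINING ACUTE ACCENT
# is 2 code points until it is composed into a single one.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))
```

The first line printed should match the surrogate pair quoted from the
UTF-16 output, and the last line shows the same visible character
shrinking from 2 code points to 1 after normalization.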

If your terminal/console uses UTF-8, you can write the UTF-8 encoded
bytes directly to the stdout buffer:

    >>> import sys
    >>> sys.stdout.buffer.write(b'\xf0\x9d\x9f\xa1' + b'\n')
    𝟡
    5

This wrote 5 bytes: 4 bytes for the "𝟡" character, plus b'\n' for a
newline.

Strings in Python 2

In Python 2, str is a bytestring. Iterating over a 2.x str yields
single-byte "characters". However, these generally aren't characters
at all (the term goes back to the "char" type of the C programming
language) unless you're working with a single-byte encoding such as
ASCII or Latin-1. In Python 2, unicode is a separate type and unicode
literals require a u prefix to distinguish them from bytestrings, just
as bytes literals in Python 3 require a b prefix to distinguish them
from strings.

Python 2.6 and 2.7 make the name "bytes" an alias for str, and they
support the b prefix in literals. These were added to ease porting to
Python 3, but bear in mind that the result is still a classic
bytestring, not a bytes object. For example, in 2.x you can use ord()
with an item of a bytestring, such as ord(b"ABC"[0]), but this won't
work in 3.x because b"ABC"[0] returns the integer 65. On the other
hand, ord(b"A") does work in 3.x.

Python 2.6 also added "__future__.unicode_literals" to make string
literals default to unicode without having to use the u prefix.
Bytestrings then require the b prefix.
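
For code that has to cope with the indexing difference, here is a
short sketch of how a Python 3 bytes object behaves. It assumes
Python 3 only, and the variable names are mine. Indexing gives an int,
while slicing gives a length-1 bytes object that still works with
ord().

```python
data = b"ABC"

first_item = data[0]     # indexing a bytes object yields an int
first_slice = data[0:1]  # slicing yields a bytes object of length 1

print(first_item)                    # the integer value of b"A"
print(first_slice)                   # a one-byte bytes object
print(ord(first_slice))              # ord() accepts a length-1 bytes object
print(bytes([first_item]) == b"A")   # rebuild bytes from the int
```

Slicing with data[0:1] returns a one-byte sequence on both 2.x and
3.x, so it is one way to write code that behaves the same on either.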