On Sunday, 23 June 2013 18:30:40 UTC+2, Steven D'Aprano wrote:
> On Sun, 23 Jun 2013 08:51:41 -0700, wxjmfauth wrote:
>
> > utf-8: how many bytes to hold an "a" in memory? one byte.
> >
> > flexible string representation: how many bytes to hold an "a" in memory?
> > One byte? No, two. (Funny, it consumes more memory to hold an ascii char
> > than ascii itself)
>
> Incorrect. Python strings have overhead because they are objects, so
> let's see the difference adding a single character makes:
>
> # Python 3.3, with the hated flexible string representation:
> py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
> 1
>
> # Python 3.2:
> py> sys.getsizeof('a'*100) - sys.getsizeof('a'*99)
> 4
>
> How about a French é character? Of course, ASCII cannot store it *at
> all*, but let's see what Python can do:
>
> # The hated Python 3.3 again:
> py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
> 1
>
> # And Python 3.2:
> py> sys.getsizeof('é'*100) - sys.getsizeof('é'*99)
> 4
>
> > utf-8: In a series of bytes implementing the encoded code points
> > supposed to hold a string, picking a byte and finding to which encoded
> > code point it belongs is no problem.
>
> Incorrect. UTF-8 is unsuitable for random access, since it has
> variable-width characters, anything from 1 to 4 bytes. So you cannot just
> jump directly to character 1000 in a block of text, you have to inspect
> each byte one-by-one to decide whether it is a 1, 2, 3 or 4 byte character.
>
> > flexible string representation: In a series of bytes implementing the
> > encoded code points supposed to hold a string, picking a byte and
> > finding to which encoded code point it belongs is ... impossible !
>
> Incorrect. It is absolutely trivial. Each string is marked as either
> 1-byte, 2-byte or 4-byte. If it is a 1-byte string, then each byte is one
> character. If it is a 2-byte string, then it is just like Python 3.2
> narrow build, and each two bytes is a character. If it is a 4-byte
> string, then it is just like Python 3.2 wide build, and each four bytes
> is a character. Within a single string, the number of bytes per character
> is fixed, and random access is easy and fast.
>
> --
> Steven
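
For anyone who wants to check the quoted claims, here is a minimal sketch, assuming CPython 3.3 or later (where the flexible string representation applies). The helper names bytes_per_char, utf8_offset_of and fixed_width_offset are made up for illustration and are not part of any library. The first repeats the getsizeof measurement from the quoted reply for a few more characters; the other two contrast locating character number N in UTF-8 bytes (a scan from the start) with locating it in a fixed-width buffer (a multiplication).

```python
import sys


def bytes_per_char(ch, n=100):
    """Marginal cost in bytes of one more copy of `ch` in a str.

    Relies on sys.getsizeof, so the result is CPython-specific.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


def utf8_offset_of(data, index):
    """Byte offset of code point number `index` in UTF-8 `data`.

    The width of each code point varies, so this walks the bytes from
    the start: O(n) in the length of the data.
    """
    count = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point.
        if byte & 0xC0 != 0x80:
            if count == index:
                return offset
            count += 1
    raise IndexError(index)


def fixed_width_offset(index, width):
    """Byte offset of code point number `index` when every code point
    in the string takes `width` bytes (1, 2 or 4): plain arithmetic."""
    return index * width


if __name__ == "__main__":
    # ASCII, Latin-1, BMP and non-BMP sample characters.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
        print(ascii(ch), bytes_per_char(ch))

    data = "a\u00e9\u20acb".encode("utf-8")  # 1 + 2 + 3 + 1 bytes
    print(utf8_offset_of(data, 3), fixed_width_offset(3, 2))
```

On CPython 3.3+ the loop should report 1, 1, 2 and 4 bytes per extra character: the width is chosen per string from its widest code point, which is the "marked as either 1-byte, 2-byte or 4-byte" behaviour described in the quoted reply. Other implementations may report different numbers.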

:-)

-- 
http://mail.python.org/mailman/listinfo/python-list