On 2011-01-19, Antoine Pitrou <solip...@pitrou.net> wrote:
> On Wed, 19 Jan 2011 16:03:11 +0000 (UTC)
> Tim Harig <user...@ilthio.net> wrote:
>>
>> For many operations, it is just much faster and simpler to use a single
>> character based container opposed to having to process an entire byte
>> stream to determine individual letters from the bytes or to having
>> adaptive size containers to store the data.
>
> You *have* to "process the entire byte stream" in order to determine
> boundaries of individual letters from the bytes if you want to use a
> "character based container", regardless of the exact representation.

Right, but I only have to do that once. After that, I can directly
address any piece of the stream that I choose. If I leave the information
as a simple UTF-8 stream, I have to walk the stream again for every
lookup: starting from the beginning, I have to examine the first byte of
each character, so that each multibyte character is counted only once,
until I reach the character that I actually want. Converting to a
fixed-width representation (UTF-32/UCS-4), or placing the bytes of each
UTF-8 character into its own 6-byte container, makes it possible to index
the letters by a constant size. You will note that Python does the former.

UTF-32/UCS-4 conversion is definitely superior if you are actually doing
any major processing, but it adds the complexity and overhead of the bit
twiddling needed to make the conversions (once in, once again out). Some
programs don't really care enough about what the data actually contains
to make it worthwhile. They just want to be able to use the characters as
black boxes.

> Once you do that it shouldn't be very costly to compute the actual code
> points. So, "much faster" sounds a bit dubious to me; especially if you

You could, I suppose, keep a separate list of pointers to each letter so
that you could use the pointer list for indexing, or keep a list of the
character sizes so that you can add them up and calculate the
variable-width index; but that adds overhead as well.
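
To make the trade-offs concrete, here is a minimal Python 3 sketch of the
three approaches discussed above: walking the UTF-8 stream on every
lookup, keeping a separate list of character start offsets, and
converting once to fixed-width UTF-32. It is an illustration only, not
how CPython stores strings internally. All function names are
hypothetical, and the input is assumed to be valid UTF-8.

```python
# Illustrative sketch only; every function name here is hypothetical
# and the input is assumed to be well-formed UTF-8.

def utf8_index_walk(data, n):
    """Return character n of UTF-8 bytes by scanning from the start."""
    if n < 0:
        raise IndexError(n)
    count = -1      # index of the character whose lead byte was seen last
    start = None    # byte offset where character n begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not a continuation byte (10xxxxxx)
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:     # next character begins, so n ends here
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(n)
    return data[start:].decode("utf-8")   # character n was the last one


def build_offsets(data):
    """One pass over the bytes: record where each character starts."""
    return [pos for pos, byte in enumerate(data) if byte & 0xC0 != 0x80]


def utf8_index_table(data, offsets, n):
    """Return character n using the precomputed offset list."""
    start = offsets[n]
    end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
    return data[start:end].decode("utf-8")


def utf32_index(fixed, n):
    """Return character n of BOM-less UTF-32-LE bytes: constant-size slice."""
    return fixed[4 * n:4 * n + 4].decode("utf-32-le")


if __name__ == "__main__":
    # "na\u00efve \u20ac" contains a 2-byte and a 3-byte UTF-8 character.
    sample = "na\u00efve \u20ac".encode("utf-8")
    offsets = build_offsets(sample)                      # extra list of ints
    fixed = sample.decode("utf-8").encode("utf-32-le")   # 4 bytes per char
    for i in range(len(offsets)):
        print(i,
              utf8_index_walk(sample, i),
              utf8_index_table(sample, offsets, i),
              utf32_index(fixed, i))
```

The walking version costs time proportional to the position on every
lookup. The offset list and the UTF-32 copy both pay one pass up front
and then index in constant time, at the price of extra memory: one
integer per character for the list, four bytes per character for UTF-32.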