Re: UTF-8 question from Dive into Python 3
On Jan 19, 11:33 pm, Terry Reedy tjre...@udel.edu wrote:
> On 1/19/2011 1:02 PM, Tim Harig wrote:
>> Right, but I only have to do that once. After that, I can directly address any piece of the stream that I choose. If I leave the information as a simple UTF-8 stream, I would have to walk the stream again: I would have to walk through the first byte of all the characters from the beginning, to make sure that I was only counting multibyte characters once, until I found the character that I actually wanted. Converting to a fixed byte representation (UTF-32/UCS-4) or separating all of the bytes for each UTF-8 character into 6 byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former.
> The idea of using a custom fixed-width padded version of a UTF-8 stream was initially shocking to me, but I can imagine that there are specialized applications, which slice-and-dice uninterpreted segments, for which that is appropriate. However, it is not germane to the folly of prefixing standard UTF-8 streams with a 3-byte magic number, mislabelled a 'byte-order mark', thus making them non-standard.
Unicode Book, 5.2.0, Chapter 2, Section 14, Page 51 - Paragraph *Unicode Signature*.
-- http://mail.python.org/mailman/listinfo/python-list
Re: UTF-8 question from Dive into Python 3
On 2011-01-19, Tim Roberts t...@probo.com wrote:
> Tim Harig user...@ilthio.net wrote:
>> On 2011-01-17, carlo syseng...@gmail.com wrote:
>>> 2- If that were true, can you point me to some documentation about the math that, as Mark says, demonstrates this?
>> It is true because UTF-8 is essentially an 8 bit encoding that resorts to the next byte once it exhausts the addressable space of the current one. Since the bytes are accessed and assessed sequentially, they must be in big-endian order.
> You were doing excellently up to that last phrase. Endianness only applies when you treat a series of bytes as a larger entity. That doesn't apply to UTF-8. None of the bytes is more significant than any other, so by definition it is neither big-endian nor little-endian.
It depends how you process it and it doesn't generally make much difference in Python. Accessing UTF-8 data from C can be much trickier if you use a multibyte type to store the data. In that case, if you happen to be on a little-endian architecture, it may be necessary to remember that the data is not in the order that your processor expects it to be for numeric operations and comparisons. That is why the FAQ I linked to says yes to the fact that you can consider UTF-8 to always be in big-endian order. Essentially all byte based data is big-endian.
Re: UTF-8 question from Dive into Python 3
On Wed, 19 Jan 2011 11:34:53 + (UTC) Tim Harig user...@ilthio.net wrote:
> That is why the FAQ I linked to says yes to the fact that you can consider UTF-8 to always be in big-endian order.
It certainly doesn't. Read better.
> Essentially all byte based data is big-endian.
This is pure nonsense.
Re: UTF-8 question from Dive into Python 3
Considering your post contained no information or evidence for your negations, I shouldn't even bother responding. I will bite once. Hopefully next time your arguments will contain some pith.
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
> On Wed, 19 Jan 2011 11:34:53 + (UTC) Tim Harig user...@ilthio.net wrote:
>> That is why the FAQ I linked to says yes to the fact that you can consider UTF-8 to always be in big-endian order.
> It certainly doesn't. Read better.
- Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
-
- A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature -- an indication that an otherwise unmarked text file is in UTF-8. Note that some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8 is used transparently in 8-bit environments, the use of a BOM will interfere with any protocol or file format that expects specific ASCII characters at the beginning, such as the use of #! at the beginning of Unix shell scripts.
The question that was not addressed was whether you can consider UTF-8 to be little endian. I pointed out why you cannot always make that assumption in my previous post. UTF-8 has no apparent endianness if you only store it as a byte stream. It does however have a byte order. If you store it using multibytes (six bytes for all UTF-8 possibilities), which is useful if you want to have one storage container for each letter as opposed to one for each byte(1), the bytes will still have the same order but you have interrupted its sole existence as a byte stream and have returned it to the underlying multibyte oriented representation.
If you attempt any numeric or binary operations on what is now a multibyte sequence, the processor will interpret the data using its own endian rules. If your processor is big-endian, then you don't have any problems. The processor will interpret the data in the order that it is stored. If your processor is little endian, then it will effectively change the order of the bytes for its own evaluation. So, you can always assume big-endian order and things will work out correctly, while you cannot always make the same assumption of little endian without potential issues. The same holds true for any byte stream data.
That is why I say that byte streams are essentially big endian. It is all a matter of how you look at it. I prefer to look at all data as endian even if it doesn't create endian issues because it forces me to consider any endian issues that might arise. If none do, I haven't really lost anything. If you simply assume that any byte sequence cannot have endian issues you ignore the possibility that such issues might arise. When an issue like the above does, you end up with a potential bug.
(1) For unicode it is probably better to convert the characters to UTF-32/UCS-4 for internal processing; but, creating a container large enough to hold any length of UTF-8 character will work.
Re: UTF-8 question from Dive into Python 3
On Wed, 19 Jan 2011 14:00:13 + (UTC) Tim Harig user...@ilthio.net wrote:
> - Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?
> -
> - A: Yes, UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order.
Which certainly doesn't mean that byte order can be called big endian for any recognized definition of the latter. Similarly, ASCII text has its own order which certainly can't be characterized as either little endian or big endian.
> UTF-8 has no apparent endianness if you only store it as a byte stream. It does however have a byte order. If you store it using multibytes (six bytes for all UTF-8 possibilities), which is useful if you want to have one storage container for each letter as opposed to one for each byte(1)
That's a ridiculous proposition. Why would you waste so much space? UTF-8 exists *precisely* so that you can save space with most scripts. If you are ready to use 4+ bytes per character, just use UTF-32 which has much nicer properties. Bottom line: you are not describing UTF-8, only your own foolish interpretation of it. UTF-8 does not have any endianness since it is a byte stream and does not care about machine words.
Antoine.
Re: UTF-8 question from Dive into Python 3
On Jan 19, 9:00 am, Tim Harig user...@ilthio.net wrote:
> So, you can always assume big-endian order and things will work out correctly, while you cannot always make the same assumption of little endian without potential issues. The same holds true for any byte stream data.
You need to spend some serious time programming a serial port or other byte/bit-stream oriented interface, and then you'll realize the folly of your statement.
> That is why I say that byte streams are essentially big endian. It is all a matter of how you look at it.
It is nothing of the sort. Some byte streams are, in fact, little endian: when the bytes are combined into larger objects, the least-significant byte in the object comes first. A lot of industrial/embedded stuff has byte streams with LSB leading in the sequence; CAN comes to mind as an example. The only way to know is for the standard describing the stream to tell you what to do.
> I prefer to look at all data as endian even if it doesn't create endian issues because it forces me to consider any endian issues that might arise. If none do, I haven't really lost anything. If you simply assume that any byte sequence cannot have endian issues you ignore the possibility that such issues might arise.
No, you must assume nothing unless you're told how to combine the bytes within a sequence into a larger element. Plus, not all byte streams support such operations! Some byte streams really are just a sequence of bytes and the bytes within the stream cannot be meaningfully combined into larger data types. If I give you a series of 8-bit (so 1 byte) samples from an analog-to-digital converter, tell me how to combine them into a 16, 32, or 64-bit integer. You cannot do it without altering the meaning of the samples; it is a completely nonsensical operation.
Adam
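To illustrate the point that byte order only matters once bytes are combined into a larger value, here is a minimal Python sketch. The two byte values are made up for illustration and do not come from any real protocol.

```python
# Two bytes in the order they arrive on the wire (made-up values).
raw = bytes([0x12, 0x34])

# The stream alone does not say how to combine them; the protocol's
# specification has to. Both readings below are equally "valid" bit patterns.
as_big = int.from_bytes(raw, byteorder="big")        # 0x1234
as_little = int.from_bytes(raw, byteorder="little")  # 0x3412

# If the bytes are independent 8-bit samples, no combination is meaningful
# and they are simply read one at a time.
samples = list(raw)  # [18, 52]

print(hex(as_big), hex(as_little), samples)
```

The same two bytes give two different integers, and a third reading treats them as unrelated samples, so the byte order is a property of the interpretation rather than of the stream.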
Re: UTF-8 question from Dive into Python 3
On 2011-01-19, Adam Skutt ask...@gmail.com wrote:
> On Jan 19, 9:00 am, Tim Harig user...@ilthio.net wrote:
>> That is why I say that byte streams are essentially big endian. It is all a matter of how you look at it.
> It is nothing of the sort. Some byte streams are, in fact, little endian: when the bytes are combined into larger objects, the least-significant byte in the object comes first. A lot of industrial/embedded stuff has byte streams with LSB leading in the sequence; CAN comes to mind as an example.
You are correct. Point well made.
Re: UTF-8 question from Dive into Python 3
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
> On Wed, 19 Jan 2011 14:00:13 + (UTC) Tim Harig user...@ilthio.net wrote:
>> UTF-8 has no apparent endianness if you only store it as a byte stream. It does however have a byte order. If you store it using multibytes (six bytes for all UTF-8 possibilities), which is useful if you want to have one storage container for each letter as opposed to one for each byte(1)
> That's a ridiculous proposition. Why would you waste so much space?
Space is only one tradeoff. There are many others to consider. I have created data structures with much higher overhead than that because they happen to make the problem easier and significantly faster for the operations that I am performing on the data. For many operations, it is just much faster and simpler to use a single character based container as opposed to having to process an entire byte stream to determine individual letters from the bytes or to having adaptive size containers to store the data.
> UTF-8 exists *precisely* so that you can save space with most scripts.
UTF-8 has many reasons for existing. One of the biggest is that it is compatible with tools that were designed to process ASCII and other 8-bit encodings.
> If you are ready to use 4+ bytes per character, just use UTF-32 which has much nicer properties.
I already mentioned UTF-32/UCS-4 as a probable alternative; but, I might not want to have to worry about converting the encodings back and forth before and after processing them. That said, and more importantly, many variable length byte streams may not have alternate representations as unicode does.
Re: UTF-8 question from Dive into Python 3
On Wed, 19 Jan 2011 16:03:11 + (UTC) Tim Harig user...@ilthio.net wrote:
> For many operations, it is just much faster and simpler to use a single character based container as opposed to having to process an entire byte stream to determine individual letters from the bytes or to having adaptive size containers to store the data.
You *have* to process the entire byte stream in order to determine boundaries of individual letters from the bytes if you want to use a character based container, regardless of the exact representation. Once you do that it shouldn't be very costly to compute the actual code points. So, "much faster" sounds a bit dubious to me; especially if you factor in the cost of memory allocation, and the fact that a larger container will fit less easily in a data cache.
> That said, and more importantly, many variable length byte streams may not have alternate representations as unicode does.
This whole thread is about UTF-8 (see title) so I'm not sure what kind of relevance this is supposed to have.
Re: UTF-8 question from Dive into Python 3
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
> On Wed, 19 Jan 2011 16:03:11 + (UTC) Tim Harig user...@ilthio.net wrote:
>> For many operations, it is just much faster and simpler to use a single character based container as opposed to having to process an entire byte stream to determine individual letters from the bytes or to having adaptive size containers to store the data.
> You *have* to process the entire byte stream in order to determine boundaries of individual letters from the bytes if you want to use a character based container, regardless of the exact representation.
Right, but I only have to do that once. After that, I can directly address any piece of the stream that I choose. If I leave the information as a simple UTF-8 stream, I would have to walk the stream again: I would have to walk through the first byte of all the characters from the beginning, to make sure that I was only counting multibyte characters once, until I found the character that I actually wanted. Converting to a fixed byte representation (UTF-32/UCS-4) or separating all of the bytes for each UTF-8 character into 6 byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former. UTF-32/UCS-4 conversion is definitely superior if you are actually doing any major [processing], but it adds the complexity and overhead of requiring the bit twiddling to make the conversions (once in, once again out). Some programs don't really care enough about what the data actually contains to make it worthwhile. They just want to be able to use the characters as black boxes.
> Once you do that it shouldn't be very costly to compute the actual code points. So, much faster sounds a bit dubious to me; especially if you [...]
You could, I suppose, keep a separate list of pointers to each letter so that you could use the pointer list for indexing, or keep a list of the character sizes so that you can add them and calculate the variable width index; but, that adds overhead as well.
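The "walk the stream" cost described above can be sketched in Python as follows. The helper name `char_at` is hypothetical, and the sketch assumes its input is already valid UTF-8; it counts lead bytes (any byte that is not of the form 10xxxxxx) until it reaches the requested character.

```python
def char_at(data: bytes, index: int) -> str:
    """Return the index-th character of valid UTF-8 data by walking lead bytes.

    Hypothetical helper for illustration; it does not validate the input.
    """
    if index < 0:
        raise IndexError(index)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte, so a character starts here
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(index)
    return data[start:].decode("utf-8")


encoded = "a\u20acb".encode("utf-8")  # 'a', EURO SIGN, 'b' -> 5 bytes
print(char_at(encoded, 1) == "\u20ac")          # True
print(encoded.decode("utf-8")[1] == "\u20ac")   # True: decode once, then index directly
```

Every lookup through `char_at` scans from the start of the buffer, whereas decoding once to a `str` pays the scan a single time, which is the trade-off both posters are arguing about.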
Re: UTF-8 question from Dive into Python 3
On Wed, 19 Jan 2011 18:02:22 + (UTC) Tim Harig user...@ilthio.net wrote:
> On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
>> On Wed, 19 Jan 2011 16:03:11 + (UTC) Tim Harig user...@ilthio.net wrote:
>>> For many operations, it is just much faster and simpler to use a single character based container as opposed to having to process an entire byte stream to determine individual letters from the bytes or to having adaptive size containers to store the data.
>> You *have* to process the entire byte stream in order to determine boundaries of individual letters from the bytes if you want to use a character based container, regardless of the exact representation.
> Right, but I only have to do that once.
You only have to decode once as well.
> If I leave the information as a simple UTF-8 stream,
That's not what we are talking about. We are talking about the supposed benefits of your 6-byte representation scheme versus proper decoding into fixed width code points.
> UTF-32/UCS-4 conversion is definitely superior if you are actually doing any major [processing], but it adds the complexity and overhead of requiring the bit twiddling to make the conversions (once in, once again out).
Bit twiddling is not something processors are particularly bad at. Actually, modern processors are much better at arithmetic and logic than at recovering from mispredicted branches, which seems to suggest that discovering boundaries probably eats most of the CPU cycles.
> Converting to a fixed byte representation (UTF-32/UCS-4) or separating all of the bytes for each UTF-8 character into 6 byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former.
Indeed, Python chose the wise option. Actually, I'd be curious about any real-world software which successfully chose your proposed approach.
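As a rough illustration of "proper decoding into fixed width code points", the following Python sketch re-encodes text as UTF-32 and indexes it in constant-size steps. The helper name `code_point_at` is hypothetical, and little-endian order is chosen explicitly here because UTF-32, unlike UTF-8, has to pick one.

```python
text = "a\u20acb"  # 'a', EURO SIGN, 'b'

# An explicit byte order in the codec name means no BOM is written.
utf32 = text.encode("utf-32-le")


def code_point_at(buf: bytes, index: int) -> int:
    """Read the index-th 4-byte unit of little-endian UTF-32 (illustrative helper)."""
    unit = buf[4 * index : 4 * index + 4]
    return int.from_bytes(unit, "little")


print(len(utf32))                    # 12: three characters, four bytes each
print(hex(code_point_at(utf32, 1)))  # 0x20ac
```

Indexing is a multiplication rather than a scan, and the byte-order choice only appears at the point where four bytes are combined into one integer.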
Re: UTF-8 question from Dive into Python 3
On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
> On Wed, 19 Jan 2011 18:02:22 + (UTC) Tim Harig user...@ilthio.net wrote:
>> Converting to a fixed byte representation (UTF-32/UCS-4) or separating all of the bytes for each UTF-8 character into 6 byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former.
> Indeed, Python chose the wise option. Actually, I'd be curious about any real-world software which successfully chose your proposed approach.
The point is basically the same. I created an example because it was simpler to follow for demonstration purposes than an actual UTF-8 conversion to any official multibyte format. You obviously have no other purpose than to be contrary, so we ended up following tangents. As soon as you start to convert to a multibyte format the endian issues occur. For UTF-8 on big endian hardware, this is anti-climactic because all of the bits are already stored in proper order. Little endian systems will probably convert to a native endian format. If you choose to ignore that, that is your prerogative. Have a nice day.
Re: UTF-8 question from Dive into Python 3
On Wed, 19 Jan 2011 19:18:49 + (UTC) Tim Harig user...@ilthio.net wrote:
> On 2011-01-19, Antoine Pitrou solip...@pitrou.net wrote:
>> On Wed, 19 Jan 2011 18:02:22 + (UTC) Tim Harig user...@ilthio.net wrote:
>>> Converting to a fixed byte representation (UTF-32/UCS-4) or separating all of the bytes for each UTF-8 character into 6 byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former.
>> Indeed, Python chose the wise option. Actually, I'd be curious about any real-world software which successfully chose your proposed approach.
> The point is basically the same. I created an example because it was simpler to follow for demonstration purposes than an actual UTF-8 conversion to any official multibyte format. You obviously have no other purpose than to be contrary [...]
Right. You were the one who jumped in and tried to lecture everyone on how UTF-8 was big-endian, and now you are abandoning the one esoteric argument you found in support of that.
> As soon as you start to convert to a multibyte format the endian issues occur.
Ok. Good luck with your endian issues which don't exist.
Re: UTF-8 question from Dive into Python 3
On 1/19/2011 1:02 PM, Tim Harig wrote:
> Right, but I only have to do that once. After that, I can directly address any piece of the stream that I choose. If I leave the information as a simple UTF-8 stream, I would have to walk the stream again: I would have to walk through the first byte of all the characters from the beginning, to make sure that I was only counting multibyte characters once, until I found the character that I actually wanted. Converting to a fixed byte representation (UTF-32/UCS-4) or separating all of the bytes for each UTF-8 character into 6 byte containers both make it possible to simply index the letters by a constant size. You will note that Python does the former.
The idea of using a custom fixed-width padded version of a UTF-8 stream was initially shocking to me, but I can imagine that there are specialized applications, which slice-and-dice uninterpreted segments, for which that is appropriate. However, it is not germane to the folly of prefixing standard UTF-8 streams with a 3-byte magic number, mislabelled a 'byte-order mark', thus making them non-standard.
-- Terry Jan Reedy
Re: UTF-8 question from Dive into Python 3
On Jan 17, 2:19 pm, carlo syseng...@gmail.com wrote:
> Hi, recently I had to study Unicode and encodings *seriously* for a project in Python, but I was left with a couple of doubts that arose after reading the Unicode chapter of the Dive into Python 3 book by Mark Pilgrim.
> 1- Mark says: "Also (and you’ll have to trust me on this, because I’m not going to show you the math), due to the exact nature of the bit twiddling, there are no byte-ordering issues. A document encoded in UTF-8 uses the exact same stream of bytes on any computer."
> . . .
> 2- If that were true, can you point me to some documentation about the math that, as Mark says, demonstrates this?
I believe Mark was referring to the bit-twiddling described in the Design section at http://en.wikipedia.org/wiki/UTF-8 .
Raymond
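For readers who want to see the bit twiddling spelled out, here is a small hand-written encoder in Python. The function name `utf8_encode` is a hypothetical helper written for this illustration (Python's own `str.encode` is what real code would use); it follows the standard one- to four-byte UTF-8 layouts.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x20AC).hex())  # e282ac, same as "\u20ac".encode("utf-8").hex()
```

The output is built one byte at a time with shifts and masks on the code point, so no multi-byte machine word is ever written to memory, and that is why the resulting byte sequence is the same on every machine.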
Re: UTF-8 question from Dive into Python 3
Tim Harig user...@ilthio.net wrote:
> On 2011-01-17, carlo syseng...@gmail.com wrote:
>> 2- If that were true, can you point me to some documentation about the math that, as Mark says, demonstrates this?
> It is true because UTF-8 is essentially an 8 bit encoding that resorts to the next byte once it exhausts the addressable space of the current one. Since the bytes are accessed and assessed sequentially, they must be in big-endian order.
You were doing excellently up to that last phrase. Endianness only applies when you treat a series of bytes as a larger entity. That doesn't apply to UTF-8. None of the bytes is more significant than any other, so by definition it is neither big-endian nor little-endian.
-- Tim Roberts, t...@probo.com Providenza Boekelheide, Inc.
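A quick way to see the difference in Python is to compare UTF-8 with UTF-16, whose code units are 16-bit values and therefore do have a byte order. The character below is an arbitrary choice for illustration.

```python
text = "\u20ac"  # EURO SIGN, chosen arbitrarily

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

print(utf8.hex())      # e282ac  (one defined byte sequence, on any machine)
print(utf16_le.hex())  # ac20    (16-bit unit 0x20AC, low byte first)
print(utf16_be.hex())  # 20ac    (same unit, high byte first)
```

UTF-16 needs the "-le"/"-be" suffix (or a BOM) precisely because each unit is a larger entity built from two bytes; UTF-8 has no such variants.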
Re: UTF-8 question from Dive into Python 3
On 17.01.2011 23:19, carlo wrote:
> Is it true UTF-8 does not have any big-endian/little-endian issue because of its encoding method? And if it is true, why does Mark (and everyone else) write about UTF-8 with and without BOM some chapters later? What would the purpose of the BOM be then?
Can't answer your other questions, but the UTF-8 BOM is simply a marker saying "This is a UTF-8 text file, not an ASCII text file". If I'm not wrong, this was a Microsoft invention and surely one of their brightest ideas. I really wish that this had been done for ANSI some decades ago. Determining the encoding for text files is hard to impossible because such a mark was never introduced.
Re: UTF-8 question from Dive into Python 3
On 2011-01-17, carlo syseng...@gmail.com wrote:
> Is it true UTF-8 does not have any big-endian/little-endian issue because of its encoding method? And if it is true, why does Mark (and everyone else) write about UTF-8 with and without BOM some chapters later? What would the purpose of the BOM be then?
Yes, it is true. The BOM simply identifies the encoding as UTF-8: http://unicode.org/faq/utf_bom.html#bom5
> 2- If that were true, can you point me to some documentation about the math that, as Mark says, demonstrates this?
It is true because UTF-8 is essentially an 8 bit encoding that resorts to the next byte once it exhausts the addressable space of the current one. Since the bytes are accessed and assessed sequentially, they must be in big-endian order.
Re: UTF-8 question from Dive into Python 3
On Mon, 17 Jan 2011 14:19:13 -0800 (PST) carlo syseng...@gmail.com wrote:
> Is it true UTF-8 does not have any big-endian/little-endian issue because of its encoding method?
Yes.
> And if it is true, why does Mark (and everyone else) write about UTF-8 with and without BOM some chapters later? What would the purpose of the BOM be then?
"BOM" in this case is a misnomer. For UTF-8, it is only used as a marker (a magic number, if you like) to signal that a given text file is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and, actually, it does not change with endianness. (Note that it is not required to put a UTF-8 "BOM" at the beginning of text files; it is just a hint that some tools use when generating/reading UTF-8.)
> 2- If that were true, can you point me to some documentation about the math that, as Mark says, demonstrates this?
Math? UTF-8 is simply a byte-oriented (rather than word-oriented) encoding. There is no math involved, it just works by construction.
Regards
Antoine.
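In Python this can be checked with the constants in the standard `codecs` module and the "utf-8-sig" codec. The sample payload "hello" is made up for illustration.

```python
import codecs

# The UTF-16 marks differ by byte order; the UTF-8 signature has only one form.
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'

# A made-up file body that starts with the UTF-8 signature.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

print(repr(data.decode("utf-8")))      # '\ufeffhello' (signature kept as U+FEFF)
print(repr(data.decode("utf-8-sig")))  # 'hello'       (signature stripped if present)
```

The three signature bytes are just U+FEFF encoded as UTF-8, so they act as a file-type hint and carry no byte-order information.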
Re: UTF-8 question from Dive into Python 3
On 17 Jan, 23:34, Antoine Pitrou solip...@pitrou.net wrote:
> On Mon, 17 Jan 2011 14:19:13 -0800 (PST) carlo syseng...@gmail.com wrote:
>> Is it true UTF-8 does not have any big-endian/little-endian issue because of its encoding method?
> Yes.
>> And if it is true, why does Mark (and everyone else) write about UTF-8 with and without BOM some chapters later? What would the purpose of the BOM be then?
> "BOM" in this case is a misnomer. For UTF-8, it is only used as a marker (a magic number, if you like) to signal that a given text file is UTF-8. The UTF-8 "BOM" does not say anything about byte order; and, actually, it does not change with endianness. (Note that it is not required to put a UTF-8 "BOM" at the beginning of text files; it is just a hint that some tools use when generating/reading UTF-8.)
>> 2- If that were true, can you point me to some documentation about the math that, as Mark says, demonstrates this?
> Math? UTF-8 is simply a byte-oriented (rather than word-oriented) encoding. There is no math involved, it just works by construction.
> Regards
> Antoine.
Thank you all. I eventually found http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf#G7404 which clears it up. No math in fact, as Tim and Antoine pointed out.