| From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
| Date: Sun, 25 Mar 2007 12:46:49 +0200
|
| On Sat, 24-03-2007, at 13:31 -0400, [EMAIL PROTECTED] wrote:
|
| > Summary
| > "This document attempts to make the case that it is advantageous to use
| > UTF-16 (or 16-bit Unicode strings) for text processing..."
|
| IMHO this is one of the worst mistakes Unicode is trying to make.
| It convinces people that they should not worry about characters above
| U+FFFF just because they are very rare. UTF-16 combines the worst
| aspects of UTF-8 and UTF-32.
|
| If size is important and variable width of the representation of a code
| point is acceptable, then UTF-8 is usually a better choice. If O(1)
| indexing by code points is important, then UTF-32 is better. Nobody
| wants to process texts in terms of UTF-16 code units. Nobody wants to
| have surrogate processing sprinkled around the code, and so if one
| accepts an API which extracts variable-width characters, then the API
| could just as well deal with UTF-8, which is better for interoperability.
| UTF-16 makes no sense.
I agree. There also seems to be a hidden assumption in some posts that
character alignment can only be recovered if a string is scanned from
the beginning. This is not the case: character alignment can be
discovered from any octet within a UTF-8 encoded string. The octet
which begins a code point can never be mistaken for the subsequent
octets, which always have the most significant two bits #b10. There
are algorithms (like binary search) which access a string at
approximate locations; the asymptotic running time of such algorithms
will not be impacted by using strings coded in UTF-8.

_______________________________________________
r6rs-discuss mailing list
[EMAIL PROTECTED]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
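[Editorial aside: the resynchronization scan described above can be sketched in a few lines. This is a minimal illustration in Python, not code from the thread; the function name `codepoint_start` is made up for the example.]

```python
def codepoint_start(data: bytes, i: int) -> int:
    """Return the index of the first octet of the UTF-8 code point
    that contains octet i of data."""
    # Continuation octets always have their two most significant bits
    # set to #b10 (i.e. 0b10xxxxxx); lead octets never do.  So from
    # any octet, scanning backward at most 3 positions finds the
    # start of the current code point.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

s = "naïve".encode("utf-8")
# "ï" encodes as the two octets 0xC3 0xAF at indices 2 and 3;
# landing on the continuation octet at index 3 resynchronizes to 2.
assert codepoint_start(s, 3) == 2
assert codepoint_start(s, 1) == 1  # ASCII "a" is its own start
```

Because the backward scan is bounded by a constant (at most three continuation octets), a binary search that probes a UTF-8 string at arbitrary octet positions keeps its O(log n) running time.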
