| From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
| Date: Sun, 25 Mar 2007 12:46:49 +0200
|
| On Sat, 24-03-2007, at 13:31 -0400, [EMAIL PROTECTED] wrote:
|
| > Summary
| > "This document attempts to make the case that it is advantageous to use
| > UTF-16 (or 16-bit Unicode strings) for text processing..."
|
| IMHO this is one of the worst mistakes Unicode is trying to make.
| It convinces people that they should not worry about characters above
| U+FFFF just because they are very rare. UTF-16 combines the worst
| aspects of UTF-8 and UTF-32.
|
| If size is important and variable width of the representation of a code
| point is acceptable, then UTF-8 is usually a better choice. If O(1)
| indexing by code points is important, then UTF-32 is better. Nobody
| wants to process texts in terms of UTF-16 code units. Nobody wants to
| have surrogate processing sprinkled around the code, and so if one
| accepts an API which extracts variable-width characters, then the API
| could just as well deal with UTF-8, which is better for interoperability.
| UTF-16 makes no sense.
I agree. There also seems to be a hidden assumption in some posts that
character alignment can only be recovered if a string is scanned from
the beginning. This is not the case: character alignment can be
discovered from any octet within a UTF-8 encoded string. The octet
which begins a code point can never be mistaken for the subsequent
octets, which always have the most significant two bits #b10. There
are algorithms (like binary search) which access a string at
approximate locations; the asymptotic running time of such algorithms
will not be impacted by using strings coded in UTF-8.

_______________________________________________
r6rs-discuss mailing list
[EMAIL PROTECTED]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
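[Editorial aside: the resynchronization scan described above can be sketched in a few lines. This is a minimal illustration in Python, not code from the thread; the function name `codepoint_start` is made up for the example.]

```python
def codepoint_start(data: bytes, i: int) -> int:
    """Return the index of the first octet of the UTF-8 code point
    that contains octet i of data."""
    # Continuation octets always have their two most significant bits
    # set to #b10 (i.e. 0b10xxxxxx); lead octets never do.  So from
    # any octet, scanning backward at most 3 positions finds the
    # start of the current code point.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

s = "naïve".encode("utf-8")
# "ï" encodes as the two octets 0xC3 0xAF at indices 2 and 3;
# landing on the continuation octet at index 3 resynchronizes to 2.
assert codepoint_start(s, 3) == 2
assert codepoint_start(s, 1) == 1  # ASCII "a" is its own start
```

Because the backward scan is bounded by a constant (at most three continuation octets), a binary search that probes a UTF-8 string at arbitrary octet positions keeps its O(log n) running time.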
