Re: [r6rs-discuss] Strings as codepoint-vectors: bad

Jason Orendorff Sun, 18 Mar 2007 12:25:41 -0800

On 3/16/07, Thomas Lord <[EMAIL PROTECTED]> wrote:

> The generic error is wishing for some "easy way out" that
> makes Unicode as easy to hack as ASCII.   Won't happen.
> Text is just not that simple.   Unicode does a fantastic job of
> making it "... but no simpler".


Obviously I've done a very poor job expressing myself.

There is, as you mentioned elsewhere, a tower here:
 - text
 - grapheme clusters
 - Unicode scalar values
 - code units
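
To make the tower concrete, here is a rough sketch in Python (my choice
for illustration only; the sample string is arbitrary) that counts the
same text at each level.  The grapheme-cluster count is a crude
approximation that just skips combining marks; it is not the full
UAX #29 segmentation algorithm.

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT, followed by MUSICAL SYMBOL G CLEF (U+1D11E)
text = "e\u0301\U0001D11E"

# Python 3 strings index by code point, so len() counts scalar values here.
scalar_values = len(text)

# Code units depend on the encoding form.
utf8_units = len(text.encode("utf-8"))             # 8-bit units
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit units

# Approximation only: treat every non-combining scalar value as the
# start of a new grapheme cluster.
grapheme_clusters = sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(grapheme_clusters, scalar_values, utf16_units, utf8_units)
```

Four different answers to "how long is this string?", one per level of
the tower.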

R6RS presents strings as sequences of Unicode scalar values, as though
(a) nothing much useful can be done with the code units; (b) if the
code units are hidden, implementors can reasonably choose whatever
representation they want; and (c) just hiding code units is very
helpful to programmers.  All three statements are false.

(a) UTF-8 and UTF-16 were designed to facilitate writing efficient
algorithms.  Hiding them hides this facility.  R5.92RS leaves the
programmer with neither (string-find) nor a decent way to implement
it.
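
As a sketch of what I mean, assuming both strings are valid UTF-8:
because lead bytes and continuation bytes never overlap, an ordinary
byte-wise substring search can only match on a character boundary.
The helper name utf8_find below is made up for this example; it is not
part of any R5.92RS library.

```python
def utf8_find(haystack: bytes, needle: bytes) -> int:
    """Byte offset of the first occurrence of needle in haystack, or -1.

    Assumes both arguments are valid UTF-8 and needle is non-empty.
    No decoding is needed: a plain byte search cannot produce a match
    that starts or ends in the middle of an encoded character.
    """
    return haystack.find(needle)


hay = "na\u00efve caf\u00e9".encode("utf-8")   # "naïve café"
pos = utf8_find(hay, "caf\u00e9".encode("utf-8"))

# If a scalar-value index is wanted, it can be recovered afterwards by
# decoding only the prefix before the match.
scalar_index = len(hay[:pos].decode("utf-8"))

print(pos, scalar_index)
```

The byte offset and the scalar-value index differ because "ï" occupies
two code units, but the search itself never had to care.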

(b) Any implementation that chooses to represent strings in UTF-8 or
UTF-16 will have unacceptably bad performance running simple portable
code that uses (string-ref), because each (string-ref) call has to
scan the variable-width encoding from the start of the string, making
it O(N) rather than O(1).
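
Here is roughly what an index-by-scalar-value lookup has to do over a
UTF-8 buffer.  This is only an illustration in Python, assuming valid
UTF-8 and a non-negative index; the function names are hypothetical,
and a real implementation might add caches or side tables.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def utf8_string_ref(buf: bytes, k: int) -> str:
    """Return the k-th scalar value of buf by walking from the start: O(k)."""
    i = 0
    for _ in range(k):
        if i >= len(buf):
            raise IndexError("string index out of range")
        i += utf8_char_len(buf[i])   # skip one whole encoded character
    if i >= len(buf):
        raise IndexError("string index out of range")
    return buf[i:i + utf8_char_len(buf[i])].decode("utf-8")


buf = "a\u00e9\u20ac\U0001D11Ez".encode("utf-8")
chars = [utf8_string_ref(buf, k) for k in range(5)]
```

The innocent-looking loop on the last line restarts the scan for every
index, so it is quadratic in the length of the string.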

(c) If you know Unicode, it's not hard to work with code units.  UTF-8
and UTF-16 were explicitly designed with this in mind.  If you don't
know Unicode, you're unlikely to write correct code on top of the
R5.92RS libraries anyway.  Hiding code units eliminates exactly one
pitfall--among *many*.
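
Two of the pitfalls that remain even when you only ever see scalar
values, sketched in Python with an example string of my own choosing:
canonically equivalent strings compare unequal, and reversing by
scalar value detaches a combining mark from its base character.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single scalar value, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same text to a reader, different sequences of scalar values.
equal_raw = composed == decomposed

# Comparison only behaves after normalizing both sides.
equal_nfc = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# Reversing scalar value by scalar value puts the accent first, where
# it no longer has the 'e' to attach to.
reversed_decomposed = decomposed[::-1]

print(equal_raw, equal_nfc, [hex(ord(c)) for c in reversed_decomposed])
```

Neither problem has anything to do with code units.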

There's no "easy way out" aspect to it.  The string abstraction in
R5.92RS simply doesn't make sense to me as an abstraction.

-j

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
