Re: [r6rs-discuss] Re: [Formal] formal comment (ports, characters, strings, Unicode)

MichaelL Mon, 19 Mar 2007 16:18:24 -0800

> >> Code units (whether UTF-8, UTF-16, UTF-32, or whatever) are
> >> bit patterns that are used to encode Unicode scalar values.
> >> As programmers and as language designers, one of our guiding
> >> principles is that bit patterns don't matter except where
> >> they are forced upon us by the external world, typically via
> >> i/o.
> >
> > Or when the abstraction leaks, as string-ref does for UTF-8 and
> > UTF-16. Do you think that being able to write string-find portably &
> > efficiently is important?
> 
> I must've missed it somewhere, so let me ask the stupid question.
> What's the problem with the current draft that prohibits implementing
> string-find portably and efficiently?  All I can find in the archives
> are the following two statements:


UTF-8 and UTF-16 require one or more code units to represent a given
scalar value. Since the number of code units depends on the scalar value
being encoded, there's no constant-time formula that maps the i'th scalar
value to the j'th code unit. If you want the i'th scalar value in a UTF-8
or UTF-16 string you have to scan for it from a known position, such as
the start of the string. And that, of course, is what string-ref is: a
request for the i'th scalar value (returned as a character).
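
To make the scan concrete, here is a minimal sketch in Python (not Scheme, 
and not any implementation's actual code) that models a string as a buffer 
of UTF-8 code units. The names utf8_seq_len and string_ref are hypothetical, 
and the input is assumed to be well-formed UTF-8.

```python
def utf8_seq_len(lead: int) -> int:
    """Number of code units in the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx
    if lead >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def string_ref(buf: bytes, i: int) -> str:
    """Return the i'th scalar value of `buf` (assumed well-formed UTF-8).

    There is no way to jump straight to the right code unit: every
    earlier scalar value has to be stepped over first, so the cost
    grows with i.
    """
    if i < 0:
        raise IndexError(i)
    pos = 0
    for _ in range(i):
        if pos >= len(buf):
            raise IndexError(i)
        pos += utf8_seq_len(buf[pos])
    if pos >= len(buf):
        raise IndexError(i)
    n = utf8_seq_len(buf[pos])
    return buf[pos:pos + n].decode("utf-8")


# Scalar values encoded in 1, 2, 3, 4 and 1 code units respectively.
sample = "a\u03bb\u20ac\U0001F600z".encode("utf-8")
```

With this sample, string_ref(sample, 4) has to step over four sequences of 
different lengths before it reaches the final "z".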

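A simple string-find would string-ref each character in a string, and
(given only R5.92RS and UTF-8 or UTF-16) each string-ref would start from
scratch. Absent any caching by the implementation, the search as a whole
then takes time quadratic in the length of the string rather than linear.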
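
As a rough illustration of the cost, the following Python sketch (again a 
model of a UTF-8 buffer, not Scheme) counts how many sequences get skipped 
by a string-find built on string-ref versus one that keeps a code-unit 
position between characters. All the names here (seq_len, string_length, 
string_ref, naive_find, cursor_find) and the step counter are hypothetical, 
and the input is assumed to be well-formed UTF-8.

```python
def seq_len(lead):
    # Code units in the UTF-8 sequence introduced by this lead byte
    # (only ever called on lead bytes of well-formed input).
    return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4


def string_length(buf):
    # Scalar values = code units that are not continuation bytes (10xxxxxx).
    return sum(1 for b in buf if b & 0xC0 != 0x80)


def string_ref(buf, i, steps):
    # Starts from scratch on every call; steps[0] counts skipped sequences.
    pos = 0
    for _ in range(i):
        pos += seq_len(buf[pos])
        steps[0] += 1
    return buf[pos:pos + seq_len(buf[pos])].decode("utf-8")


def naive_find(buf, ch, steps):
    # string-find written only in terms of string-ref.
    for i in range(string_length(buf)):
        if string_ref(buf, i, steps) == ch:
            return i
    return -1


def cursor_find(buf, ch, steps):
    # Same search, but remembers the code-unit position between characters.
    pos = i = 0
    while pos < len(buf):
        n = seq_len(buf[pos])
        if buf[pos:pos + n].decode("utf-8") == ch:
            return i
        pos += n
        i += 1
        steps[0] += 1
    return -1


# 100 two-unit characters followed by the one we are looking for.
text = ("\u03bb" * 100 + "z").encode("utf-8")
naive_steps, cursor_steps = [0], [0]
naive_index = naive_find(text, "z", naive_steps)
cursor_index = cursor_find(text, "z", cursor_steps)
```

Both searches return index 100, but the string-ref version skips 
0 + 1 + ... + 100 = 5050 sequences while the cursor version skips 100. The 
cursor version is only possible if something below string-ref is visible, 
which is what the disagreement is about.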

There are at least four schools of thought on all of this. First, I 
believe that some people think a sufficiently smart compiler could hide 
some/many/most of these issues by, for example, caching information or 
switching to another encoding on the fly. Second, I believe that some 
people think the problem can be resolved or reduced by adding new 
abstractions--e.g., string-for-each. Third, I believe that some people think
there's nothing wrong with a lower-level API--e.g., one that exposes code
units--it simply shouldn't get standardized. Fourth, some people think 
that Unicode encodings are inherently leaky and that a lower-level API 
should be standardized in order to allow for portable and efficient string 
algorithms. Of course, these positions aren't all mutually exclusive.


_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
