> >> Code units (whether UTF-8, UTF-16, UTF-32, or whatever) are
> >> bit patterns that are used to encode Unicode scalar values.
> >> As programmers and as language designers, one of our guiding
> >> principles is that bit patterns don't matter except where
> >> they are forced upon us by the external world, typically via
> >> i/o.
> >
> > Or when the abstraction leaks, as string-ref does for UTF-8 and
> > UTF-16.  Do you think that being able to write string-find
> > portably & efficiently is important?
>
> I must've missed it somewhere, so let me ask the stupid question.
> What's the problem with the current draft that prohibits implementing
> string-find portably and efficiently?  All I can find in the archives
> are the following two statements:

UTF-8 and UTF-16 require one or more code units to represent a given
scalar value.  Since the number of code units depends on the scalar
value being encoded, there's no constant-time calculation that maps the
i'th scalar value to the j'th code unit.  If you want the i'th scalar
value in a UTF-8 or UTF-16 string, you have to search for it.  And that,
of course, is what string-ref is: a request for the i'th scalar value
(returned as a character).  A simple string-find would string-ref each
character in a string, and (given only R5.92RS and UTF-8 or UTF-16) each
string-ref would start from scratch, so the search as a whole takes time
quadratic in the length of the string.

There are at least four schools of thought on all of this.

First, I believe that some people think a sufficiently smart compiler
could hide some/many/most of these issues by, for example, caching
information or switching to another encoding on the fly.

Second, I believe that some people think the problem can be resolved or
reduced by adding new abstractions (e.g., string-for-each).

Third, I believe that some people think there's nothing wrong with a
lower-level API (e.g., one that exposes code units); it simply shouldn't
get standardized.

Fourth, some people think that Unicode encodings are inherently leaky
and that a lower-level API should be standardized in order to allow for
portable and efficient string algorithms.

Of course, these positions aren't all mutually exclusive.
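
To make the cost argument concrete, here is a rough sketch. It is in
Python rather than Scheme, and works on raw UTF-8 bytes so that the code
units are visible. All of the names (utf8_seq_len, string_ref,
find_by_ref, find_by_scan) are made up for illustration; none of them
comes from R5.92RS or from any implementation, and the input is assumed
to be well-formed UTF-8.

```python
def utf8_seq_len(lead: int) -> int:
    """Number of code units in the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def string_ref(buf: bytes, i: int) -> str:
    """Hypothetical string-ref over UTF-8: scans from the start, so O(i)."""
    pos = 0
    for _ in range(i):
        pos += utf8_seq_len(buf[pos])
    return buf[pos:pos + utf8_seq_len(buf[pos])].decode("utf-8")


def find_by_ref(buf: bytes, n_chars: int, target: str) -> int:
    """string-find written only in terms of string_ref: each call rescans,
    so the whole search is quadratic in the number of characters."""
    for i in range(n_chars):
        if string_ref(buf, i) == target:
            return i
    return -1


def find_by_scan(buf: bytes, target: str) -> int:
    """The same search in a single pass; this needs access to code units."""
    pos = 0  # index in code units
    i = 0    # index in scalar values
    while pos < len(buf):
        n = utf8_seq_len(buf[pos])
        if buf[pos:pos + n].decode("utf-8") == target:
            return i
        pos += n
        i += 1
    return -1


text = "na\u00efve \u03bb-calculus \U0001d11e clef"
buf = text.encode("utf-8")
print(find_by_ref(buf, len(text), "\u03bb"), find_by_scan(buf, "\u03bb"))
```

Both searches return the same scalar-value index. The difference is that
find_by_ref can be written against string-ref alone, while find_by_scan
depends on seeing the encoding, which is exactly the kind of lower-level
access the third and fourth positions are arguing about.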
