[r6rs-discuss] Unicode issues

Daniel Ehrenberg Tue, 28 Aug 2007 07:49:28 -0700

Hi,

It might be a little late for me to be writing this, considering the
deadline for R6RS ratification has already passed, but Jens Soegaard
advised me to write to this list to report some issues I have with the
Unicode and string libraries in R6RS, either for the archive or so that the
issues can be resolved. Some of these issues have been discussed
before, on various mailing lists including this one, so I'll try not
to dwell on those. (In the following list, by "comparison" I mean the
functions for < <= > and >= on characters and strings.)


1. To follow the suggestion in section 11.12 ("Implementors should
make string-ref run in constant time."), only fixed-width encodings,
such as UTF-32, can be used.

2. There are no explicit provisions for a limited repertoire of
characters, where resources are limited. It looks like all characters
must be supported.

3. The ordering of characters and strings is done by the Unicode
scalar values. In the base library, eight comparison operators are
included, and the Unicode library adds eight more. However, these are
only useful in very limited circumstances and create the misleading
impression that they're suitable for collation for humans. (See UTS #10,
the Unicode Collation Algorithm, for a more linguistically accurate
collation.)

4. The eight case folded comparison functions (like string-ci<?)
appear to be an ad-hoc attempt at something closer to appropriate
collation, where "abc" comes before "AZZ". However, a better approach
would be to use case as a tie breaker. A similar approach is needed
for accent marks, if the output is to be consumed by humans. The
inclusion of these comparisons is misleading and of very limited
utility. (Again, see UTS #10.)

5. Case conversion in the Unicode library does not incorporate locale,
so a third-party library will have to be used to provide correct
behavior in the Turkish, Azeri and Lithuanian locales.

6. The functions char-upcase, char-downcase, char-titlecase and
char-casefold are inappropriate and incomplete, since they do not
incorporate information from SpecialCasing.txt in the Unicode
Character Database. They are implementation details of the
corresponding operations on strings, and should not be exposed in the
standard library.
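
To make items 1, 3, 4 and 6 concrete, here is a rough sketch. It is
written in Python rather than Scheme, and Python's built-in string
methods merely stand in for the R6RS procedures, so treat it as an
illustration under those assumptions, not as what any R6RS
implementation does. All function names below are made up for this
example. The tie-breaking sort is a toy, not an implementation of the
Unicode Collation Algorithm.

```python
def utf8_ref(data: bytes, k: int) -> str:
    """Item 1: fetch the k-th scalar value from UTF-8 encoded bytes.

    With a variable-width encoding the byte offset of character k is not
    known in advance, so this scans from the start: O(k), not O(1).
    """
    seen = -1
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a character starts here
            seen += 1
            if seen == k:
                end = i + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[i:end].decode("utf-8")
    raise IndexError(k)


def codepoint_sort(words):
    """Item 3: order by Unicode scalar values (Python compares str this way)."""
    return sorted(words)


def casefold_sort(words):
    """Item 4: order by case-folded form, roughly the string-ci<? ordering."""
    return sorted(words, key=str.casefold)


def tiebreak_sort(words):
    """Item 4: compare case-insensitively first, use case only to break ties."""
    return sorted(words, key=lambda w: (w.casefold(), w))


def upcase_per_char(s):
    """Item 6: upcase one character at a time, keeping only 1:1 mappings.

    A procedure that returns a single character cannot express a mapping
    such as U+00DF (sharp s) -> "SS", so that character is left unchanged.
    """
    out = []
    for ch in s:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


if __name__ == "__main__":
    words = ["abc", "AZZ", "Abc", "aZZ"]
    print(codepoint_sort(words))
    print(casefold_sort(words))
    print(tiebreak_sort(words))
    print(upcase_per_char("stra\u00dfe"), "stra\u00dfe".upper())
```

With the word list above, codepoint_sort gives ["AZZ", "Abc", "aZZ",
"abc"], while tiebreak_sort gives ["Abc", "abc", "AZZ", "aZZ"]. The
per-character upcase of "straße" yields "STRAßE", whereas the
whole-string operation yields "STRASSE". Item 5 is not covered: like
the R6RS procedures, Python's upper() is locale-independent, so "i"
becomes "I" even where Turkish would expect U+0130.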

I understand that this library is trying to be a minimal base, but it
also should be useful enough to write basic international
applications. With this standard, third-party libraries will be needed
for certain functionality to operate correctly, in particular in
casing operations. It's difficult to design a good Unicode library,
but each of these problems could be fixed without too much trouble.

Daniel Ehrenberg
