Hi,

It might be a little late for me to be writing this, since the deadline for R6RS ratification has already passed, but Jens Soegaard advised me to write to this list to report some issues I have with the Unicode and string libraries in R6RS, either for the archive or so that the issues can be resolved. Some of these issues have been discussed before on various mailing lists, including this one, so I'll try not to dwell on those. (In the following list, by "comparison" I mean the functions for <, <=, > and >= on characters and strings.)

1. Section 11.12 suggests that "Implementors should make string-ref run in constant time." To follow that suggestion, only fixed-width encodings such as UTF-32 can be used.

2. There are no explicit provisions for a limited repertoire of characters where resources are limited. It looks like all characters must be supported.

3. Characters and strings are ordered by Unicode scalar value. The base library includes eight comparison operators, and the Unicode library adds eight more. However, these are useful only in very limited circumstances, and they create the misleading impression that they're suitable for collation for humans. (See UTS #10, the Unicode Collation Algorithm, for a more linguistically accurate collation.)

4. The eight case-folded comparison functions (like string-ci<?) appear to be an ad hoc attempt at something closer to appropriate collation, where "abc" comes before "AZZ". However, a better approach would be to use case as a tie breaker. A similar approach is needed for accent marks if the output is to be consumed by humans. The inclusion of these comparisons is misleading and of very limited utility. (Again, see UTS #10.)

5. Case conversion in the Unicode library does not take the locale into account, so a third-party library will have to be used to get correct behavior in the Turkish, Azeri and Lithuanian locales.

6. The functions char-upcase, char-downcase, char-titlecase and char-casefold are inappropriate and incomplete, since they do not incorporate information from SpecialCasing.txt in the Unicode Character Database. They are implementation details of the corresponding operations on strings and should not be exposed in the standard library.

I understand that this library is trying to be a minimal base, but it should also be useful enough for writing basic international applications. With this standard, third-party libraries will be needed for certain functionality to operate correctly, in particular casing operations.
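
As a rough illustration of points 3 to 6, here is a small sketch in Python rather than Scheme, since Python's built-in string methods happen to expose the same distinctions. The word list and the helper name tiebreak_key are made up for this example, and the accent-stripping key is only a crude stand-in for the multi-level comparison of the Unicode Collation Algorithm, not an implementation of it.

```python
import unicodedata

# Hypothetical sample data; "\u00e9cole" is "école" with a precomposed é.
words = ["abc", "AZZ", "Abc", "\u00e9cole", "zebra"]

# Point 3: ordering by scalar value puts all capitals before all
# lowercase letters, and accented letters after "z".
by_code_point = sorted(words)
# ['AZZ', 'Abc', 'abc', 'zebra', 'école']

# Point 4: comparing case-folded strings fixes "abc" vs "AZZ", but
# "abc" and "Abc" now compare equal, and "école" still sorts after "zebra".
by_casefold = sorted(words, key=str.casefold)


def tiebreak_key(s):
    """Crude multi-level key (an assumption for illustration only):
    base letters first, then accents, then case as the final tie breaker."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), decomposed.casefold(), s)


by_tiebreak = sorted(words, key=tiebreak_key)
# ['Abc', 'abc', 'AZZ', 'école', 'zebra']

# Point 6: some case mappings change the length of the string, so a
# function from one character to one character cannot express them.
sharp_s_upper = "\u00df".upper()    # "ß" becomes "SS"
ligature_upper = "\ufb01".upper()   # the "fi" ligature becomes "FI"

# Point 5: a locale-independent mapping always turns "i" into "I",
# whereas Turkish and Azeri expect U+0130 (capital I with dot above).
default_upper_i = "i".upper()
turkish_upper_i = "\u0130"

print(by_code_point)
print(by_casefold)
print(by_tiebreak)
print(sharp_s_upper, ligature_upper, default_upper_i == turkish_upper_i)
```

The third ordering puts "Abc" before "abc" only because the final tie breaker falls back to scalar values; a real collation library would make that choice configurable.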

It's difficult to design a good Unicode library, but each of these problems could be fixed without too much trouble.

Daniel Ehrenberg
_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
