My views on small Scheme and Unicode are developed from the following
principles:
Principle #1: No small Scheme implementation is required to support any
specific Unicode character or repertoire (collection of characters),
with the obvious exception of the ASCII repertoire.
Principle #2: Unicode is the predominant character standard today, and a
small Scheme implementation's treatment of characters must conform to it,
insofar as this does not conflict with Principle #1.
From these principles I draw the following (wearisomely detailed)
conclusions (where "Scheme" means "small Scheme, as proposed by me"):
1) The char->integer procedure must return an exact integer
between 0 and #xD7FF or between #xE000 and #x10FFFF when applied
to a character supported by the implementation and belonging to
the Unicode repertoire. This integer must be the Unicode scalar
value of the character.
This is independent of the implementation's internal
representation. For example, a Scheme that supports a repertoire
of Latin and modern Greek characters only might use the
ISO 8859-7 encoding internally, in which lower-case lambda is
represented as #xEB; but char->integer must still return #x03BB
on that character.
An ASCII-only Scheme satisfies this requirement automatically,
provided it does not deliberately scramble the natural result.
(EBCDIC-based Schemes already have ASCII conversion tables.)
If the implementation supports non-Unicode characters (for example,
ones with bucky bits), then char->integer must return an exact
integer less than 0 or greater than #x10FFFF when applied to
such characters.
2) The integer->char procedure, when applied to an exact integer
that char->integer returns when applied to some character c,
must return c.
An ASCII-only Scheme also satisfies this requirement
automatically, with the same proviso.
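Conclusions 1 and 2 can be modeled roughly as follows. This is a sketch in Python rather than Scheme, so that it can be run as is. The names char_to_integer, integer_to_char, and is_unicode_scalar_value are hypothetical, and the sketch assumes that the Latin-and-Greek implementation described above stores each character as a single ISO 8859-7 byte.

```python
INTERNAL_ENCODING = "iso8859_7"  # assumed internal representation


def is_unicode_scalar_value(n: int) -> bool:
    """True for 0..#xD7FF and #xE000..#x10FFFF (surrogates excluded)."""
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


def char_to_integer(internal_byte: int) -> int:
    """Model of char->integer: internal byte in, Unicode scalar value out."""
    # Bytes that ISO 8859-7 leaves undefined raise UnicodeDecodeError.
    return ord(bytes([internal_byte]).decode(INTERNAL_ENCODING))


def integer_to_char(scalar_value: int) -> int:
    """Model of integer->char: Unicode scalar value in, internal byte out."""
    # Characters outside the supported repertoire raise UnicodeEncodeError.
    return chr(scalar_value).encode(INTERNAL_ENCODING)[0]
```

In this model the internal byte #xEB (lower-case lambda) maps to #x03BB, and integer_to_char maps #x03BB back to #xEB, so the round trip required by conclusion 2 holds.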
3) The char-downcase procedure, given an argument that forms the
uppercase part of a Unicode upper/lower-case pair, must return
the lowercase member of the pair, provided that the character
is supported by the Scheme implementation. Turkic casing pairs
are ignored. If the argument is not the uppercase part of such
a pair, it is returned unchanged.
4) The char-upcase procedure works the same way, mutatis mutandis.
Note that many Unicode lowercase characters don't have uppercase
equivalents.
5) The char-foldcase procedure applies the Unicode simple
case-folding algorithm to its argument, ignoring the Turkic
mappings. Mappings that don't accept or don't produce single
characters are ignored.
In an ASCII-only Scheme, this is equivalent to the char-downcase
procedure. This procedure is an extension to R5RS.
6) The char-ci* procedures behave as if char-foldcase were
applied to their arguments before calling the respective non-ci
procedures.
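The character-level rules in conclusions 3 through 6 can be approximated as below. This is only a sketch in Python, with hypothetical function names. Python exposes the full case mappings rather than the simple ones, so the sketch treats any mapping that yields more than one character as "no mapping"; this matches the rule about ignoring mappings that do not produce single characters, but it can differ from the true simple mappings for the few characters that have both a simple and a full mapping.

```python
def _single_or_unchanged(original: str, mapped: str) -> str:
    """Keep the mapping only if it produced exactly one character."""
    return mapped if len(mapped) == 1 else original


def char_downcase(c: str) -> str:
    return _single_or_unchanged(c, c.lower())


def char_upcase(c: str) -> str:
    return _single_or_unchanged(c, c.upper())


def char_foldcase(c: str) -> str:
    return _single_or_unchanged(c, c.casefold())


def char_ci_eq(a: str, b: str) -> bool:
    """Model of char-ci=?: fold both arguments, then compare."""
    return char_foldcase(a) == char_foldcase(b)
```

For example, char_upcase applied to German sharp s returns it unchanged, because its only uppercase mapping is the two-character sequence "SS". Python's mappings are locale-independent, so the Turkic pairs are not applied.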
7) The procedures char-{alphabetic,numeric,whitespace,upper-case,
lower-case}? return #t if their arguments have the Unicode
properties Alphabetic, Numeric, White_Space, Uppercase, or
Lowercase respectively. Note that many alphabetic characters
(though no ASCII ones) are neither upper nor lower case.
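The remark about alphabetic characters that are neither upper nor lower case can be checked with a small sketch. It is written in Python, and the function name is hypothetical. Note the assumption: Python's str.isupper and str.islower, and the general-category test used here, are not defined in exactly the terms of the Unicode properties named above, so this is an illustration rather than a conforming implementation.

```python
import unicodedata


def is_caseless_letter(c: str) -> bool:
    """True if c is a letter (general category L*) with neither case."""
    is_letter = unicodedata.category(c).startswith("L")
    return is_letter and not c.isupper() and not c.islower()
```

Hebrew alef (U+05D0, general category Lo) is such a character, whereas every ASCII letter is either upper or lower case.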
8) The string-downcase procedure applies the Unicode full
lowercasing algorithm to its argument. This may cause the
result to differ in length from the argument. What is more,
some characters have case-mappings that depend on the surrounding
context. For example, Greek capital sigma normally downcases
to Greek small sigma, but at the end of a word it downcases to
Greek small final sigma instead.
For an ASCII-only Scheme, string-downcase is a straightforward
application of map to char-downcase.
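The difference between full lowercasing and a character-by-character map can be seen in a short sketch. It is in Python, whose str.lower is assumed here to stand in for the full algorithm (it handles the final-sigma context and multi-character results); the function names are hypothetical.

```python
def string_downcase(s: str) -> str:
    """Model of string-downcase using Python's full lowercasing."""
    return s.lower()


def per_char_downcase(s: str) -> str:
    """What an ASCII-only Scheme does: map a char-level downcase over s."""
    return "".join(map(str.lower, s))


WORD = "\u039F\u0394\u039F\u03A3"  # Greek capitals: omicron delta omicron sigma
```

Applied to WORD, string_downcase ends in small final sigma (U+03C2), while per_char_downcase ends in ordinary small sigma (U+03C3), since a single character carries no context. On ASCII input the two functions agree.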
9) The string-upcase and string-foldcase procedures apply the Unicode
full uppercasing and case-folding algorithms respectively, with the
same provisos. String-foldcase is an extension to R5RS.
For an ASCII-only Scheme, these procedures are a straightforward
application of map to char-upcase and char-downcase, respectively.
10) The string-ci* procedures act as if they applied
string-foldcase to their arguments before calling the non-ci
versions.
For an ASCII-only Scheme, this amounts to calling either
char-downcase or char-upcase on each character of each string.
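Conclusions 9 and 10 can be sketched the same way. This is Python, with hypothetical function names, assuming that str.upper and str.casefold stand in for the Unicode full uppercasing and full case-folding algorithms.

```python
def string_upcase(s: str) -> str:
    """Model of string-upcase using Python's full uppercasing."""
    return s.upper()


def string_foldcase(s: str) -> str:
    """Model of string-foldcase using Python's full case folding."""
    return s.casefold()


def string_ci_eq(a: str, b: str) -> bool:
    """Model of string-ci=?: fold both strings, then compare."""
    return string_foldcase(a) == string_foldcase(b)
```

With these definitions, upcasing the six-character German word "straße" yields the seven-character "STRASSE", and string_ci_eq treats the two spellings as equal because both fold to "strasse".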
11) In addition to the identifier characters of the ASCII
repertoire specified by R5RS, Scheme implementations may permit
any additional repertoire of Unicode characters to be employed in
identifiers, provided that each character has a Unicode general
category of Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc,
Po, Sc, Sm, Sk, So, or Co. No non-Unicode characters may be
used in identifiers.
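The category test in conclusion 11 amounts to a set-membership check. A minimal sketch in Python, using the standard unicodedata module; the names ALLOWED_CATEGORIES and may_appear_in_identifier are hypothetical, and the sketch checks only the general-category condition, not the separate R5RS rules for ASCII characters or whether an implementation actually supports the character.

```python
import unicodedata

# The general categories listed in conclusion 11.
ALLOWED_CATEGORIES = frozenset(
    "Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No Pd Pc Po Sc Sm Sk So Co".split()
)


def may_appear_in_identifier(c: str) -> bool:
    """True if c's Unicode general category is one of the permitted ones."""
    return unicodedata.category(c) in ALLOWED_CATEGORIES
```

Under this check a Greek letter or a currency sign qualifies, while a no-break space (Zs) or a left guillemet (Pi) does not.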
12) All Scheme implementations shall permit the sequence
"\x<hexdigits>;" to appear in Scheme identifiers. If the
character with the given Unicode scalar value is supported
by the implementation, this sequence must be replaced
by the corresponding character; if not, it is left alone.
This causes symbol->string not to produce the same string on all
implementations. For example, the hypothetical implementation
above would have (symbol->string '\x3BB;) produce a one-character
string, whereas an ASCII-only Scheme would produce a six-character
string.
I believe this to be tolerable, given that existing R5RS
implementations may return "Foo", "FOO", or "foo" as the value
of (symbol->string 'Foo); the first of these is technically not
R5RS-compliant, but is very common anyway.
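The replacement rule in conclusion 12 can be sketched as a reader-side pass over the identifier text. This is Python, not Scheme, and every name in it (expand_hex_escapes, ascii_only, latin_and_greek) is hypothetical. The "supported" predicate is an assumption standing in for whatever repertoire a given implementation provides; latin_and_greek models the ISO 8859-7 implementation mentioned earlier.

```python
import re
from typing import Callable

_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]+);")


def expand_hex_escapes(text: str, supported: Callable[[str], bool]) -> str:
    """Replace \\x<hexdigits>; by its character when that character is supported."""
    def replace(match: "re.Match[str]") -> str:
        value = int(match.group(1), 16)
        is_scalar = 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF
        if is_scalar and supported(chr(value)):
            return chr(value)
        return match.group(0)  # unsupported: leave the sequence alone
    return _HEX_ESCAPE.sub(replace, text)


def ascii_only(c: str) -> bool:
    return ord(c) < 128


def latin_and_greek(c: str) -> bool:
    try:
        c.encode("iso8859_7")
        return True
    except UnicodeEncodeError:
        return False
```

Given the text \x3BB; the latin_and_greek model yields a one-character string, while the ascii_only model leaves all six characters in place, which is the divergence in symbol->string described above.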
--
John Cowan [email protected] http://ccil.org/~cowan
Heckler: "Go on, Al, tell 'em all you know. It won't take long."
Al Smith: "I'll tell 'em all we *both* know. It won't take any longer."