[r6rs-discuss] Proposed features for small Scheme, part 3: Unicode

John Cowan Mon, 07 Sep 2009 20:49:08 -0700

My views on small Scheme and Unicode are developed from the following
principles:


Principle #1: No small Scheme implementation is required to support any
specific Unicode character or repertoire (collection of characters),
with the obvious exception of the ASCII repertoire.

Principle #2: Unicode is the predominant character standard today, and a
small Scheme implementation's treatment of characters must conform to it,
insofar as this does not conflict with Principle #1.

From these principles I draw the following (wearisomely detailed)
conclusions (where "Scheme" means "small Scheme, as proposed by me"):

        1) The char->integer procedure must return an exact integer
        between 0 and #xD7FF or between #xE000 and #x10FFFF when applied
        to a character supported by the implementation and belonging to
        the Unicode repertoire.  This integer must be the Unicode scalar
        value of the character.

        This is independent of the implementation's internal
        representation.  For example, a Scheme that supports a repertoire
        of Latin and modern Greek characters only might use the
        ISO 8859-7 encoding internally, in which lower-case lambda is
        represented as #xEB; but char->integer must still return #x03BB
        on that character.

        An ASCII-only Scheme satisfies this requirement automatically,
        provided it does not deliberately scramble the natural result.
        (EBCDIC-based Schemes already have ASCII conversion tables.)

        If the implementation supports non-Unicode characters (ones
        with bucky bits, e.g.), then char->integer must return an exact
        integer less than 0 or greater than #x10FFFF when applied to
        such characters.
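
The ranges and the ISO 8859-7 example in (1) can be checked mechanically.
What follows is a rough model in Python rather than Scheme, chosen only
because its standard codecs make the example easy to reproduce.  The
function names are invented for illustration and are not part of the
proposal.

```python
def is_unicode_scalar_value(n):
    """True if n lies in 0..#xD7FF or #xE000..#x10FFFF (surrogates excluded)."""
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


def char_to_integer(internal_byte):
    """Hypothetical char->integer for a Scheme that stores ISO 8859-7 bytes."""
    # Map the internal byte to its Unicode character, then take the scalar
    # value, so the internal encoding never leaks out to the caller.
    return ord(bytes([internal_byte]).decode("iso8859_7"))


# Lower-case lambda is #xEB internally but must be reported as #x03BB.
lambda_value = char_to_integer(0xEB)
```

Under this model a non-Unicode character would be reported as any
integer for which is_unicode_scalar_value returns false.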

        2) The integer->char procedure, when applied to an exact integer
        that char->integer returns when applied to some character c,
        must return c.

        An ASCII-only Scheme also satisfies this requirement
        automatically, with the same proviso.

        3) The char-downcase procedure, given an argument that forms the
        uppercase part of a Unicode upper/lower-case pair, must return
        the lowercase member of the pair, provided that the character
        is supported by the Scheme implementation.  Turkic casing pairs
        are ignored.  If the argument is not the uppercase part of such
        a pair, it is returned.

        4) The char-upcase procedure works the same way, mutatis mutandis.
        Note that many Unicode lowercase characters don't have uppercase
        equivalents.
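
A crude Python approximation of (3) and (4), for illustration only.
Python exposes the full case mappings rather than the one-to-one pairs,
so this sketch accepts a mapping only when it yields exactly one
character.  It therefore misses a few pairs whose full mapping is longer
than one character.  The function names are assumptions made here.

```python
def char_upcase(c):
    # Keep the mapping only if it is a single character; otherwise
    # return the argument unchanged, as (3) and (4) require for
    # characters that are not part of an upper/lower-case pair.
    u = c.upper()
    return u if len(u) == 1 else c


def char_downcase(c):
    l = c.lower()
    return l if len(l) == 1 else c


paired = char_upcase("\u03bb")      # small lambda has a capital partner
unpaired = char_upcase("\u00df")    # sharp s has no single-character uppercase
```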

        5) The char-foldcase procedure applies the Unicode simple
        case-folding algorithm to its argument, ignoring the Turkic
        mappings.  Mappings that don't accept or don't produce single
        characters are ignored.

        In an ASCII-only Scheme, this is equivalent to the char-downcase
        procedure.  This procedure is an extension to R5RS.

        6) The char-ci* procedures behave as if char-foldcase were
        applied to their arguments before calling the respective non-ci
        procedures.

        7) The procedures char-{alphabetic,numeric,whitespace,upper-case,
        lower-case}? return #t if their arguments have the Unicode
        properties Alphabetic, Numeric, White_Space, Uppercase, or
        Lowercase respectively.  Note that many alphabetic characters
        (though no ASCII ones) are neither upper nor lower case.
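
The remark in (7) about caseless alphabetic characters is easy to
observe.  The Python sketch below is only an approximation: Python's
isalpha is defined in terms of general categories, not the Alphabetic
property itself, though the two agree for the characters used here.

```python
import string

aleph = "\u05d0"   # HEBREW LETTER ALEF: alphabetic, but has no case

# (alphabetic?, upper-case?, lower-case?) for a caseless letter
aleph_props = (aleph.isalpha(), aleph.isupper(), aleph.islower())

# Every ASCII letter, by contrast, is either upper or lower case.
ascii_all_cased = all(ch.isupper() or ch.islower()
                      for ch in string.ascii_letters)
```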

        8) The string-downcase procedure applies the Unicode full
        lowercasing algorithm to its argument.  This may cause the
        result to differ in length from the argument.  What is more,
        some characters have case-mappings that depend on the surrounding
        context.  For example, Greek capital sigma normally downcases
        to Greek small sigma, but at the end of a word it downcases to
        Greek small final sigma instead.

        For an ASCII-only Scheme, string-downcase is a straightforward
        application of map to char-downcase.
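
Both effects described in (8) can be seen in Python, whose str.lower
happens to implement the Unicode full lowercasing rules, including the
final-sigma context rule.  This is an illustration in another language,
not a statement about any Scheme implementation.

```python
word = "\u039f\u0394\u039f\u03a3"    # capital omicron, delta, omicron, sigma
lowered = word.lower()               # the word-final sigma becomes U+03C2

lone = "\u03a3".lower()              # a sigma with no word before it: U+03C3

dotted = "\u0130".lower()            # capital I with dot above grows to 2 chars

# For ASCII text, mapping character by character gives the same answer.
ascii_mapped = "".join(map(str.lower, "Small Scheme"))
```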

        9) The string-upcase and string-foldcase procedures apply the
        Unicode full uppercasing and case-folding algorithms
        respectively, with the same provisos.  String-foldcase is an
        extension to R5RS.

        For an ASCII-only Scheme, these procedures are a straightforward
        application of map to char-upcase and char-downcase, respectively.
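
As a hedged illustration of (9) in Python (not Scheme): str.upper and
str.casefold apply the full uppercasing and case-folding mappings, and
the German sharp s shows the change in length.

```python
upper = "stra\u00dfe".upper()        # sharp s uppercases to "SS"
folded = "Stra\u00dfe".casefold()    # sharp s folds to "ss"

# ASCII-only text can be handled one character at a time.
ascii_upper = "".join(map(str.upper, "small scheme"))
ascii_folded = "".join(map(str.lower, "Small Scheme"))
```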

        10) The string-ci* procedures act as if they applied
        string-foldcase to their arguments before calling the non-ci
        versions.

        For an ASCII-only Scheme, this amounts to calling either
        char-downcase or char-upcase on each character of each string.
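
A minimal Python model of the rule in (10), assuming that full case
folding is what string-foldcase provides.  The names string_ci_equal and
string_ci_less are invented here to stand in for string-ci=? and
string-ci<?.

```python
def string_ci_equal(a, b):
    # Fold both arguments, then defer to the case-sensitive comparison.
    return a.casefold() == b.casefold()


def string_ci_less(a, b):
    return a.casefold() < b.casefold()


same = string_ci_equal("Stra\u00dfe", "STRASSE")
```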

        11) In addition to the identifier characters of the ASCII
        repertoire specified by R5RS, Scheme implementations may permit
        any additional repertoire of Unicode characters to be employed in
        identifiers, provided that each character has a Unicode general
        category of Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc,
        Po, Sc, Sm, Sk, So, or Co.  No non-Unicode characters may be
        used in identifiers.
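
The category test in (11) could be sketched as follows, in Python with
its unicodedata module.  The function name is an assumption, and the
check covers only the additional (non-ASCII) repertoire; ASCII
identifier characters remain governed by R5RS.

```python
import unicodedata

# General categories listed in (11).
ALLOWED_CATEGORIES = {
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pd", "Pc", "Po", "Sc", "Sm", "Sk", "So", "Co",
}


def category_permits(ch):
    """True if a non-ASCII character's general category is on the list."""
    return unicodedata.category(ch) in ALLOWED_CATEGORIES


greek_ok = category_permits("\u03bb")     # Ll: GREEK SMALL LETTER LAMDA
arrow_ok = category_permits("\u2192")     # Sm: RIGHTWARDS ARROW
nbsp_ok = category_permits("\u00a0")      # Zs: NO-BREAK SPACE
quote_ok = category_permits("\u00ab")     # Pi: LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
```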

        12) All Scheme implementations shall permit the sequence
        "\x<hexdigits>;" to appear in Scheme identifiers.  If the
        character with the given Unicode scalar value is supported
        by the implementation, this sequence must be replaced
        by the corresponding character; if not, it is left alone.

        This causes symbol->string not to produce the same string on all
        implementations.  For example, the hypothetical implementation
        above would have (symbol->string '\x3BB;) produce a one-character
        string, whereas an ASCII-only Scheme would produce a six-character
        string.

        I believe this to be tolerable, given that existing R5RS
        implementations may return "Foo", "FOO", or "foo" as the value
        of (symbol->string 'Foo); the first of these is technically not
        R5RS-compliant, but is very common anyway.
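
The replacement rule in (12) can be modelled as below.  This is a Python
sketch, not a reader for any actual Scheme: read_identifier and the two
repertoire predicates are hypothetical, and the Latin-plus-Greek
repertoire is only roughly approximated by code point ranges.

```python
import re

ESCAPE = re.compile(r"\\x([0-9A-Fa-f]+);")


def read_identifier(text, supported):
    """Replace \\x<hexdigits>; escapes whose character is supported."""
    def replace(match):
        value = int(match.group(1), 16)
        is_scalar = 0 <= value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF
        if is_scalar and supported(chr(value)):
            return chr(value)
        return match.group(0)          # unsupported: leave the escape alone
    return ESCAPE.sub(replace, text)


def ascii_only(ch):
    return ord(ch) < 0x80


def latin_and_greek(ch):
    # Rough stand-in for the hypothetical Latin and modern Greek repertoire.
    return ord(ch) < 0x100 or 0x370 <= ord(ch) <= 0x3FF


greek_name = read_identifier(r"\x3BB;", latin_and_greek)
ascii_name = read_identifier(r"\x3BB;", ascii_only)
```

Here greek_name is a one-character string and ascii_name keeps all six
characters of the escape, matching the symbol->string example above.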

-- 
John Cowan    [email protected]    http://ccil.org/~cowan
Heckler: "Go on, Al, tell 'em all you know.  It won't take long."
Al Smith: "I'll tell 'em all we *both* know.  It won't take any longer."
