Re: [r6rs-discuss] Case sensitivity

John Cowan Tue, 24 Feb 2009 07:05:08 -0800

Eli Barzilay scripsit:

> In any case, if you remember, I didn't join this thread from this
> side.  What always disturbed me more was the arbitrary decision to treat
> the case bit differently than many other similar bits.  In the ASCII
> world that Scheme was born to, this was a very minor wart.  (I don't
> know the details of punched cards, but I'd guess that Lisp was born in
> a world that didn't have that bit.)


Quite so.  The IBM 704 (released in 1954) used a 6-bit character code
that implemented only upper-case characters.  Lisp was born in 1958
on that hardware.  IBM only came out with EBCDIC, which provided full
support for upper and lower case, in 1963-64, at the same time that
ASCII was standardized.  (Ironically, IBM strongly supported ASCII,
but wasn't able to cut over its entire production line of online and
offline peripherals to support it before System/360 was released --
so System/360 and all its successors are EBCDIC-based to this day.)
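Fortran was also born on the 704, as was MUSIC, the Bell Labs
sound-synthesis program.  (Hal's singing in 2001: A Space Odyssey was
inspired by a later Bell Labs demonstration of computer-sung "Daisy
Bell" on an IBM 7094.)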

> But these days ignoring something like unicode is no longer an option.
> Given this, one solution is to keep the symmetry: the language is
> still case insensitive, but it's done with unicode folding rules or
> something similar -- so all similar bits have the same status.  That
> would be, IMO, the proper way of keeping case-insensitivity.  But
> there is a big problem here -- unicode has versions, and the rules are
> likely to change, which means that code can break as a result.  The
> fundamental problem (again, IMO) here is that it's a redundant mixture
> of cultural rules with a formal language.  For all I know, it might be
> decided tomorrow that "a" and "A" are no longer related, or that the
> capital form of "a" is "A" or "$,1,p" or whatever.  I obviously don't
> think that this will ever happen -- but that is ultimately an issue of
> human culture.

Actually, it won't.  Unicode (which is designed to last the centuries,
adding new characters but not changing old ones) has very specific
stability policies on what is guaranteed about future versions at
http://www.unicode.org/policies/stability_policy.html .  In particular:

Case Folding Stability

Applicable Version: Unicode 5.0+

Caseless matching of Unicode strings used for identifiers is stable.

Case folding stability ensures that identifiers created in different
versions of Unicode can be reliably matched in a case-insensitive
manner. For more information on identifiers see UAX #31: Identifier and
Pattern Syntax. Identifiers commonly exclude compatibility decomposable
characters; therefore this policy formally applies only to strings
normalized with NFKC. The toCasefold() operation used for caseless
matching is the full case folding defined by rule R4 under "Default
Case Conversion" in Section 3.13, Default Case Algorithms of the Unicode
Standard.

The formal statement of this policy is:

    For each string S containing characters only from a given Unicode
    version, toCasefold(toNFKC(S)) under that version is identical to
    toCasefold(toNFKC(S)) under any later version of Unicode.
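
As an aside, the quantity that the policy declares stable can be sketched
in a few lines.  This is only an illustration in Python, not anything
proposed for Scheme: the function names identifier_key and
same_identifier are made up for the example, and it assumes that Python's
str.casefold() (full case folding) and unicodedata.normalize("NFKC", ...)
are acceptable stand-ins for toCasefold and toNFKC.  The results reflect
whichever Unicode version the Python build ships with
(unicodedata.unidata_version).

```python
import unicodedata


def identifier_key(name: str) -> str:
    """Return toCasefold(toNFKC(name)), the form the policy calls stable.

    Hypothetical helper: NFKC normalization first, then full case folding.
    """
    return unicodedata.normalize("NFKC", name).casefold()


def same_identifier(a: str, b: str) -> bool:
    """Caseless comparison of two identifiers via their stable keys."""
    return identifier_key(a) == identifier_key(b)


if __name__ == "__main__":
    # Unicode version of the tables this interpreter was built with.
    print(unicodedata.unidata_version)
    print(same_identifier("Lambda", "LAMBDA"))
    print(same_identifier("Straße", "STRASSE"))
```

A reader that interned symbols under identifier_key would, under the
policy, keep matching the same identifiers after a Unicode upgrade, as
long as the identifiers use only characters assigned in the older version.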

Case Pair Stability

Applicable Version: Unicode 5.0+

Two distinct assigned characters form a case pair when the first character
of the pair is the full uppercase of the second character, and the second
character is the full lowercase of the first character. (Full upper- and
lowercase are defined in Section 3.13 of the Unicode Standard.)

If two characters form a case pair in a version of Unicode, they will
remain a case pair in each subsequent version of Unicode.

If two characters do not form a case pair in a version of Unicode,
they will never become a case pair in any subsequent version of Unicode.

More formally, for given versions V and U of Unicode, and any two distinct
characters X and Y that are both assigned according to both V and U:

    toLowercase_V(X) = Y AND toUppercase_V(Y) = X

if and only if

    toLowercase_U(X) = Y AND toUppercase_U(Y) = X

Note that these conditions apply to two existing, distinct assigned
characters. A character that is not part of a case pair could become part
of one if the new case pair is formed at the time of the addition of a
new character to Unicode. For example, a new capital version of U+028D
LATIN SMALL LETTER TURNED W could be added in the future to form a new
case pair.
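
The case-pair definition above can be checked mechanically.  The
following is a rough Python sketch, not part of the policy: is_case_pair
is a name invented for the example, and it assumes that str.upper() and
str.lower() apply the full case mappings of the interpreter's Unicode
version.

```python
def is_case_pair(upper: str, lower: str) -> bool:
    """Check the case-pair condition for two single characters.

    Hypothetical helper: `upper` must be the full uppercase of `lower`,
    and `lower` must be the full lowercase of `upper`.
    """
    if len(upper) != 1 or len(lower) != 1 or upper == lower:
        return False  # the definition covers two distinct characters only
    return lower.upper() == upper and upper.lower() == lower


if __name__ == "__main__":
    print(is_case_pair("A", "a"))
    # U+1E9E lowercases to U+00DF, but U+00DF uppercases to "SS",
    # so the round trip fails in one direction.
    print(is_case_pair("\u1e9e", "\u00df"))
```

The round trip in both directions is what keeps one-way mappings such as
U+212A KELVIN SIGN (which lowercases to "k") from counting as pairs.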

-- 
Possession is said to be nine points of the law,                John Cowan
but that's not saying how many points the law might have.       [email protected]
        --Thomas A. Cowan (law professor and my father)
