Bryan C Warnock <[EMAIL PROTECTED]> writes:

> Some additional stuff to ponder over, and maybe Unicode addresses these
> - I haven't been able to read *all* the Unicode stuff yet.  (And, yes,
> Simon, you will see me in class.)

> Some languages don't have upper or lower case.  Are tests and
> translations on caseless characters true or false?  (Or undefined?)

Caseless characters should be guaranteed unchanged by conversion to upper
or lower case, IMO.  Case is a normative property of characters in
Unicode, so case mappings should actually be pretty well-defined.  Note
that there are actually three cases in Unicode (upper, lower, and title
case), since some characters require the third distinction (the Dz
digraph, U+01F2, is the usual example).
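
As a rough illustration of the three cases, here is a minimal sketch in
Python rather than Perl, relying on Python's built-in string methods.  The
helper names case_forms and is_caseless are made up for this example, and
the results reflect how Python implements the Unicode mappings, not
anything Perl is committed to.

```python
import unicodedata

# The three "DZ" digraph code points, one per Unicode case.
DZ_UPPER = "\u01F1"  # LATIN CAPITAL LETTER DZ
DZ_TITLE = "\u01F2"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z
DZ_LOWER = "\u01F3"  # LATIN SMALL LETTER DZ


def case_forms(ch):
    """Return the (upper, lower, title) mappings of a one-character string."""
    return ch.upper(), ch.lower(), ch.title()


def is_caseless(ch):
    """True if none of the three case mappings changes the character."""
    return all(form == ch for form in case_forms(ch))


def code_points(s):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


if __name__ == "__main__":
    # U+4E2D is a CJK ideograph, used here as a caseless character.
    for ch in (DZ_LOWER, "a", "\u4E2D"):
        upper, lower, title = case_forms(ch)
        print(code_points(ch), unicodedata.category(ch),
              "upper:", code_points(upper),
              "lower:", code_points(lower),
              "title:", code_points(title),
              "caseless:", is_caseless(ch))
```

In this sketch the caseless ideograph comes back unchanged from all three
mappings, and Python answers the "true or false?" question by returning
False from both isupper() and islower() for it.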

> Should the same Unicode character, when used in two different languages,
> be string equivalent?

The way to start solving this whole problem is probably through
normalization.  Unicode defines two kinds of equivalence, canonical and
compatibility, and the compatibility one collapses more similar characters
than the canonical one.  Canonical normalization preserves formatting
information while compatibility normalization loses it.  (The best example
of how they differ is that the first leaves the ffi ligature alone and the
second breaks it down into three separate characters.)  Perl should allow
programmers to choose their preferred normalization scheme or none at all.

(There are really four normalization forms: in two of them, NFD and NFKD,
you leave things fully decomposed, and in the other two, NFC and NFKC, you
recompose characters as much as possible.)
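
To make the difference concrete, here is a minimal sketch in Python (not
Perl) using the standard unicodedata module and the four form names NFC,
NFD, NFKC, and NFKD.  The helper normalized_equal and its "form=None means
no normalization" convention are assumptions for illustration only; they
show one way a programmer-selectable scheme could look, not a proposed
Perl interface.

```python
import unicodedata

# The four Unicode normalization forms.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

FFI_LIGATURE = "\uFB03"   # LATIN SMALL LIGATURE FFI
E_ACUTE = "\u00E9"        # precomposed e with acute accent
E_PLUS_ACCENT = "e\u0301" # "e" followed by COMBINING ACUTE ACCENT


def normalized_equal(a, b, form=None):
    """Compare two strings under the chosen normalization form.

    form=None compares the raw code points, i.e. no normalization at all.
    """
    if form is None:
        return a == b
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    for form in (None,) + FORMS:
        print(form,
              "ligature == 'ffi':",
              normalized_equal(FFI_LIGATURE, "ffi", form),
              "| precomposed == decomposed:",
              normalized_equal(E_ACUTE, E_PLUS_ACCENT, form))
```

Under the two compatibility forms the ligature compares equal to the three
plain letters; under the canonical forms it does not, while the two
spellings of the accented e compare equal under all four forms and differ
only when no normalization is applied.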

-- 
Russ Allbery ([EMAIL PROTECTED])             <http://www.eyrie.org/~eagle/>
