Jim Jewett writes:

> On 6/5/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote:
>
> > It seems to me that what UAX#31 is saying is "Distinguishing (or not)
> > between 0035 DIGIT 3 and 2075 SUPERSCRIPT 3 should be
> > equivalent to distinguishing (or not) between LATIN CAPITAL
> > LETTER A and LATIN SMALL LETTER A." I don't know that
> > I agree (or disagree) in principle.
>
> So effectively, they consider "a" and "A" to be presentational variants.

Well, no, they're pretty explicit that they have semantic content, as
do superscripts. This is different from the Arabic initial, medial,
and final forms, ligatures, the Croatian digraphs, and the Japanese
double-byte ASCII, where there is no semantic content (not even word
division for Arabic, AFAIK); their use is either required by "the
rules" (for Arabic) or 100% at the discretion of the user (ASCII
variants).

> In some languages, certain presentational variants are used depending
> on word position. I think the ID_START property does exclude letters
> that cannot appear in an initial position, but putting a final
> character in the middle or vice versa would still be wrong.

Good point. I'm going to interview some Arabic speakers who I believe
have some programming skills; I'll add that to the list.

> If identifiers are built up in the equivalent of
>
> handler="do_" + name

I think this is pretty likely, and one of the attractions of languages
like Python.

> The folding rules do say that it is OK (even good) to exclude certain
> characters from certain foldings; I think we could preserve case
> (including title-case?) as the only presentational variant we
> recognize.

AFAICS from looking at the V2 table, case is an *analogy* used by
UAX#31 to clarify when NFKC is useful. NFKC itself does not fold
case; it is considered appropriate if you have a language that folds
case anyway.

> http://www.unicode.org/versions/corrigendum3.html suggests that many
> of the Hangul are either pronunciation guide variants or even exact
> duplicates (that were presumably missed when the canonicalization was
> frozen?)

I'll have to ask some Koreans what they would use.

> """It is recommended that all Arabic presentation forms be excluded
> from identifiers in any event, although only a few of them must be
> excluded for normalization to guarantee identifier closure."""

Cool. I'll ask that, too.

> Depends on what you mean by technical symbols.

E.g., the letterlike symbols (DEGREE CELSIUS), the number forms (ROMAN
NUMERAL ONE), and the APL set (2336--237A) in the BMP.

[[ I really need to put together some tools to access that database
from XEmacs.... ]]

> IMHO, many of them are in fact listed as ID characters. The math
> versions (generally 1D400 - 1DC7B) are included. But
> http://unicode.org/reports/tr39/data/xidmodifications.txt suggests
> excluding them again.

I'm not really worried about people using characters outside the BMP
very often, any more than I worry about people embedding a comma in
LISP identifiers or file names (e.g., RCS ,v), unless they use a script
only lately admitted to Unicode, or they just wish to tempt the wrath
of the gods. The former will not have a problem, and the latter can
look out for themselves, I'm sure.
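
For anyone who wants to poke at the normalization question directly,
here is a minimal sketch using the unicodedata module from Python's
standard library. The helper name nfkc and the sample characters are
my own picks for illustration, chosen only to mirror the categories
mentioned in this thread; nothing here comes from UAX#31 itself.

```python
import unicodedata


def nfkc(text):
    """Return text in Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample set, picked to mirror the categories in the thread.
SAMPLES = [
    ("\u2075", "SUPERSCRIPT FIVE"),
    ("\uFF21", "FULLWIDTH LATIN CAPITAL LETTER A"),
    ("\uFE91", "ARABIC LETTER BEH INITIAL FORM"),
    ("\u01C6", "LATIN SMALL LETTER DZ WITH CARON"),
    ("\u2103", "DEGREE CELSIUS"),
    ("\u2160", "ROMAN NUMERAL ONE"),
    ("A", "LATIN CAPITAL LETTER A"),
]

for char, label in SAMPLES:
    folded = nfkc(char)
    # Show the result as code points so invisible differences stay visible.
    codepoints = " ".join("U+%04X" % ord(c) for c in folded)
    status = "unchanged" if folded == char else "folded"
    print("%-34s -> %-14s (%s)" % (label, codepoints, status))
```

The capital A comes back unchanged while the other samples fold to
their compatibility equivalents, which is the distinction drawn above:
NFKC removes presentational variants but leaves case alone.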
