Rick McGowan wrote:
> the process as possible so that it can be considered
> The draft is found at http://www.unicode.org/reports/tr31/
> and feedback can be submitted as described there.
(Before submitting official feedback, I'd like to discuss my comments here. BTW, which "Type of Message" should I use in the feedback form? Is it OK to use "Technical Report or Tech Note issues"?)

My two cents are both about adding characters to class <Pattern_Syntax> in "4.1 Proposed Pattern Properties". IMHO:

1. Full-width, half-width, and "small" punctuation characters should be in class <Pattern_Syntax>, like their "normal width" counterparts.

2. Non-Latin punctuation characters should be in class <Pattern_Syntax>, like their Latin counterparts.

The rationale for suggestion 1 is that <wide>, <narrow> and <small> compatibility characters are substantially identical (in appearance and function) to their "normal width" counterparts. A parser allowing an unquoted full-width punctuation character in an identifier is guaranteed to confuse the user. E.g., consider the following expression:

    foo,bar

To me, it *definitely* looks like two identifiers separated by a comma, and I expect my parser to agree with me on this, even if the "comma" is actually a full-width comma.

I am not saying that the parser must necessarily accept a full-width comma in that position: it is perfectly OK if the above expression causes a syntax error such as: "Illegal character U+FF0C (FULLWIDTH COMMA) after identifier <foo>". But what the parser should absolutely *not* do, IMHO, is handle "foo,bar" as a *single* identifier!

Doing such a thing is guaranteed to cause me trouble. E.g., I might receive a puzzling error message saying: "Parameter missing: this statement requires 2 parameters", while I can *see* that there *are* two parameters: "foo" and "bar"...

The rationale for suggestion 2 is very similar. E.g., the following expression looks like a perfectly legal C++ or Java statement:

    return;

If the compiler tells me: "Undeclared identifier", I may go crazy for the whole day trying to figure out what's going on...
But if it tells me "Illegal character U+037E (GREEK QUESTION MARK) after keyword <return>", then I immediately understand that something is wrong with that "semicolon".

The reason I keep suggestions 1 and 2 separate is that, in the case of <wide>, <narrow> and <small> compatibility characters, it is trivial to determine the corresponding regular character, while in the case of non-Latin punctuation there is room for discussing which punctuation characters are similar enough (in function or appearance) to which Latin punctuation character.

For full-width, half-width, and "small" punctuation characters, my suggestion is to add the following lines to "4.1 Proposed Pattern Properties":

    FE50..FE52 ; Pattern_Syntax # SMALL COMMA..SMALL FULL STOP
    FE54..FE57 ; Pattern_Syntax # SMALL SEMICOLON..SMALL EXCLAMATION MARK
    FE59..FE66 ; Pattern_Syntax # SMALL LEFT PARENTHESIS..SMALL EQUALS SIGN
    FE68..FE6B ; Pattern_Syntax # SMALL REVERSE SOLIDUS..SMALL COMMERCIAL AT
    FF01..FF0F ; Pattern_Syntax # FULLWIDTH EXCLAMATION MARK..FULLWIDTH SOLIDUS
    FF1A..FF20 ; Pattern_Syntax # FULLWIDTH COLON..FULLWIDTH COMMERCIAL AT
    FF3B..FF40 ; Pattern_Syntax # FULLWIDTH LEFT SQUARE BRACKET..FULLWIDTH GRAVE ACCENT
    FF5B..FF5E ; Pattern_Syntax # FULLWIDTH LEFT CURLY BRACKET..FULLWIDTH TILDE
    FF5F..FF61 ; Pattern_Syntax # FULLWIDTH LEFT WHITE PARENTHESIS..HALFWIDTH IDEOGRAPHIC FULL STOP
    FF64       ; Pattern_Syntax # HALFWIDTH IDEOGRAPHIC COMMA
    FFE0..FFE2 ; Pattern_Syntax # FULLWIDTH CENT SIGN..FULLWIDTH NOT SIGN
    FFE4..FFE5 ; Pattern_Syntax # FULLWIDTH BROKEN BAR..FULLWIDTH YEN SIGN
    FFE8..FFEE ; Pattern_Syntax # HALFWIDTH FORMS LIGHT VERTICAL..HALFWIDTH WHITE CIRCLE

For non-Latin punctuation characters, this is my tentative list of characters that may cause trouble if used in identifiers, and which, consequently, should be added to class <Pattern_Syntax>:

    037E GREEK QUESTION MARK
    0387 GREEK ANO TELEIA
    055C ARMENIAN EXCLAMATION MARK
    055D ARMENIAN COMMA
    055E ARMENIAN QUESTION MARK
    0589 ARMENIAN FULL STOP
    060C ARABIC COMMA
    060D ARABIC DATE SEPARATOR
    061B ARABIC SEMICOLON
    061F ARABIC QUESTION MARK
    066A ARABIC PERCENT SIGN
    066B ARABIC DECIMAL SEPARATOR
    066C ARABIC THOUSANDS SEPARATOR
    06D4 ARABIC FULL STOP
    0964 DEVANAGARI DANDA
    0965 DEVANAGARI DOUBLE DANDA
    10FB GEORGIAN PARAGRAPH SEPARATOR
    1362 ETHIOPIC FULL STOP
    1363 ETHIOPIC COMMA
    1364 ETHIOPIC SEMICOLON
    1365 ETHIOPIC COLON
    1366 ETHIOPIC PREFACE COLON
    1367 ETHIOPIC QUESTION MARK
    1368 ETHIOPIC PARAGRAPH SEPARATOR
    166E CANADIAN SYLLABICS FULL STOP
    1802 MONGOLIAN COMMA
    1803 MONGOLIAN FULL STOP
    1804 MONGOLIAN COLON
    1808 MONGOLIAN MANCHU COMMA
    1809 MONGOLIAN MANCHU FULL STOP
    1944 LIMBU EXCLAMATION MARK
    1945 LIMBU QUESTION MARK

But I am not 100% sure about all the above characters. Should any of them be removed from the list (i.e., allowed in identifiers)?

The following list includes all the non-Latin punctuation characters which I feel are not worth including in class <Pattern_Syntax>, because I think that, for one reason or another, they would cause no problem in identifiers:

    055A ARMENIAN APOSTROPHE
    055B ARMENIAN EMPHASIS MARK
    055F ARMENIAN ABBREVIATION MARK
    058A ARMENIAN HYPHEN
    05BE HEBREW PUNCTUATION MAQAF
    05C0 HEBREW PUNCTUATION PASEQ
    05C3 HEBREW PUNCTUATION SOF PASUQ
    05F3 HEBREW PUNCTUATION GERESH
    05F4 HEBREW PUNCTUATION GERSHAYIM
    066D ARABIC FIVE POINTED STAR
    0700 SYRIAC END OF PARAGRAPH
    0701 SYRIAC SUPRALINEAR FULL STOP
    0702 SYRIAC SUBLINEAR FULL STOP
    0703 SYRIAC SUPRALINEAR COLON
    0704 SYRIAC SUBLINEAR COLON
    0705 SYRIAC HORIZONTAL COLON
    0706 SYRIAC COLON SKEWED LEFT
    0707 SYRIAC COLON SKEWED RIGHT
    0708 SYRIAC SUPRALINEAR COLON SKEWED LEFT
    0709 SYRIAC SUBLINEAR COLON SKEWED RIGHT
    070A SYRIAC CONTRACTION
    070B SYRIAC HARKLEAN OBELUS
    070C SYRIAC HARKLEAN METOBELUS
    070D SYRIAC HARKLEAN ASTERISCUS
    0970 DEVANAGARI ABBREVIATION SIGN
    0DF4 SINHALA PUNCTUATION KUNDDALIYA
    0E4F THAI CHARACTER FONGMAN
    0E5A THAI CHARACTER ANGKHANKHU
    0E5B THAI CHARACTER KHOMUT
    0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
    0F05 TIBETAN MARK CLOSING YIG MGO SGAB MA
    0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
    0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
    0F08 TIBETAN MARK SBRUL SHAD
    0F09 TIBETAN MARK BSKUR YIG MGO
    0F0A TIBETAN MARK BKA- SHOG YIG MGO
    0F0B TIBETAN MARK INTERSYLLABIC TSHEG
    0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
    0F0D TIBETAN MARK SHAD
    0F0E TIBETAN MARK NYIS SHAD
    0F0F TIBETAN MARK TSHEG SHAD
    0F10 TIBETAN MARK NYIS TSHEG SHAD
    0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
    0F12 TIBETAN MARK RGYA GRAM SHAD
    0F3A TIBETAN MARK GUG RTAGS GYON
    0F3B TIBETAN MARK GUG RTAGS GYAS
    0F3C TIBETAN MARK ANG KHANG GYON
    0F3D TIBETAN MARK ANG KHANG GYAS
    0F85 TIBETAN MARK PALUTA
    104A MYANMAR SIGN LITTLE SECTION
    104B MYANMAR SIGN SECTION
    104C MYANMAR SYMBOL LOCATIVE
    104D MYANMAR SYMBOL COMPLETED
    104E MYANMAR SYMBOL AFOREMENTIONED
    104F MYANMAR SYMBOL GENITIVE
    1361 ETHIOPIC WORDSPACE
    166D CANADIAN SYLLABICS CHI SIGN
    169B OGHAM FEATHER MARK
    169C OGHAM REVERSED FEATHER MARK
    16EB RUNIC SINGLE PUNCTUATION
    16EC RUNIC MULTIPLE PUNCTUATION
    16ED RUNIC CROSS PUNCTUATION
    1735 PHILIPPINE SINGLE PUNCTUATION
    1736 PHILIPPINE DOUBLE PUNCTUATION
    17D4 KHMER SIGN KHAN
    17D5 KHMER SIGN BARIYOOSAN
    17D6 KHMER SIGN CAMNUC PII KUUH
    17D8 KHMER SIGN BEYYAL
    17D9 KHMER SIGN PHNAEK MUAN
    17DA KHMER SIGN KOOMUUT
    1800 MONGOLIAN BIRGA
    1801 MONGOLIAN ELLIPSIS
    1805 MONGOLIAN FOUR DOTS
    1806 MONGOLIAN TODO SOFT HYPHEN
    1807 MONGOLIAN SIBE SYLLABLE BOUNDARY MARKER
    180A MONGOLIAN NIRUGU
    10100 AEGEAN WORD SEPARATOR LINE
    10101 AEGEAN WORD SEPARATOR DOT
    1039F UGARITIC WORD DIVIDER

Should any of the above characters be added to <Pattern_Syntax> (i.e., *not* allowed in identifiers)?

_ Marco
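
P.S. To make suggestion 1 a bit more concrete, here is a minimal sketch of how a parser might spot <wide>, <narrow> and <small> punctuation, so that it can report it instead of swallowing it into an identifier. It assumes Python's standard unicodedata module and relies on the compatibility decomposition tags. The function names are made up for illustration; nothing here comes from the TR31 draft, and the result depends on the Unicode version your Python ships with.

```python
import unicodedata

# Compatibility tags that mark width or size variants of another character.
WIDTH_TAGS = ("<wide>", "<narrow>", "<small>")


def width_counterpart(ch):
    """Return the regular counterpart of a <wide>/<narrow>/<small>
    compatibility character, or None if ch is not such a character."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in WIDTH_TAGS:
        return chr(int(parts[1], 16))
    return None


def find_width_punctuation(text):
    """List (index, character, counterpart) for every width variant whose
    counterpart is punctuation (P*) or a symbol (S*)."""
    hits = []
    for index, ch in enumerate(text):
        base = width_counterpart(ch)
        if base is not None and unicodedata.category(base)[0] in "PS":
            hits.append((index, ch, base))
    return hits


def describe(ch):
    """Format an error message in the style used earlier in this message."""
    return "Illegal character U+%04X (%s)" % (ord(ch), unicodedata.name(ch))


# "foo" + FULLWIDTH COMMA + "bar": the comma is reported, not absorbed.
for index, ch, base in find_width_punctuation("foo\uff0cbar"):
    print(describe(ch))  # Illegal character U+FF0C (FULLWIDTH COMMA)
```

Full-width letters and digits are deliberately not flagged, since their counterparts are not punctuation or symbols. Suggestion 2 cannot be handled this way in general: most non-Latin punctuation has no decomposition pointing to a Latin counterpart, which is why that list needs discussion.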