Note Followup-To: comp.lang.java.programmer

Chris Uppal wrote:
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbol,
> punctuation.... ?), there doesn't seem to be any sane way that a language
> could allow an unrestricted choice of Unicode in identifiers.  Hence, it
> must define a specific allowed sub-set.  C certainly defines an allowed
> subset of Unicode characters -- so I don't think you could call its
> Unicode support "half-baked" (not in that respect, anyway).  A case -- not
> entirely convincing, IMO -- could be made that it would be better to allow
> a wider range of characters.
>
> And no, I don't think Java's approach -- where there /is no defined set of
> allowed identifier characters/ -- makes any sense at all :-(
Java does have a defined set of allowed identifier characters. However, you
certainly have to go around the houses a bit to work out what that set is:

<http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#3.8>

# An identifier is an unlimited-length sequence of Java letters and Java digits,
# the first of which must be a Java letter. An identifier cannot have the same
# spelling (Unicode character sequence) as a keyword (§3.9), boolean literal
# (§3.10.3), or the null literal (§3.10.7).
[...]
# A "Java letter" is a character for which the method
# Character.isJavaIdentifierStart(int) returns true. A "Java letter-or-digit"
# is a character for which the method Character.isJavaIdentifierPart(int)
# returns true.
[...]
# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

For Java 1.5.0:

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html>

# Character information is based on the Unicode Standard, version 4.0.

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierStart(int)>

# A character may start a Java identifier if and only if one of the following
# conditions is true:
#
#   * isLetter(codePoint) returns true
#   * getType(codePoint) returns LETTER_NUMBER
#   * the referenced character is a currency symbol (such as "$")

      [This means that getType(codePoint) returns CURRENCY_SYMBOL, i.e. Unicode
       General Category Sc.]

#   * the referenced character is a connecting punctuation character (such as "_").

      [This means that getType(codePoint) returns CONNECTOR_PUNCTUATION, i.e. Unicode
       General Category Pc.]
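As a rough illustration of how the isJavaIdentifierStart conditions map onto
general categories, here is a small sketch. The class name and the category()
helper are made up for this example; only the java.lang.Character calls are
real API. It assumes the running JDK's character tables, which on anything
newer than 1.5 are based on a later Unicode version than 4.0.

```java
class IdentifierStartDemo {
    // Hypothetical helper: name the general category of a code point,
    // but only for the categories that matter to this discussion.
    static String category(int cp) {
        switch (Character.getType(cp)) {
            case Character.CURRENCY_SYMBOL:       return "Sc";
            case Character.CONNECTOR_PUNCTUATION: return "Pc";
            case Character.LETTER_NUMBER:         return "Nl";
            case Character.DECIMAL_DIGIT_NUMBER:  return "Nd";
            default:
                // Lu, Ll, Lt, Lm and Lo are lumped together as "L*".
                return Character.isLetter(cp) ? "L*" : "other";
        }
    }

    public static void main(String[] args) {
        // 0x2160 is ROMAN NUMERAL ONE, a "numeric letter" (Nl).
        int[] samples = { 'A', '$', '_', 0x2160, '7', '-' };
        for (int cp : samples) {
            System.out.println(String.format("U+%04X  %-5s  start=%b",
                    cp, category(cp), Character.isJavaIdentifierStart(cp)));
        }
    }
}
```

The digit and the hyphen are included only as counter-examples: neither falls
into any of the four conditions quoted above.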
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isJavaIdentifierPart(int)>

# A character may be part of a Java identifier if any of the following are true:
#
#   * it is a letter
#   * it is a currency symbol (such as '$')
#   * it is a connecting punctuation character (such as '_')
#   * it is a digit
#   * it is a numeric letter (such as a Roman numeral character)

      [General Category Nl.]

#   * it is a combining mark

      [General Category Mc (see
       <http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf>).]

#   * it is a non-spacing mark

      [General Category Mn (ditto).]

#   * isIdentifierIgnorable(codePoint) returns true for the character

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isDigit(int)>

# A character is a digit if its general category type, provided by
# getType(codePoint), is DECIMAL_DIGIT_NUMBER.

  [General Category Nd.]

<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isIdentifierIgnorable(int)>

# The following Unicode characters are ignorable in a Java identifier or a Unicode
# identifier:
#
#   * ISO control characters that are not whitespace
#       o '\u0000' through '\u0008'
#       o '\u000E' through '\u001B'
#       o '\u007F' through '\u009F'
#   * all characters that have the FORMAT general category value

      [FORMAT is General Category Cf.]
<http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html#isLetter(int)>

# A character is considered to be a letter if its general category type, provided
# by getType(codePoint), is any of the following:
#
#   * UPPERCASE_LETTER
#   * LOWERCASE_LETTER
#   * TITLECASE_LETTER
#   * MODIFIER_LETTER
#   * OTHER_LETTER

====

To cut a long story short, the syntax of identifiers in Java 1.5 is therefore:

  Keyword ::= one of
      abstract    continue    for           new          switch
      assert      default     if            package      synchronized
      boolean     do          goto          private      this
      break       double      implements    protected    throw
      byte        else        import        public       throws
      case        enum        instanceof    return       transient
      catch       extends     int           short        try
      char        final       interface     static       void
      class       finally     long          strictfp     volatile
      const       float       native        super        while

  Identifier        ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
  IdentifierChars   ::= JavaLetter | IdentifierChars JavaLetterOrDigit
  JavaLetter        ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
  JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc |
                        U+0000..0008 | U+000E..001B | U+007F..009F | Cf

where the two-letter terminals refer to General Categories in Unicode 4.0.0
(exactly).

Note that the so-called "ignorable" characters (for which
isIdentifierIgnorable(codePoint) returns true) are not ignorable; they are
treated like any other identifier character. This quote from the API spec:

# The following Unicode characters are ignorable in a Java identifier [...]

should be ignored (no pun intended). It is contradicted by:

# Two identifiers are the same only if they are identical, that is, have the
# same Unicode character for each letter or digit.

in the language spec. Unicode does have a concept of ignorable characters in
identifiers, which is probably where this documentation bug crept in.

The inclusion of U+0000 and various control characters in the set of valid
identifier characters is also a dubious decision, IMHO.
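For what it's worth, the grammar above can be transcribed more or less
directly into Java. The following is only a sketch: the class and method names
are invented here, and Character.getType() is used as a stand-in for the
Unicode 4.0.0 tables, so on a newer JDK (which tracks a later Unicode version)
results for more recently assigned characters may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class IdentifierSketch {
    // The Java 1.5 keywords plus the three literals excluded by "butnot".
    private static final Set<String> RESERVED = new HashSet<String>(Arrays.asList(
        "abstract", "continue", "for", "new", "switch",
        "assert", "default", "if", "package", "synchronized",
        "boolean", "do", "goto", "private", "this",
        "break", "double", "implements", "protected", "throw",
        "byte", "else", "import", "public", "throws",
        "case", "enum", "instanceof", "return", "transient",
        "catch", "extends", "int", "short", "try",
        "char", "final", "interface", "static", "void",
        "class", "finally", "long", "strictfp", "volatile",
        "const", "float", "native", "super", "while",
        "true", "false", "null"));

    // JavaLetter ::= Lu | Ll | Lt | Lm | Lo | Nl | Sc | Pc
    static boolean isJavaLetter(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.LETTER_NUMBER:
            case Character.CURRENCY_SYMBOL:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // JavaLetterOrDigit ::= JavaLetter | Nd | Mn | Mc | control ranges | Cf
    static boolean isJavaLetterOrDigit(int cp) {
        if (isJavaLetter(cp)) return true;
        switch (Character.getType(cp)) {
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.FORMAT:
                return true;
            default:
                return (cp >= 0x0000 && cp <= 0x0008)
                    || (cp >= 0x000E && cp <= 0x001B)
                    || (cp >= 0x007F && cp <= 0x009F);
        }
    }

    // Identifier ::= IdentifierChars butnot (Keyword | "true" | "false" | "null")
    static boolean isIdentifier(String s) {
        if (s.length() == 0 || RESERVED.contains(s)) return false;
        int first = s.codePointAt(0);
        if (!isJavaLetter(first)) return false;
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isJavaLetterOrDigit(cp)) return false;
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+00AD SOFT HYPHEN has General Category Cf, i.e. it is "ignorable".
        String withSoftHyphen = "a" + (char) 0x00AD + "b";
        System.out.println(isIdentifier("foo"));              // plain ASCII name
        System.out.println(isIdentifier("class"));            // keyword
        System.out.println(isIdentifier(withSoftHyphen));     // contains a Cf character
        System.out.println(withSoftHyphen.equals("ab"));      // distinct spelling from "ab"
    }
}
```

The last two lines of main() are there to illustrate the "ignorable" point:
a string containing a Cf character is accepted by the grammar, yet it is a
different character sequence from the same string without it, which by the
language spec's rule makes it a different identifier.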
Note that I am not defending in any way the complexity of this definition;
there's clearly no excuse for it (or for the "ignorable" documentation bug).
The language spec should have been defined directly in terms of the Unicode
General Categories, and then the API in terms of the language spec. The way
it is done now is completely backwards.

--
David Hopwood <[EMAIL PROTECTED]>
--
http://mail.python.org/mailman/listinfo/python-list