Guido van Rossum wrote:
>> > But Unicode has many alternative sets digits for which "isdigit" is
>> true.
>>
>> You mean, the Python isdigit() method? Sure, but the tokenizer uses
>> the C isdigit function, which gives true only for [0-9].
>
> Isn't that because it's only defined on 8-bit characters though?

No: the C standard requires that isdigit is true if and only if the
character is from [0-9]; it also requires that the digits have
consecutive ordinals in the "execution character set", and that
each is represented by a single char (rather than requiring
multiple bytes). Currently, the tokenizer operates on UTF-8, which is
multi-byte, but still, isdigit works "correctly" (every byte of a
multi-byte UTF-8 sequence has its high bit set, so none of them can
be mistaken for an ASCII digit).

> And if we're talking about Unicode, why shouldn't we use the Unicode
> isdigit()? After all you were talking about the Unicode consortium's
> rules for which characters can be part of identifiers.

The tokenizer doesn't use isdigit() to determine what an identifier
is; it uses isalnum(). The parser uses isdigit only to determine
what a number literal is - I don't propose to change that.

The Unicode consortium rules are listed here:

http://www.unicode.org/reports/tr31/

This recommendation mentions two classes, ID_Start and ID_Continue:

ID_Start: Uppercase letters, lowercase letters, titlecase letters,
modifier letters, other letters, letter numbers, stability extensions

ID_Continue: All of the above, plus nonspacing marks, spacing
combining marks, decimal numbers, connector punctuations, stability
extensions. These are also known simply as Identifier Characters,
since they are a superset of the ID_Start. The set of ID_Continue
characters minus the ID_Start characters are known as
ID_Only_Continue characters.

In the implementation, a compact table should be used to determine
whether a character is ID_Start or ID_Continue, instead of calling
some library function.

There are some problems with the UAX#31 definitions IIRC, although
I forgot the exact details (might be that the underscore is missing,
or that the dollar is allowed); the definitions should be adjusted
so that they match the current language for ASCII.

>> FWIW, POSIX
>> allows 6 alternative characters to be defined as hexdigits for
>> isxdigit, so the tokenizer shouldn't really use isxdigit for
>> hexadecimal literals.
>
> I think if we're talking Unicode, POSIX is irrelevant though, right?

What I'm saying is that the tokenizer currently uses isxdigit; it
should stop doing so (whether or not Unicode identifiers become
part of the language). As source code would (still) be parsed
as UTF-8, isxdigit would continue to "work", but definitely
shouldn't be used anymore.

> But we force the locale to be C, right? I've never heard of someone
> who managed to type non-ASCII letters into identifiers, and I'm sure
> it would've been reported as a bug.

Python 2.3.5 (#2, Mar 6 2006, 10:12:24)
[GCC 4.0.3 20060304 (prerelease) (Debian 4.0.2-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
py> import locale
py> locale.setlocale(locale.LC_ALL, "")
'[EMAIL PROTECTED]'
py> löwis=1
py> print löwis
1

We don't force the C locale - we just happen to start with it
initially. We shouldn't change it later, as that isn't thread-safe.
Nobody reported it, because people just don't try to do that, except
in interactive mode.

>> I can't see why the Unicode notion of digits should affect the
>> language specification in any way. The notion of digit is only
>> used to define what number literals are, and I don't propose
>> to change the lexical rules for number literals - I propose
>> to change the rules for identifiers.
>
> Well identifiers can contain digits too.

Sure. But they don't "count" as digits then, lexically - they
are ID_Continue characters (which is a superset of digits).
So what we need is to extend the definition of ID_Continue,
not the definition of digits.

> I do think that *eventually* we'll have to support this. But I don't
> think Python needs to lead the pack here; I don't think the tools are
> ready yet.

Python doesn't really lead here. The C family of languages
(C, C++, Java, C#) all have Unicode identifiers, so there is
plenty of experience.
The main experience is that the feature isn't used much,
because of obstacles I think we can overcome (chiefly, that all
these languages make the source encoding implementation-defined;
we don't, as we put the source encoding into the source file).

Regards,
Martin
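
To make the ID_Start / ID_Continue distinction discussed above
concrete, here is a minimal Python sketch. It is an illustration under
stated assumptions, not the proposed tokenizer change: it asks the
unicodedata module for general categories rather than using a compact
table, it leaves out the "stability extensions" (unicodedata does not
expose them), and it adds the underscore by hand so that the ASCII
rules match the current language. All function names are hypothetical.

```python
import unicodedata

# General categories from the UAX #31 summary quoted in this thread.
# Assumption: the "stability extensions" are omitted, because
# unicodedata does not expose those properties.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_EXTRA = {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch):
    """True if the single character ch may begin an identifier."""
    # The underscore is category Pc, so it is added explicitly here to
    # keep the ASCII behaviour of the current language.
    return ch == "_" or unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch):
    """True if the single character ch may appear after the first one."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_identifier(s):
    """True if s is non-empty, starts with ID_Start, continues with ID_Continue."""
    return bool(s) and is_id_start(s[0]) and all(is_id_continue(c) for c in s[1:])


def is_ascii_digit(ch):
    """Mimics what the C standard guarantees for isdigit: only 0-9."""
    return len(ch) == 1 and "0" <= ch <= "9"


if __name__ == "__main__":
    arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, category Nd
    print(is_identifier("l\u00f6wis"))         # identifier with a non-ASCII letter
    print(is_identifier("1x"))                 # a digit cannot start an identifier
    print(is_id_continue(arabic_three))        # allowed inside an identifier
    print(is_ascii_digit(arabic_three))        # but not a number-literal digit
```

In this sketch a decimal number such as U+0663 is accepted by
is_id_continue but rejected by is_ascii_digit, which is the point made
above: extending identifiers means extending ID_Continue, while the
rules for number literals stay restricted to [0-9].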
