I mentioned this in another thread as an aside in the middle of the email, but I thought I'd put it out here at the top:
It should be considered whether formatting characters should be ignored. And if so, which list of properties should be used for that.

I notice that the excerpt from the C# standard says:

> * 4 Any formatting-characters are removed.

I don't know what they mean by that, but I'm going to guess characters in the Cf class. However, UAX #31 says:

> 2.2 Layout and Format Control Characters
>
> Certain Unicode characters are used to control joining behavior,
> bidirectional ordering control, and alternative formats for
> display. These have the General_Category value of Cf. Unlike space
> characters or other delimiters, they do not indicate word, line, or
> other unit boundaries.
>
> While it is possible to ignore these characters in determining
> identifiers, the recommendation is to not ignore them and to not
> permit them in identifiers except in special cases. This is because
> of the possibility for confusion between two visually identical
> strings; see [UTR36]. Some possible exceptions are the ZWJ and ZWNJ
> in certain contexts, such as between certain characters in Indic
> words.

It doesn't seem to me that an attack vector here is particularly relevant, so perhaps going along with C# and ignoring Cf characters in the source code might be a good idea. But I do notice that Unicode 4.0.1 and earlier used to recommend ignoring formatting characters in identifiers (Ch 5 of the book), so that might be where C# got it from.

So, maybe it's better to keep the status quo, and not allow Cf characters, unless someone comes up with a particular need for doing so. Hm, I think I've convinced myself of that now. :)

James
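
For anyone who wants to see the two options side by side (ignore Cf characters, or refuse them), here is a minimal Python sketch. It assumes "formatting characters" means General_Category Cf, which is only the guess made above about the C# wording. The helper names `has_format_chars` and `strip_format_chars` are made up for this example and are not part of any existing tokenizer.

```python
import unicodedata


def has_format_chars(s):
    """Return True if s contains any character with General_Category Cf."""
    return any(unicodedata.category(ch) == "Cf" for ch in s)


def strip_format_chars(s):
    """The 'ignore' policy: drop every Cf character before comparing names."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


plain = "count"
# Same letters with ZERO WIDTH JOINER (U+200D, category Cf) after the 'u'.
# The two strings usually render identically but are different code point
# sequences.
sneaky = "cou\u200dnt"

# Without any special handling the two names are simply different strings.
names_differ = plain != sneaky

# 'Ignore' policy: both spellings collapse to the same identifier.
same_after_strip = strip_format_chars(sneaky) == strip_format_chars(plain)

# 'Reject' policy (the status quo argued for above): flag the second
# spelling instead of silently accepting it.
would_reject = has_format_chars(sneaky)
```

Under the "ignore" policy the two spellings become one name; under the "reject" policy the second spelling is an error. Neither helper handles the ZWJ/ZWNJ exceptions that UAX #31 mentions for certain contexts.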
