On Tue, Mar 22, 2016 at 12:48 AM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Mon, 21 Mar 2016 11:59 pm, Chris Angelico wrote:
>
>> On Mon, Mar 21, 2016 at 11:34 PM, BartC <b...@freeuk.com> wrote:
>>> For Python I would have used a table of 0..255 functions, indexed by
>>> the ord() code of each character. So all 52 letter codes map to the
>>> same name-handling function. (No Dict is needed at this point.)
>>>
>>
>> Once again, you forget that there are not 256 characters - there are
>> 1114112. (Give or take.)
>
> Pardon me, do I understand you correctly? You're saying that the C parser
> is Unicode-aware and allows you to use Unicode in C source code? Because
> Bart's test is for a (simplified?) C tokeniser, and expecting his
> tokeniser to support character sets that C does not would be, well, Not
> Cricket, my good chap.
We nutted part of this out earlier in the thread; Python 3.x code is, and
any other modern language should be, defined to have Unicode source. (And
yes, MRAB, I'm aware that only a tiny fraction of codepoints are defined;
it's still a lot more than 256, and it's going to make for a far larger
lookup table.)

While you could plausibly define that your source code consists only of
ASCII text characters (e.g. codes 9, 10, 13, and 32-126), it is an
extremely bad idea to declare that it has exactly 256 possibilities -
you're shackling your language to a parser definition that includes both
more than people will expect (the meaningless 128-255 slots) and less
(no non-ASCII source text).

ChrisA
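
PS. To make the tradeoff concrete, here's a minimal sketch of the kind of
dispatch Bart described - my own illustration, not his code, and all the
function names are invented. It keeps a small table for the ASCII fast
path, but it has to decide explicitly what happens for the other ~1.1
million codepoints; that fallback branch is exactly what a fixed 256-entry
table sweeps under the rug.

# Build a 128-entry dispatch table for the ASCII fast path.  Anything
# else - about 1.1 million possible codepoints - must be classified
# explicitly in the fallback branch inside tokenize().

def read_name(src, i):
    """Consume an identifier starting at src[i]; return (token, next_i)."""
    j = i
    while j < len(src) and (src[j].isalnum() or src[j] == '_'):
        j += 1
    return ('name', src[i:j]), j

def read_number(src, i):
    """Consume a run of digits."""
    j = i
    while j < len(src) and src[j].isdigit():
        j += 1
    return ('number', src[i:j]), j

def read_punct(src, i):
    """Everything else becomes a one-character token (fine for a sketch)."""
    return ('punct', src[i]), i + 1

def skip_space(src, i):
    return None, i + 1

DISPATCH = [read_punct] * 128
for c in range(128):
    ch = chr(c)
    if ch.isalpha() or ch == '_':
        DISPATCH[c] = read_name
    elif ch.isdigit():
        DISPATCH[c] = read_number
    elif ch.isspace():
        DISPATCH[c] = skip_space

def tokenize(src):
    tokens = []
    i = 0
    while i < len(src):
        cp = ord(src[i])
        if cp < 128:
            handler = DISPATCH[cp]
        elif src[i].isidentifier():
            # Non-ASCII letters can start names too; a 256-entry table
            # has no slot in which to say so.
            handler = read_name
        else:
            raise SyntaxError('unexpected character %r' % src[i])
        tok, i = handler(src, i)
        if tok is not None:
            tokens.append(tok)
    return tokens

print(tokenize('naïve_name = café + 42'))

Run it and the non-ASCII names tokenize happily; delete the fallback
branch and you're back in the 256-entry world, where "café" is a syntax
error.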