On 21/03/2016 12:59, Chris Angelico wrote:
> On Mon, Mar 21, 2016 at 11:34 PM, BartC <b...@freeuk.com> wrote:
>> For Python I would have used a table of 0..255 functions, indexed by the
>> ord() code of each character. So all 52 letter codes map to the same
>> name-handling function. (No Dict is needed at this point.)
>
> Once again, you forget that there are not 256 characters - there are
> 1114112. (Give or take.)

The original code for this test expected the data to be a stream of bytes, mostly ASCII, with any Unicode in the input encoded as UTF-8.

Since this was designed to tokenise C, and I don't think C supports Unicode except in comments and in string literals, it is not necessary to do anything with UTF-8 sequences except ignore them or pass them through unchanged. (I'm ignoring 'wide' string and char literals.)

But it doesn't make any difference: you process a byte at a time, and trap codes C0 to FF, which mark the start of a multi-byte UTF-8 sequence.
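Something along these lines (an untested sketch; the handler names and stub bodies are just my illustration, not the actual code) is what I mean by a byte-level dispatch table:

def is_name_byte(b):
    return (0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A      # A-Z, a-z
            or 0x30 <= b <= 0x39 or b == 0x5F)           # 0-9, _

def handle_name(data, i):
    # all 52 letter codes (and '_') dispatch here; consume the whole name
    while i < len(data) and is_name_byte(data[i]):
        i += 1
    return i

def handle_other(data, i):
    # stub for everything else: digits, punctuation, whitespace, ...
    return i + 1

def handle_utf8_lead(data, i):
    # bytes C0..FF start a multi-byte UTF-8 sequence; step over it unchanged
    i += 1
    while i < len(data) and 0x80 <= data[i] <= 0xBF:     # continuation bytes
        i += 1
    return i

dispatch = [handle_other] * 256
for b in range(256):
    if is_name_byte(b) and not (0x30 <= b <= 0x39):      # letters and '_' only
        dispatch[b] = handle_name
for b in range(0xC0, 0x100):
    dispatch[b] = handle_utf8_lead

def scan(data):                          # data: a bytes object (the whole file)
    i = 0
    while i < len(data):
        i = dispatch[data[i]](data, i)   # one indexed call per dispatch

scan(b'int caf\xc3\xa9 = 42;')           # the UTF-8 bytes just pass through

No dict lookup is needed at this point; it's a single indexed call per dispatch.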

I understand that Python 3, reading a file in text mode, can do this decoding automatically and give you a string that might contain code points above 127. That's not a problem: you can still treat the first 128 code points exactly as I have, and give special treatment to the rest. But you /will/ need to know whether the data is a raw UTF-8 byte stream or has already been decoded into Unicode.

(I'm talking about 'top-level' character dispatch, where you're looking for the start of a token.)
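For the already-decoded case the same approach still works; again a rough, untested sketch with invented handler names, where the first 128 code points keep their own table slots and everything above 127 falls into one catch-all handler:

def handle_name(text, i):
    # consume letters, digits and '_' (non-ASCII letters included here,
    # which is fine: they just pass through unchanged)
    while i < len(text) and (text[i].isalnum() or text[i] == '_'):
        i += 1
    return i

def handle_other(text, i):
    return i + 1                         # stub: punctuation, digits, spaces...

def handle_non_ascii(text, i):
    return i + 1                         # code point >= 128: pass it through

dispatch = [handle_other] * 128
for ch in 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_':
    dispatch[ord(ch)] = handle_name

def scan(text):                          # text: str, e.g. from a text-mode file
    i = 0
    while i < len(text):
        c = ord(text[i])
        handler = dispatch[c] if c < 128 else handle_non_ascii
        i = handler(text, i)

scan('int café = 42;')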

Note that my test data was 5,964,784 bytes on disk, of which 14 had values above 127: probably 3 or 4 Unicode characters, and most likely in comments.

Given that 99.9998% of the input bytes are ASCII, and 99.9999% of the characters (in this data), is it unreasonable to concentrate on the 0..127 range?


--
Bartc
