On 21/03/2016 12:59, Chris Angelico wrote:
> On Mon, Mar 21, 2016 at 11:34 PM, BartC <b...@freeuk.com> wrote:
>> For Python I would have used a table of 0..255 functions, indexed by the
>> ord() code of each character. So all 52 letter codes map to the same
>> name-handling function. (No Dict is needed at this point.)
>
> Once again, you forget that there are not 256 characters - there are
> 1114112. (Give or take.)

The original code for this test expected the data to be a stream of bytes, mostly ASCII, with any Unicode in the input encoded as UTF-8.

Since this was designed to tokenise C, and I don't think C supports Unicode except in comments and in string literals, it is not necessary to do anything with UTF-8 sequences except ignore them or pass them through unchanged. (I'm ignoring 'wide' string and char literals.)

But it doesn't make any difference: you process a byte at a time, and trap codes C0 to FF, which mark the start of a multi-byte UTF-8 sequence.
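Something along these lines (an untested sketch; the handler names and stub bodies are just my illustration, not the actual code) is what I mean by a byte-level dispatch table:

def is_name_byte(b):
    return (0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A      # A-Z, a-z
            or 0x30 <= b <= 0x39 or b == 0x5F)           # 0-9, _

def handle_name(data, i):
    # all 52 letter codes (and '_') dispatch here; consume the whole name
    while i < len(data) and is_name_byte(data[i]):
        i += 1
    return i

def handle_other(data, i):
    # stub for everything else: digits, punctuation, whitespace, ...
    return i + 1

def handle_utf8_lead(data, i):
    # bytes C0..FF start a multi-byte UTF-8 sequence; step over it unchanged
    i += 1
    while i < len(data) and 0x80 <= data[i] <= 0xBF:     # continuation bytes
        i += 1
    return i

dispatch = [handle_other] * 256
for b in range(256):
    if is_name_byte(b) and not (0x30 <= b <= 0x39):      # letters and '_' only
        dispatch[b] = handle_name
for b in range(0xC0, 0x100):
    dispatch[b] = handle_utf8_lead

def scan(data):                          # data: a bytes object (the whole file)
    i = 0
    while i < len(data):
        i = dispatch[data[i]](data, i)   # one indexed call per dispatch

scan(b'int caf\xc3\xa9 = 42;')           # the UTF-8 bytes just pass through

No dict lookup is needed at this point; it's a single indexed call per dispatch.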

I understand that Python 3, reading a file in text mode, can do this decoding automatically and give you a string that might contain code points above 127. That's not a problem: you can still treat the first 128 code points exactly as I have, and give special treatment to the rest. But you /will/ need to know whether the data is a raw UTF-8 byte stream or has already been decoded into Unicode.

(I'm talking about 'top-level' character dispatch, where you're looking for the start of a token.)
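For the already-decoded case the same approach still works; again a rough, untested sketch with invented handler names, where the first 128 code points keep their own table slots and everything above 127 falls into one catch-all handler:

def handle_name(text, i):
    # consume letters, digits and '_' (non-ASCII letters included here,
    # which is fine: they just pass through unchanged)
    while i < len(text) and (text[i].isalnum() or text[i] == '_'):
        i += 1
    return i

def handle_other(text, i):
    return i + 1                         # stub: punctuation, digits, spaces...

def handle_non_ascii(text, i):
    return i + 1                         # code point >= 128: pass it through

dispatch = [handle_other] * 128
for ch in 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_':
    dispatch[ord(ch)] = handle_name

def scan(text):                          # text: str, e.g. from a text-mode file
    i = 0
    while i < len(text):
        c = ord(text[i])
        handler = dispatch[c] if c < 128 else handle_non_ascii
        i = handler(text, i)

scan('int café = 42;')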

Note that my test data was 5,964,784 bytes on disk, of which 14 had values above 127: probably 3 or 4 Unicode characters, and most likely in comments.

Given that 99.9998% of the input bytes are ASCII, and 99.9999% of the characters (in this data), is it unreasonable to concentrate on the 0..127 range?


--
Bartc
