On 25.03.12 20:01, Antoine Pitrou wrote:
> The general problem with decoding is that you don't know up front what
> width (1, 2 or 4 bytes) is required for the result. The solution is
> either to compute the width in a first pass (and decode in a second
> pass), or decode in a single pass and enlarge the result on the fly
> when needed. Both incur a slowdown compared to a single-size
> representation.

We can significantly reduce the number of checks by using the same trick
that is used for fast checking of surrogate characters. While all
characters seen so far are < U+0100, we know that the result is a 1-byte
string (ASCII while all characters are < U+0080). Once we meet a
character >= U+0100, then while all characters are < U+10000, we know
that the result is a 2-byte string. As soon as we meet the first
character >= U+10000, we work with a 4-byte string. So there are several
fast loops, and the transition to the next loop occurs after a failure
in the previous one.
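
Here is a minimal sketch of that structure in C (not the actual CPython
decoder): it takes already-decoded UCS-4 code points as input, and the
names decode_ucs4_sketch and decoded_str are made up for illustration;
malloc failure checks are omitted for brevity.

/* Cascading fast loops: put UCS-4 code points into the narrowest
   buffer, widening only when a loop "fails" on a too-large character.
   A minimal sketch, not the actual CPython code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int kind;    /* bytes per character: 1, 2 or 4 */
    void *data;  /* uint8_t*, uint16_t* or uint32_t* */
} decoded_str;

static decoded_str decode_ucs4_sketch(const uint32_t *src, size_t len)
{
    decoded_str out;
    size_t i = 0, j;

    uint8_t *buf1 = malloc(len ? len : 1);
    /* Fast loop 1: no widening while all characters are < U+0100. */
    for (; i < len; i++) {
        if (src[i] >= 0x100)
            goto widen2;  /* failure: switch to the 2-byte loop */
        buf1[i] = (uint8_t)src[i];
    }
    out.kind = 1; out.data = buf1;
    return out;

widen2:;
    /* Enlarge on the fly: copy what we have, then continue. */
    uint16_t *buf2 = malloc(len * sizeof *buf2);
    for (j = 0; j < i; j++)
        buf2[j] = buf1[j];
    free(buf1);
    /* Fast loop 2: no widening while all characters are < U+10000. */
    for (; i < len; i++) {
        if (src[i] >= 0x10000)
            goto widen4;  /* failure: switch to the 4-byte loop */
        buf2[i] = (uint16_t)src[i];
    }
    out.kind = 2; out.data = buf2;
    return out;

widen4:;
    /* Last widening; from here on no range check is needed at all. */
    uint32_t *buf4 = malloc(len * sizeof *buf4);
    for (j = 0; j < i; j++)
        buf4[j] = buf2[j];
    free(buf2);
    for (; i < len; i++)
        buf4[i] = src[i];
    out.kind = 4; out.data = buf4;
    return out;
}

int main(void)
{
    const uint32_t src[] = {0x41, 0x42, 0x100, 0x10400};
    decoded_str s = decode_ucs4_sketch(src, 4);
    printf("kind = %d\n", s.kind);  /* prints: kind = 4 */
    free(s.data);
    return 0;
}

Each hot loop does a single range check per character, and a check only
ever fails once per widening, so the common case stays as tight as a
single-size loop.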
> It's probably a measurement error on your part.

Anyone can test:
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop
The results are fairly stable (±0.1 µsec) from run to run. Funnily
enough, the all-ASCII input (U+0020) decodes about twice as slowly as
the non-ASCII Latin-1 input (U+0080).