Re: [pypy-dev] Speeds of various utf8 operations

Armin Rigo Sun, 05 Mar 2017 23:14:32 -0800

Hi Maciej,

On 5 March 2017 at 20:24, Maciej Fijalkowski <fij...@gmail.com> wrote:
> This is checking for spaces in unicode (so it's known to be valid utf8)


Ok, then you might have missed another property of UTF-8: when you
check for "being a substring" in UTF-8, you don't need to do any
decoding.  Instead you only need to check "being a substring" on the
two encoded UTF-8 strings.  This always works as expected, i.e. you
can never get a positive answer by chance, because the first byte of
an encoded character can never be mistaken for a continuation byte,
so a match cannot start in the middle of a character.  So for example:

    x in y   can be implemented as   x._utf8 in y._utf8

and in this case, you can find spaces in a unicode string just by
searching for the 10 byte sequences that are spaces-encoded-as-UTF-8
(11 if you also count '\r\n' as one such pattern).
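
A minimal sketch of the idea in plain Python, not PyPy's actual
implementation.  The helper names and the short SPACE_CHARS list are
made up for illustration (it is not the full set of patterns counted
above), and both inputs are assumed to be valid UTF-8:

```python
def utf8_contains(haystack: str, needle: str) -> bool:
    """Substring test done purely on the encoded bytes (no decoding)."""
    return needle.encode("utf-8") in haystack.encode("utf-8")


# Hypothetical, deliberately incomplete selection of whitespace characters.
SPACE_CHARS = [" ", "\t", "\n", "\r", "\u00a0", "\u2028"]
SPACE_PATTERNS = [c.encode("utf-8") for c in SPACE_CHARS]


def find_first_space(data: bytes) -> int:
    """Return the byte offset of the first whitespace pattern, or -1."""
    hits = [i for i in (data.find(p) for p in SPACE_PATTERNS) if i != -1]
    return min(hits) if hits else -1
```

Note that the result of find_first_space is a byte offset, not a
character index; turning it into a character index is a separate
problem.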

That's also how the 're' module could be rewritten to directly handle
UTF-8 strings, instead of decoding them first.
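
As a rough illustration only (this uses the stock 're' module in bytes
mode, not a rewritten one), a set of non-ASCII characters can be
matched against UTF-8 data by turning it into an alternation of
encoded byte sequences.  SPACES is again a hypothetical subset and
split_on_spaces is a made-up helper name:

```python
import re

# Hypothetical subset of whitespace characters, for illustration only.
SPACES = [" ", "\t", "\n", "\u0085", "\u2028", "\u2029"]

# A byte-level character class cannot express multi-byte characters,
# so build an alternation of the escaped UTF-8 encodings instead.
SPACE_RE = re.compile(
    b"|".join(re.escape(c.encode("utf-8")) for c in SPACES)
)


def split_on_spaces(data: bytes):
    """Split valid UTF-8 bytes on the patterns above, dropping empty parts."""
    return [part for part in SPACE_RE.split(data) if part]
```

Each piece returned is itself valid UTF-8, since a match can only
begin and end on character boundaries.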


A bientôt,

Armin.
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
