Hi Maciej,
On 5 March 2017 at 20:24, Maciej Fijalkowski <[email protected]> wrote:
> This is checking for spaces in unicode (so it's known to be valid utf8)
Ok, then you might have missed another property of UTF-8: when you
check for "being a substring" in UTF-8, you don't need to do any
decoding. Instead you only need to check "being a substring" with the
two encoded UTF-8 strings. This always works as expected, i.e. you
can never get a false positive, because the encoding of one character
can never start in the middle of the encoding of another. So for
example:
    x in y      can be implemented as      x._utf8 in y._utf8
and in this case, you can find spaces in a unicode string just by
searching for the ten byte patterns that are spaces encoded as UTF-8
(eleven if you also count '\r\n' as one such pattern).
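
To make this concrete, here is a minimal sketch in plain Python. The
names contains_utf8, SPACE_PATTERNS and find_first_space are made up
for illustration and are not PyPy internals; str.encode() stands in
for the hypothetical _utf8 attribute, and the pattern list is only a
small sample, not the full set of whitespace characters.

```python
# Hypothetical helpers for illustration; not PyPy's actual internals.

def contains_utf8(x: str, y: str) -> bool:
    """Return True if x occurs in y, comparing only the encoded bytes."""
    # encode() stands in for the stored UTF-8 buffer (x._utf8 above).
    return x.encode("utf-8") in y.encode("utf-8")


# A small sample of whitespace characters, encoded once up front.
# Assumption: a real implementation would list every pattern it cares about.
SPACE_PATTERNS = [
    c.encode("utf-8") for c in (" ", "\t", "\n", "\r", "\u00a0", "\u2003")
]


def find_first_space(encoded: bytes) -> int:
    """Return the byte offset of the first whitespace pattern, or -1."""
    offsets = (encoded.find(pattern) for pattern in SPACE_PATTERNS)
    hits = [offset for offset in offsets if offset != -1]
    return min(hits) if hits else -1
```

Note that find_first_space returns a byte offset into the encoded
string, not a character index, which is what you want if the string is
stored as UTF-8 anyway.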
That's also how the 're' module could be rewritten to directly handle
UTF-8 strings, instead of decoding them first.
A bientôt,
Armin.
_______________________________________________
pypy-dev mailing list
[email protected]
https://mail.python.org/mailman/listinfo/pypy-dev