Hi Maciej,

On 5 March 2017 at 20:24, Maciej Fijalkowski <fij...@gmail.com> wrote:
> This is checking for spaces in unicode (so it's known to be valid utf8)

Ok, then you might have missed another property of UTF-8: when you check for "being a substring" in UTF-8, you don't need to do any decoding. Instead you only need to check "being a substring" with the two encoded UTF-8 strings. This always works as expected, i.e. you can never get a positive answer by chance. So for example:

    x in y

can be implemented as

    x._utf8 in y._utf8

and in this case, you can find spaces in a unicode string just by searching for the 10 byte patterns that are spaces-encoded-as-UTF-8 (11 if you also count '\n\r' as one such pattern).

That's also how the 're' module could be rewritten to directly handle UTF-8 strings, instead of decoding them first.

A bientôt,

Armin.
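
For illustration, here is a minimal Python sketch of the byte-level substring check described above. It is not PyPy's actual code: `utf8_contains`, `find_first_space` and `SPACE_PATTERNS` are hypothetical names, `.encode("utf-8")` stands in for the internal `_utf8` field, and the pattern list is only a small illustrative subset, not the full set of space patterns mentioned in the message.

```python
def utf8_contains(needle: str, haystack: str) -> bool:
    """Substring test done purely on the encoded bytes, with no decoding.

    .encode("utf-8") stands in for the hypothetical `_utf8` field.
    """
    return needle.encode("utf-8") in haystack.encode("utf-8")


# Illustrative subset only (assumption): space, no-break space,
# em space, ideographic space. Not the complete list.
SPACE_PATTERNS = tuple(
    ch.encode("utf-8") for ch in (" ", "\u00a0", "\u2003", "\u3000")
)


def find_first_space(data: bytes) -> int:
    """Return the byte offset of the first pattern in valid UTF-8 data, or -1."""
    hits = [i for i in (data.find(p) for p in SPACE_PATTERNS) if i != -1]
    return min(hits) if hits else -1
```

The byte-level answer agrees with the ordinary `str` check because a valid UTF-8 encoding of one character can never start in the middle of another character's encoding, so a match on bytes is always a match on whole characters. Note that `find_first_space` returns a byte offset, not a character index.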