Yes, sure, I'm aware of that :-) The problem only shows up when the "start" and "end" parameters are used.
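
To make the "start"/"end" issue concrete, here is a rough sketch in plain Python, not the actual PyPy code. Containment works directly on the encoded bytes, but "start" and "end" are code point indices, so they have to be translated to byte offsets before searching, and the result has to be translated back. The helper names (codepoint_to_byte_offset, byte_offset_to_codepoint, utf8_contains, utf8_find) are made up for illustration, and the sketch assumes valid UTF-8 and 0 <= start <= end <= len(haystack).

```python
def codepoint_to_byte_offset(utf8, index):
    """Byte offset of code point number `index` in valid UTF-8 bytes.

    An index equal to the number of code points maps to len(utf8).
    """
    seen = 0
    for offset, byte in enumerate(utf8):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return offset
            seen += 1
    return len(utf8)


def byte_offset_to_codepoint(utf8, offset):
    """Number of code points that start before byte `offset`."""
    return sum(1 for byte in utf8[:offset] if byte & 0xC0 != 0x80)


def utf8_contains(x, y):
    # No decoding needed: a byte-level match is always a real match.
    return x.encode("utf-8") in y.encode("utf-8")


def utf8_find(haystack, needle, start=0, end=None):
    """Like haystack.find(needle, start, end), but searching the UTF-8 bytes.

    Assumes 0 <= start <= end <= len(haystack); negative indices are not handled.
    """
    h = haystack.encode("utf-8")
    n = needle.encode("utf-8")
    # This translation is the extra linear scan that plain containment avoids.
    byte_start = codepoint_to_byte_offset(h, start)
    byte_end = len(h) if end is None else codepoint_to_byte_offset(h, end)
    pos = h.find(n, byte_start, byte_end)
    return -1 if pos < 0 else byte_offset_to_codepoint(h, pos)
```

Without "start" and "end" the search is a plain bytes search; with them, each call pays for the index translation unless some kind of index cache is kept around.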

On Mon, Mar 6, 2017 at 11:13 AM, Armin Rigo <armin.r...@gmail.com> wrote:
> Hi Maciej,
>
> On 5 March 2017 at 20:24, Maciej Fijalkowski <fij...@gmail.com> wrote:
>> This is checking for spaces in unicode (so it's known to be valid utf8)
>
> Ok, then you might have missed another property of UTF-8: when you
> check for "being a substring" in UTF-8, you don't need to do any
> decoding. Instead you only need to check "being a substring" with the
> two encoded UTF-8 strings. This always works as expected, i.e. you
> can never get a positive answer by chance. So for example:
>
>     x in y      can be implemented as      x._utf8 in y._utf8
>
> and in this case, you can find spaces in a unicode string just by
> searching for the 10 byte patterns that are spaces-encoded-as-UTF-8
> (11 if you also count '\n\r' as one such pattern).
>
> That's also how the 're' module could be rewritten to directly handle
> UTF-8 strings, instead of decoding it first.
>
>
> A bientôt,
>
> Armin.