Yes sure, I'm aware of that :-)

The problem only shows up when the "start" and "end" parameters are used.
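
To make the "start"/"end" issue concrete, here is a rough sketch, not the actual PyPy code. It assumes "start" and "end" are non-negative code-point indices (as in str.find) and that the needle is non-empty. The helper names codepoint_to_byte_offset and find_utf8 are made up for illustration. The point is that the indices have to be translated to byte offsets before the plain byte search can run, and the result translated back:

```python
def codepoint_to_byte_offset(utf8: bytes, index: int) -> int:
    """Byte offset of the code point at position `index` (hypothetical helper).

    Linear scan that counts lead bytes, i.e. every byte that is not a
    continuation byte of the form 0b10xxxxxx.
    """
    count = 0
    for offset, byte in enumerate(utf8):
        if (byte & 0xC0) != 0x80:  # start of a new code point
            if count == index:
                return offset
            count += 1
    return len(utf8)  # index is at or past the end of the string


def find_utf8(haystack: bytes, needle: bytes, start: int, end: int) -> int:
    """Mimic str.find(needle, start, end) on valid UTF-8 bytes.

    `start` and `end` are code-point indices; the result is a code-point
    index as well, or -1 if the needle is not found.
    """
    b_start = codepoint_to_byte_offset(haystack, start)
    b_end = codepoint_to_byte_offset(haystack, end)
    pos = haystack.find(needle, b_start, b_end)
    if pos < 0:
        return -1
    # Convert the byte offset of the match back to a code-point index.
    return sum((b & 0xC0) != 0x80 for b in haystack[:pos])
```

Both conversions are linear scans here, which is exactly the cost that does not exist when no "start"/"end" is given.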

On Mon, Mar 6, 2017 at 11:13 AM, Armin Rigo <armin.r...@gmail.com> wrote:
> Hi Maciej,
>
> On 5 March 2017 at 20:24, Maciej Fijalkowski <fij...@gmail.com> wrote:
>> This is checking for spaces in unicode (so it's known to be valid utf8)
>
> Ok, then you might have missed another property of UTF-8: when you
> check for "being a substring" in UTF-8, you don't need to do any
> decoding.  Instead you only need to check "being a substring" with the
> two encoded UTF-8 strings.  This always works as expected, i.e. you
> can never get a positive answer by chance.  So for example:
>
>     x in y   can be implemented as   x._utf8 in y._utf8
>
> and in this case, you can find spaces in a unicode string just by
> searching for the 10 byte patterns that are spaces-encoded-as-UTF-8
> (11 if you also count '\n\r' as one such pattern).
>
> That's also how the 're' module could be rewritten to directly handle
> UTF-8 strings, instead of decoding them first.
>
>
> A bientôt,
>
> Armin.
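
For reference, the property described above can be checked with a small sketch. Assumptions: the encode() calls stand in for an already-stored `_utf8` field, the function names are hypothetical, and SPACE_PATTERNS holds only a few illustrative whitespace characters, not the full set of patterns mentioned above.

```python
def contains(x: str, y: str) -> bool:
    """'x in y' decided purely on the UTF-8 bytes, with no decoding.

    In a real implementation the bytes would already be stored on the
    string object; encoding here only simulates that.
    """
    return x.encode("utf-8") in y.encode("utf-8")


# Illustrative subset of whitespace characters, pre-encoded as UTF-8.
SPACE_PATTERNS = tuple(
    ch.encode("utf-8") for ch in (" ", "\t", "\u00a0", "\u2028")
)


def find_first_space(utf8: bytes) -> int:
    """Byte offset of the first whitespace pattern, or -1 if none occurs."""
    hits = [pos for pos in (utf8.find(pat) for pat in SPACE_PATTERNS) if pos >= 0]
    return min(hits) if hits else -1
```

Note that find_first_space returns a byte offset, not a code-point index, so no index conversion is needed as long as the caller also works in byte offsets.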
_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
