New submission from Steven D'Aprano <steve+pyt...@pearwood.info>:
I think there is an opportunity to speed up some unicode normalisations significantly. In 3.9 at least, the time taken to normalise a string appears to depend on its length, even when the string is pure ASCII:

>>> from timeit import Timer
>>> setup = "from unicodedata import normalize; s = 'reverse'"
>>> t1 = Timer('normalize("NFKC", s)', setup=setup)
>>> setup = "from unicodedata import normalize; s = 'reverse'*1000"
>>> t2 = Timer('normalize("NFKC", s)', setup=setup)
>>>
>>> min(t1.repeat(repeat=7))
0.04854234401136637
>>> min(t2.repeat(repeat=7))
9.98313440399943

But ASCII strings are always in normalised form, for all four normalisation forms. In CPython, with PEP 393 (the Flexible String Representation), detecting whether a string is pure ASCII is a constant-time operation, so the normalisation could return such strings unchanged rather than scanning them (see the sketch at the end of this message).

----------
components: Unicode
messages: 400192
nosy: ezio.melotti, steven.daprano, vstinner
priority: normal
severity: normal
status: open
title: Speed up unicode normalization of ASCII strings
type: enhancement
versions: Python 3.11

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue44987>
_______________________________________
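A minimal sketch of the proposed fast path, written at the Python level for illustration only; the actual change would belong in the C implementation of the unicodedata module, where the PEP 393 ASCII flag can be read directly. The wrapper name normalize_fast is hypothetical, and str.isascii() (available since 3.7) stands in for the constant-time flag check:

    import unicodedata

    def normalize_fast(form, s):
        # An ASCII string is already in NFC, NFD, NFKC and NFKD form,
        # so it can be returned unchanged.  In CPython, str.isascii()
        # just reads the PEP 393 ASCII flag, so this guard costs the
        # same no matter how long the string is.
        if s.isascii():
            return s
        # Non-ASCII input falls back to the existing normalisation.
        return unicodedata.normalize(form, s)

With such a guard, normalising the 7000-character ASCII string in the timing above should cost about the same as normalising the 7-character one, since neither call would need to scan the string at all.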