Ezio Melotti added the comment:

Given that high surrogates are U+D800..U+DBFF and low ones are U+DC00..U+DFFF, '([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means "a low surrogate, preceded by either a character that is not a high surrogate or the beginning of the string, and followed by either a character that is not a low surrogate or the end of the string".
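To make the reading of that pattern concrete, here is a minimal sketch that runs it against a few hand-picked strings. The names CURRENT_REGEX, has_lone_low_surrogate, samples and results are illustrative only and do not come from the patch. The pattern is written as a raw string (an assumption of this sketch, not necessarily how the stdlib spells it) so that \A, \Z and the \uXXXX escapes are interpreted by the re module.

```python
import re

# Pattern quoted in the comment above, as a raw string so the re module
# interprets \A, \Z and the \uXXXX escapes itself.
CURRENT_REGEX = r'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)'
has_lone_low_surrogate = re.compile(CURRENT_REGEX).search

# Hypothetical sample strings, chosen to exercise each branch of the pattern.
samples = {
    'plain ASCII': 'abc',
    'lone low surrogate': 'a\udc80b',
    'low surrogate at start': '\udc80b',
    'high followed by low': 'a\ud800\udc80b',
    'two low surrogates in a row': 'a\udc80\udc81b',
}

results = {name: bool(has_lone_low_surrogate(s)) for name, s in samples.items()}

# Only the labels and booleans are printed; the surrogates themselves
# cannot be encoded to most terminals.
for name, matched in results.items():
    print(f'{name}: {matched}')
```

The "high followed by low" case does not match because the low surrogate is preceded by a high one. The "two low surrogates in a row" case still matches, on the second of the two.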
PEP 383 says "With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF".

If I change the regex to

    _has_surrogates = re.compile('[\udc80-\udcff]').search

the tests still pass but there's no improvement in startup time (note: the previous regex also matched all the surrogates in this range; however, I'm not sure how well this is tested).

If I change the implementation to

    _pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))

    def _has_surrogates(s):
        return any(c in _pep383_surrogates for c in s)

the tests still pass and the startup is ~15ms faster here:

    $ time ./python -m issue11454_imp2
    [68837 refs]

    real    0m0.305s
    user    0m0.288s
    sys     0m0.012s

However, using this function instead of the regex is ~10x slower at runtime. Using the shorter regex is ~7x faster than the current one, but it does not improve the startup time.

Assuming the shorter regex is correct, it can still be called inside a function or used with functools.partial, so that it is not compiled at import time. This results in an improved startup time and a ~2x improvement at runtime (so it's a win-win). See the attached patch for benchmarks.

This is a sample result:

     17.01 usec/pass <- re.compile(current_regex).search
      2.20 usec/pass <- re.compile(short_regex).search
    148.18 usec/pass <- return any(c in surrogates for c in s)
    106.35 usec/pass <- for c in s: if c in surrogates: return True
      8.40 usec/pass <- return re.search(short_regex, s)
      8.20 usec/pass <- functools.partial(re.search, short_regex)

----------
Added file: http://bugs.python.org/file27203/issue11454_surr1.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11454>
_______________________________________