Ezio Melotti added the comment:

Given that high surrogates are U+D800..U+DBFF, and low ones are U+DC00..U+DFFF, 
'([^\ud800-\udbff]|\A)[\udc00-\udfff]([^\udc00-\udfff]|\Z)' means "a low 
surrogate, preceded by either a character that is not a high surrogate or the 
start of the string, and followed by either a character that is not a low 
surrogate or the end of the string".
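
A minimal sketch for checking how this pattern behaves on a few strings. The 
sample strings are illustrative assumptions, not taken from the test suite:

```python
import re

# The pattern quoted above. \\A and \\Z are doubled so that they reach the
# re module intact instead of being read as string-literal escapes.
current_regex = re.compile(
    '([^\ud800-\udbff]|\\A)[\udc00-\udfff]([^\udc00-\udfff]|\\Z)')

lone = 'a\udc80b'        # low surrogate between two ordinary characters
paired = '\ud800\udc00'  # low surrogate directly after a high surrogate
plain = 'abc'            # no surrogates at all

for label, s in (('lone', lone), ('paired', paired), ('plain', plain)):
    print(label, bool(current_regex.search(s)))
```

Only the first string is reported as containing a lone low surrogate.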

PEP 383 says "With this PEP, non-decodable bytes >= 128 will be represented as 
lone surrogate codes U+DC80..U+DCFF".
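
To see where that range comes from, here is a small sketch using the 
'surrogateescape' error handler. The byte string is a made-up example:

```python
# Bytes >= 128 cannot be decoded as ASCII; with 'surrogateescape' each such
# byte 0xNN is mapped to the lone surrogate U+DC00 + 0xNN.
raw = b'caf\xe9 \xff'
text = raw.decode('ascii', 'surrogateescape')

print([hex(ord(c)) for c in text if ord(c) >= 0x80])

# Encoding with the same handler restores the original bytes.
restored = text.encode('ascii', 'surrogateescape')
```

Since only bytes 0x80..0xFF are escaped, the resulting code points always fall 
in U+DC80..U+DCFF.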

If I change the regex to _has_surrogates = 
re.compile('[\udc80-\udcff]').search, the tests still pass but there's no 
improvement on startup time (note: the previous regex was matching all the 
surrogates in this range too, however I'm not sure how well this is tested).

If I change the implementation to
_pep383_surrogates = set(map(chr, range(0xDC80, 0xDCFF+1)))
def _has_surrogates(s):
    return any(c in _pep383_surrogates for c in s)

the tests still pass and the startup is ~15ms faster here:

$ time ./python -m issue11454_imp2
[68837 refs]

real    0m0.305s
user    0m0.288s
sys     0m0.012s

However, using this function instead of the regex is ~10x slower at runtime.  
Using the shorter regex is ~7x faster, but there is no improvement in startup 
time.
Assuming the shorter regex is correct, it can still be called inside a function 
or used with functools.partial.  This results in an improved startup time 
and a ~2x improvement at runtime (so it's a win-win).
See attached patch for benchmarks.

This is a sample result:
 17.01 usec/pass  <- re.compile(current_regex).search
  2.20 usec/pass  <- re.compile(short_regex).search
148.18 usec/pass  <- return any(c in surrogates for c in s)
106.35 usec/pass  <- for c in s: if c in surrogates: return True
  8.40 usec/pass  <- return re.search(short_regex, s)
  8.20 usec/pass  <- functools.partial(re.search, short_regex)
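
For reference, a timing harness along these lines could look as follows. This 
is a sketch and not the attached issue11454_surr1.py: the sample string, the 
helper names and the number of passes are assumptions, and the lambda wrapper 
adds a small constant overhead to every measurement.

```python
import functools
import re
import timeit

current_regex = '([^\ud800-\udbff]|\\A)[\udc00-\udfff]([^\udc00-\udfff]|\\Z)'
short_regex = '[\udc80-\udcff]'
surrogates = set(map(chr, range(0xDC80, 0xDCFF + 1)))

def any_in_set(s):
    return any(c in surrogates for c in s)

def loop_in_set(s):
    for c in s:
        if c in surrogates:
            return True
    return False

def search_in_function(s):
    return re.search(short_regex, s)

candidates = {
    're.compile(current_regex).search': re.compile(current_regex).search,
    're.compile(short_regex).search': re.compile(short_regex).search,
    'return any(c in surrogates for c in s)': any_in_set,
    'for c in s: if c in surrogates: return True': loop_in_set,
    'return re.search(short_regex, s)': search_in_function,
    'functools.partial(re.search, short_regex)':
        functools.partial(re.search, short_regex),
}

# Hypothetical input: a run of ASCII text ending in one escaped byte.
sample = 'x' * 200 + '\udc80'

def benchmark(number=1000):
    """Return the average time per call in microseconds for each candidate."""
    results = {}
    for name, func in candidates.items():
        total = timeit.timeit(lambda: func(sample), number=number)
        results[name] = total / number * 1e6
    return results

if __name__ == '__main__':
    for name, usec in benchmark().items():
        print('%7.2f usec/pass  <- %s' % (usec, name))
```

The absolute numbers depend on the machine and on the input string, so only the 
relative ordering is meaningful.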

----------
Added file: http://bugs.python.org/file27203/issue11454_surr1.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11454>
_______________________________________