[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Antoine Pitrou Sat, 13 Aug 2011 14:09:58 -0700

Antoine Pitrou <[email protected]> added the comment:

> There are occasions when you want to do string slicing, often of the form:
> 
> pos = my_str.index(x)
> endpos = my_str.index(y)
> substring = my_str[pos : endpos]
> 
> To me that suggests that if UTF-8 is used then it may be worth
> profiling to see whether caching the last 2 positions would be
> beneficial.


And/or a lookup table giving the byte offset of, say, every 16th
character. It gives you a O(1) lookup with a relatively reasonable
constant cost (you have to scan for less than 16 characters after the
lookup).

On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
table would be 1/16. It could also be constructed lazily whenever more
than 2 positions are cached.

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug

Reply via email to