On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
> Steven wrote:
> > With a UTF-8 implementation, won't that mean that string indexing
> > operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
> > byte(s) to look at to get the character at index 42 without having to
> > walk the string from the start?
>
> With an indexing cache you do this by looking it up in the cache. [...]
> So there's a O(N) per-string cost if the string is indexed at least once
> and an O(k) cost per attempt to index into the string.
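For concreteness, here is roughly the scheme I understand Oscar to be
describing (a toy sketch only: the class, the checkpoint interval k, and
the lazy build are my own guesses at the details, not anything PyPy has
settled on):

    # Illustrative sketch of an indexing cache over a UTF-8 buffer:
    # record the byte offset of every k-th code point, lazily, on the
    # first indexing operation; then indexing jumps to the nearest
    # checkpoint and walks forward at most k - 1 code points.
    class UTF8String(object):
        def __init__(self, data, k=32):
            self._data = data         # UTF-8 encoded bytes
            self._k = k               # checkpoint spacing, in code points
            self._checkpoints = None  # built lazily on first index

        def _build_checkpoints(self):
            # The O(N) part: one pass over the bytes, recording the
            # byte offset of every k-th code point.
            offsets = [0]
            count = 0
            for i, byte in enumerate(self._data):
                if byte & 0xC0 != 0x80:      # a code point starts here
                    if count and count % self._k == 0:
                        offsets.append(i)
                    count += 1
            self._checkpoints = offsets

        def __getitem__(self, index):
            if self._checkpoints is None:
                self._build_checkpoints()
            # The O(k) part: start from the checkpoint at or before
            # index, then skip forward index % k code points.
            pos = self._checkpoints[index // self._k]
            for _ in range(index % self._k):
                pos += 1
                while self._data[pos] & 0xC0 == 0x80:
                    pos += 1            # skip continuation bytes
            end = pos + 1
            while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
                end += 1
            return self._data[pos:end].decode('utf-8')

    >>> s = UTF8String(u'héllo wörld'.encode('utf-8'), k=4)
    >>> s[7]
    'ö'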
Plus a memory cost for the cache, and increased complexity for indexing
operations.

> OTOH using utf-8 internally means that encoding and decoding as utf-8
> becomes O(1) instead of O(N) and uses less memory since if you do
> s=b.decode('utf-8') then s and b can share the same memory buffer.

CPython caches the UTF-8 encoding of (some? all?) strings too. It's not
clear to me whether they are cached lazily or eagerly, but either way, I'm
not convinced by this argument that caching the UTF-8 encoding is useful.

In most applications, you'll decode from UTF-8 exactly once, when you read
the string in (from a file, or the network, say), and you'll encode it at
most once, when you write it out again. Many strings won't ever be encoded
to UTF-8. I have difficulty imagining any application that will need to
repeatedly encode the same string to UTF-8, where the cost of encoding is
more significant than the cost of I/O.

One possible exception might be printing to the terminal, say at the REPL,
but even then, I don't imagine the time savings are worthwhile. I haven't
noticed the REPL in Python 3.3 being particularly more responsive than the
one in 3.2. If there is a speed-up, it's invisible to the user, but it
comes at the cost of extra memory for the UTF-8 cache.

> The FSR does this for latin-1 strings but utf-8 strings are more useful.
> So if you're more likely to decode/encode from/to utf-8 than you are to
> index into a long string that's a big saving.

But I don't believe that is the case for any realistic use-case. Just
about every string operation involves walking the string, and that walk
has to be over characters (code points), not UTF-8 bytes. At the Python
layer, string indexing might be relatively uncommon, but at the
implementation layer, I think every string operation relies on it.

But having said all this, I know that using UTF-8 internally for strings
is quite common (e.g. Haskell does it, without even an index cache, and
documents that indexing operations can be slow). CPython's FSR has
received much (in my opinion, entirely uninformed) criticism from one
vocal person in particular for not using UTF-8 internally. If PyPy goes
ahead with using UTF-8 internally, I look forward to comparing memory and
time benchmarks of string operations between CPython and PyPy.
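In the meantime, anyone who wants to see the FSR's side of the memory
story can check it from pure Python. A quick, purely illustrative snippet
(exact byte counts vary across CPython versions and builds):

    # CPython's FSR (PEP 393) stores a string using 1, 2 or 4 bytes per
    # code point, depending on the widest character it contains.
    import sys

    samples = [
        ('ascii',   u'a' * 1000),           # 1 byte per code point
        ('latin-1', u'\xe9' * 1000),        # still 1 byte per code point
        ('bmp',     u'\u4e2d' * 1000),      # 2 bytes per code point
        ('astral',  u'\U0001F600' * 1000),  # 4 bytes per code point
    ]
    for name, s in samples:
        print('%-8s  len=%d  size=%d bytes'
              % (name, len(s), sys.getsizeof(s)))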
-- 
Steven

_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev