On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
> Steven wrote:
> > With a UTF-8 implementation, won't that mean that string indexing
> > operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
> > byte(s) to look at to get the character at index 42 without having to
> > walk the string from the start?
>
> With an indexing cache you do this by looking it up in the cache. [...]
> So there's a O(N) per-string cost if the string is indexed at least once
> and an O(k) cost per attempt to index into the string.
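For concreteness, here is roughly the scheme I understand Oscar to be
describing (a toy sketch only: the class, the checkpoint interval k, and
the lazy build are my own guesses at the details, not anything PyPy has
settled on):

    # Illustrative sketch of an indexing cache over a UTF-8 buffer:
    # record the byte offset of every k-th code point, lazily, on the
    # first indexing operation; then indexing jumps to the nearest
    # checkpoint and walks forward at most k - 1 code points.
    class UTF8String(object):
        def __init__(self, data, k=32):
            self._data = data         # UTF-8 encoded bytes
            self._k = k               # checkpoint spacing, in code points
            self._checkpoints = None  # built lazily on first index

        def _build_checkpoints(self):
            # The O(N) part: one pass over the bytes, recording the
            # byte offset of every k-th code point.
            offsets = [0]
            count = 0
            for i, byte in enumerate(self._data):
                if byte & 0xC0 != 0x80:      # a code point starts here
                    if count and count % self._k == 0:
                        offsets.append(i)
                    count += 1
            self._checkpoints = offsets

        def __getitem__(self, index):
            if self._checkpoints is None:
                self._build_checkpoints()
            # The O(k) part: start from the checkpoint at or before
            # index, then skip forward index % k code points.
            pos = self._checkpoints[index // self._k]
            for _ in range(index % self._k):
                pos += 1
                while self._data[pos] & 0xC0 == 0x80:
                    pos += 1            # skip continuation bytes
            end = pos + 1
            while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
                end += 1
            return self._data[pos:end].decode('utf-8')

    >>> s = UTF8String(u'héllo wörld'.encode('utf-8'), k=4)
    >>> s[7]
    'ö'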
Plus a memory cost for the cache, and increased complexity for indexing
operations.

> OTOH using utf-8 internally means that encoding and decoding as utf-8
> becomes O(1) instead of O(N) and uses less memory since if you do
> s=b.decode('utf-8') then s and b can share the same memory buffer.

CPython caches the UTF-8 encoding of (some? all?) strings too. It's not
clear to me whether they are cached lazily or eagerly, but either way, I'm
not convinced by this argument that caching the UTF-8 encoding is useful.

In most applications, you'll decode from UTF-8 exactly once, when you read
the string in (from a file, or the network, say), and you'll encode it at
most once, when you write it out again. Many strings won't ever be encoded
to UTF-8. I have difficulty imagining any application that will need to
repeatedly encode the same string to UTF-8, where the cost of encoding is
more significant than the cost of I/O.

One possible exception might be printing to the terminal, say at the REPL,
but even then, I don't imagine the time savings are worthwhile. I haven't
noticed the REPL in Python 3.3 being particularly more responsive than the
one in 3.2. If there is a speed-up, it's invisible to the user, but it
comes at the cost of extra memory for the UTF-8 cache.

> The FSR does this for latin-1 strings but utf-8 strings are more useful.
> So if you're more likely to decode/encode from/to utf-8 than you are to
> index into a long string that's a big saving.

But I don't believe that is the case for any realistic use-case. Just
about every string operation involves walking the string, and that walk
has to be over characters (code points), not UTF-8 bytes. At the Python
layer, string indexing might be relatively uncommon, but at the
implementation layer, I think every string operation relies on it.

But having said all this, I know that using UTF-8 internally for strings
is quite common (e.g. Haskell does it, without even an index cache, and
documents that indexing operations can be slow). CPython's FSR has
received much (in my opinion, entirely uninformed) criticism from one
vocal person in particular for not using UTF-8 internally. If PyPy goes
ahead with using UTF-8 internally, I look forward to comparing memory and
time benchmarks of string operations between CPython and PyPy.
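In the meantime, anyone who wants to see the FSR's side of the memory
story can check it from pure Python. A quick, purely illustrative snippet
(exact byte counts vary across CPython versions and builds):

    # CPython's FSR (PEP 393) stores a string using 1, 2 or 4 bytes per
    # code point, depending on the widest character it contains.
    import sys

    samples = [
        ('ascii',   u'a' * 1000),           # 1 byte per code point
        ('latin-1', u'\xe9' * 1000),        # still 1 byte per code point
        ('bmp',     u'\u4e2d' * 1000),      # 2 bytes per code point
        ('astral',  u'\U0001F600' * 1000),  # 4 bytes per code point
    ]
    for name, s in samples:
        print('%-8s  len=%d  size=%d bytes'
              % (name, len(s), sys.getsizeof(s)))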
-- 
Steven

_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev