On 23/1/2014 10:54 PM, Steven D'Aprano wrote:
On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:

Steven wrote:
With a UTF-8 implementation, won't that mean that string indexing
operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
byte(s) to look at to get the character at index 42 without having to
walk the string from the start?

With an indexing cache you do this by looking it up in the cache. [...]
So there's an O(N) per-string cost if the string is
indexed at least once and an O(k) cost per attempt to index into the string.

Plus a memory cost for the cache, and increased complexity for indexing
operations.
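
For concreteness, here is a toy sketch (mine, Python 3 syntax; not PyPy's
design and not exactly what Oscar proposes) of the kind of indexing cache
being discussed: record the byte offset of every k-th code point, so that
indexing jumps to the nearest checkpoint and walks at most k-1 characters
from there:

    class UTF8String:
        # Toy UTF-8 string with a lazily built offset cache (sketch only).

        def __init__(self, data, k=64):
            self._buf = data        # valid UTF-8 bytes
            self._k = k             # checkpoint spacing, in code points
            self._offsets = None    # built on first index: the O(N) cost

        @staticmethod
        def _char_width(lead):
            # number of bytes in a UTF-8 sequence, from its lead byte
            return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4

        def _build_cache(self):
            offsets, pos, count = [0], 0, 0
            while pos < len(self._buf):
                pos += self._char_width(self._buf[pos])
                count += 1
                if count % self._k == 0:
                    offsets.append(pos)
            self._offsets = offsets

        def __getitem__(self, i):
            if self._offsets is None:
                self._build_cache()              # O(N), once per string
            pos = self._offsets[i // self._k]    # jump to nearest checkpoint
            for _ in range(i % self._k):         # walk at most k-1 characters
                pos += self._char_width(self._buf[pos])
            width = self._char_width(self._buf[pos])
            return self._buf[pos:pos + width].decode('utf-8')

    s = UTF8String("αβγδε".encode('utf-8'))
    assert s[3] == "δ"

That makes the trade-off explicit: the cache costs extra memory plus one
O(N) pass per indexed string, in exchange for cheap indexing afterwards.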


OTOH using utf-8 internally means that encoding and decoding as utf-8 becomes
O(1) instead of O(N), and it uses less memory, since if you do s = b.decode('utf-8')
then s and b can share the same memory buffer.

CPython caches the UTF-8 encoding of (some? all?) strings too. It's not
clear to me whether they are cached lazily or eagerly, but either way, I'm
not convinced by this argument that the cache is useful.

In most applications, you'll decode from UTF-8 exactly once, when you
read the string in (from a file, or the network, say), and you'll encode
it at most once, when you write it out again. Many strings won't ever be
encoded to UTF-8. I have difficulty imagining any application which will
need to repeatedly encode the same string to UTF-8, where the cost
of encoding is more significant than the cost of I/O. One possible
exception might be printing to the terminal, say at the REPL, but even
then, I don't imagine the time savings are worthwhile. I haven't
noticed the REPL in Python 3.3 being particularly more responsive than
that in 3.2. If there's a speed-up, it's invisible to the user, but at
the cost of using extra memory for the UTF-8 cache.


The FSR does this for latin-1
strings, but utf-8 strings are more useful. So if you're more likely to
decode/encode from/to utf-8 than you are to index into a long string, that's a
big saving.

But I don't believe that is the case for any realistic use-case. Just
about every string operation involves walking the string, and that walk
has to be done in characters (code points), not UTF-8 bytes.
At the Python layer, string indexing might be relatively uncommon, but
at the implementation layer, I think every string operation relies on
it.
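
Concretely, the Python-level API is defined in terms of code points rather
than bytes, so any UTF-8 representation has to bridge that gap somewhere.
A trivial illustration:

    s = "αβγ"                           # three code points
    assert len(s) == 3                  # length is counted in code points
    assert len(s.encode('utf-8')) == 6  # but the UTF-8 form is six bytes
    assert s[1] == "β"                  # indexing is by code point as well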

But having said all this, I know that using UTF-8 internally for strings
is quite common (e.g. Haskell does it, without even an index cache, and
documents that indexing operations can be slow). CPython's FSR has
received much (in my opinion, entirely uninformed) criticism from one
vocal person in particular for not using UTF-8 internally. If PyPy goes
ahead with using UTF-8 internally, I look forward to comparing memory
and time benchmarks of string operations between CPython and PyPy.


I have to admit that, due to my work (databases and data processing), I'm biased towards I/O (where UTF-8 wins on size) rather than CPU.

At least in my use cases, the most frequent operations I perform on strings are: read, write, store, use as keys in dicts, concatenate, and split.

For most of the above (with the possible exception of split), an index cache would not be needed, and UTF-8, due to its smaller size, would be faster than wide Unicode encodings.
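
As a quick sanity check of that claim (plain CPython here, and of course
nothing to do with how PyPy would implement things internally), the
operations above can be done directly on the UTF-8 bytes without any
character indexing, because equal strings have equal encodings and an
ASCII delimiter can never appear inside a multi-byte sequence:

    text = "αλφα,βήτα,γάμμα"            # arbitrary example data
    data = text.encode('utf-8')

    # split: splitting the bytes on the delimiter gives the same pieces
    # as splitting the decoded string
    assert [p.decode('utf-8') for p in data.split(b',')] == text.split(',')

    # concatenate: joining the byte buffers equals encoding the
    # concatenated strings
    assert data + data == (text + text).encode('utf-8')

    # dict keys: equal strings have equal UTF-8 encodings, so the bytes
    # hash and compare just as well
    d = {data: 1}
    assert d[text.encode('utf-8')] == 1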

l.



