On 23 January 2014 20:54, Steven D'Aprano <st...@pearwood.info> wrote:
> On Thu, Jan 23, 2014 at 01:27:50PM +0000, Oscar Benjamin wrote:
>
>> Steven wrote:
>> > With a UTF-8 implementation, won't that mean that string indexing
>> > operations are O(N) rather than O(1)? E.g. how do you know which UTF-8
>> > byte(s) to look at to get the character at index 42 without having to
>> > walk the string from the start?
>>
>> With an indexing cache you do this by looking it up in the cache. [...]
>> So there's an O(N) per-string cost if the string is indexed at least
>> once and an O(k) cost per attempt to index into the string.
>
> Plus a memory cost for the cache, and increased complexity for indexing
> operations.

There is an O(n/k) memory cost for the cache, yes, and each index
operation has an O(k) CPU cost.
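To make that trade-off concrete, here's a toy sketch in pure Python of the
kind of cache I mean. The class name and the choice k=32 are made up for
illustration, and a real implementation would do this in C directly on the
byte buffer rather than in Python:

    class UTF8String:
        """Toy utf-8 string with an index cache holding the byte offset
        of every k-th code point: O(n/k) extra memory, O(k) per index."""

        def __init__(self, data, k=32):
            self._data = data       # utf-8 encoded bytes
            self._k = k
            self._offsets = None    # cache, built lazily on first index

        def _build_cache(self):
            # The one-off O(N) walk: _offsets[j] is the byte offset of
            # code point number j*k. Lead bytes are the bytes that don't
            # match 0b10xxxxxx (continuation bytes do).
            self._offsets = []
            count = 0
            for pos, byte in enumerate(self._data):
                if byte & 0xC0 != 0x80:
                    if count % self._k == 0:
                        self._offsets.append(pos)
                    count += 1

        def __getitem__(self, index):
            if self._offsets is None:
                self._build_cache()
            # Jump to the nearest cached offset, then walk forward at
            # most k-1 characters.
            pos = self._offsets[index // self._k]
            for _ in range(index % self._k):
                pos += 1
                while self._data[pos] & 0xC0 == 0x80:
                    pos += 1
            end = pos + 1
            while end < len(self._data) and self._data[end] & 0xC0 == 0x80:
                end += 1
            return self._data[pos:end].decode('utf-8')

    s = UTF8String('héllo wörld'.encode('utf-8'))
    assert s[1] == 'é' and s[7] == 'ö'

Growing k shrinks the cache and slows each lookup, so k is the knob that
trades memory against CPU.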
>> OTOH using utf-8 internally means that encoding and decoding as utf-8
>> becomes O(1) instead of O(N) and uses less memory since if you do
>> s = b.decode('utf-8') then s and b can share the same memory buffer.
>
> CPython caches the UTF-8 encoding of (some? all?) strings too. It's not
> clear to me if they are cached lazily or eagerly, but either way, I'm
> not convinced by this argument that this is useful.

There's a difference between caching the utf-8 bytes and reusing them. If
you cache the utf-8 bytes then you're holding two memory buffers: one in
FSR format and one in utf-8 format. If your implementation uses utf-8
internally then one buffer can serve as both the str and the corresponding
bytes object.

> In most applications, you'll decode from UTF-8 exactly once, when you
> read the string in (from a file, or the network, say), and you'll encode
> it at most once, when you write it out again. Many strings won't ever be
> encoded to UTF-8.

That's true, but the notion of encoding/decoding is a conceptual error at
the implementation level: really all you ever do is re-encode from one
byte format to another. So why not choose the most common input/output
format as the standard internal format, if that saves conversion time?

> I have difficulty imagining any application which will
> need to repeatedly encode the same string to UTF-8, where the cost
> of encoding is more significant than the cost of I/O.

It's not about repeatedly encoding the same string. It's about the fact
that you always need to encode or decode the string at least once. If
that's the case then minimising the cost of that one guaranteed
encode/decode operation is a good thing.
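To spell out what "one buffer for both" could look like, here's a minimal
sketch. The names are hypothetical and this is not how CPython works; note
also that decode below still scans the bytes once to validate and count
characters, but it allocates no second buffer, and encode becomes
genuinely O(1):

    class Utf8Str:
        def __init__(self, buf, nchars):
            self._buf = buf         # shared, immutable utf-8 bytes
            self._nchars = nchars   # number of code points, counted once

        @classmethod
        def decode(cls, buf):
            buf.decode('utf-8')     # validation only (reuses the stdlib
                                    # validator); the result is discarded
            nchars = sum(b & 0xC0 != 0x80 for b in buf)  # count lead bytes
            return cls(buf, nchars)

        def encode(self):
            return self._buf        # O(1) and zero-copy: the same buffer

        def __len__(self):
            return self._nchars     # O(1), thanks to the count in decode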
<snip>
>
>> The FSR does this for latin-1 strings but utf-8 strings are more
>> useful. So if you're more likely to decode/encode from/to utf-8 than
>> you are to index into a long string that's a big saving.
>
> But I don't believe that is the case for any realistic use-case. Just
> about every string operation involves walking the string, which
> needs to walk the string in characters (code points), not UTF-8 bytes.
> At the Python layer, string indexing might be relatively uncommon, but
> at the implementation layer, I think every string operation relies on
> it.

Okay, so here's a review of the Python str methods broken down into the
relevant categories. The first group is all the methods that can operate
on utf-8 strings by treating them as arbitrary byte strings: for example,
concatenating two utf-8 strings is equivalent to concatenating the
corresponding abstract unicode character strings. I assume here that the
number of characters (not bytes) in the string is already known where
needed (e.g. for ljust):

    join, ljust, partition, replace, rpartition, rjust, count, endswith,
    expandtabs, format, format_map, startswith, zfill, __add__,
    __contains__, __eq__, __ge__, __gt__, __hash__, __le__, __len__,
    __lt__, __mod__, __mul__, __ne__, __rmul__, __rmod__

The second group is the methods that must understand that the string is
encoded in utf-8 but can walk through it sequentially, without ever
jumping to an index in the middle of the string:

    isalnum, isalpha, isdecimal, isdigit, isidentifier, islower,
    isnumeric, isprintable, isspace, istitle, isupper, lstrip, lower,
    maketrans, casefold, center, encode, rsplit, rstrip, split,
    splitlines, strip, swapcase, title, translate, upper, __format__,
    __iter__

Finally, here are the methods that need to be able to jump to an
arbitrary character index within the string:

    find, index, rfind, rindex, __getitem__

In my own use I think I most commonly use (in)equality, concatenation,
interpolation and formatting, rather than indexing or slicing
(__getitem__). None of those operations requires the index cache.

> But having said all this, I know that using UTF-8 internally for strings
> is quite common (e.g. Haskell does it, without even an index cache, and
> documents that indexing operations can be slow).

I find that Haskell is generally a bit slow with strings, utf-8 or
otherwise. Haskell naturally lends itself to stream-style processing, so
it's understandable that they would see efficient indexing as less
important. Using linked lists for strings is probably never going to be
as efficient as a well-implemented atomic string type.

> CPython's FSR has
> received much (in my opinion, entirely uninformed) criticism from one
> vocal person in particular for not using UTF-8 internally.

I feel like you're putting a jinx on this list. Please don't encourage
the "vocal person" to come here!

Oscar
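P.S. For the first category, the claim that the byte-level operation
agrees with the character-level one is easy to sanity-check at the
interactive prompt, e.g. for concatenation:

    >>> a, b = 'héllo', 'wörld'
    >>> (a + b).encode('utf-8') == a.encode('utf-8') + b.encode('utf-8')
    True

Concatenating the utf-8 bytes gives exactly the utf-8 encoding of the
concatenated string, so __add__ never needs to look inside the bytes.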