On Thu, Jan 23, 2014 at 10:45:25PM +0200, Elefterios Stamatogiannakis wrote:
> > But having said all this, I know that using UTF-8 internally for strings
> > is quite common (e.g. Haskell does it, without even an index cache, and
> > documents that indexing operations can be slow). CPython's FSR has
> > received much (in my opinion, entirely uninformed) criticism from one
> > vocal person in particular for not using UTF-8 internally. If PyPy goes
> > ahead with using UTF-8 internally, I look forward to comparing memory
> > and time benchmarks of string operations between CPython and PyPy.
>
> I have to admit that due to my work (databases and data processing),
> I'm biased towards I/O (UTF-8 is better due to size) rather than
> CPU.
>
> At least from my use cases, the most frequent operations that I do
> on strings are read, write, store, use them as keys in dicts,
> concatenate and split.
>
> For most of the above things (with the exception of split, maybe?), an
> index cache would not be needed, and UTF-8, due to its smaller size,
> would be faster than wide unicode encodings.
I hear Steven's points, but my experience matches Elefterios': smaller
data is faster [1]. I'll also note that although many string processing
algorithms can be written in terms of indexing, many (most?) are actually
stream processing algorithms which do not actually need efficient
character offset to/from byte offset calculations. For example, split
works by walking the entire string in a single pass, outputting
substrings as it goes.

regards,

njh

[1] Which suggests that lz77ing longer strings by default is not a
terrible idea.

_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
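To make the single-pass point concrete, here is a minimal Python sketch
(not PyPy's or CPython's actual implementation; `utf8_split` is a
hypothetical helper) of split over UTF-8-encoded bytes. Because UTF-8 is
self-synchronizing and every byte of a multi-byte sequence has its high
bit set, an ASCII delimiter byte can never occur inside an encoded
character, so the scan works purely on byte offsets, with no
character-to-byte offset conversion anywhere:

```python
def utf8_split(data: bytes, sep: int = 0x20) -> list:
    """Single-pass split of a UTF-8 byte string on one ASCII delimiter byte.

    Walks the buffer once, tracking only byte offsets, and emits the
    substrings between delimiters -- no index cache or character counting.
    """
    parts = []
    start = 0
    for i, b in enumerate(data):
        if b == sep:           # safe: an ASCII byte never appears mid-character
            parts.append(data[start:i])
            start = i + 1
    parts.append(data[start:])  # trailing segment after the last delimiter
    return parts

# Multi-byte characters pass through untouched:
pieces = utf8_split("héllo wörld".encode("utf-8"))
# each piece is still valid UTF-8 and decodes cleanly
decoded = [p.decode("utf-8") for p in pieces]
```

The same walk-once shape covers concatenation and streaming I/O as well,
which is why those operations get the size benefit of UTF-8 without ever
paying for indexing.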