On Thu, Jan 23, 2014 at 10:45:25PM +0200, Elefterios Stamatogiannakis wrote:
> > But having said all this, I know that using UTF-8 internally for strings
> > is quite common (e.g. Haskell does it, without even an index cache, and
> > documents that indexing operations can be slow). CPython's FSR has
> > received much (in my opinion, entirely uninformed) criticism from one
> > vocal person in particular for not using UTF-8 internally. If PyPy goes
> > ahead with using UTF-8 internally, I look forward to comparing memory
> > and time benchmarks of string operations between CPython and PyPy.
>
> I have to admit that due to my work (databases and data processing),
> I'm biased towards I/O (UTF-8 is better due to size) rather than
> CPU.
>
> At least from my use cases, the most frequent operations that I do
> on strings are read, write, store, use them as keys in dicts,
> concatenate and split.
>
> For most of the above things (with the exception of split, maybe?), an
> index cache would not be needed, and UTF-8, due to its smaller size,
> would be faster than wide unicode encodings.
I hear Steven's points, but my experience matches Elefterios': smaller
data is faster [1]. I'll also note that although many string processing
algorithms can be written in terms of indexing, many (most?) are actually
stream processing algorithms which do not actually need efficient
character offset to/from byte offset calculations. For example, split
works by walking the entire string in a single pass, outputting
substrings as it goes.

regards,

njh

[1] Which suggests that lz77ing longer strings by default is not a
terrible idea.

_______________________________________________
pypy-dev mailing list
pypy-dev@python.org
https://mail.python.org/mailman/listinfo/pypy-dev
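To make the single-pass point concrete, here is a minimal Python sketch
(not PyPy's or CPython's actual implementation; `utf8_split` is a
hypothetical helper) of split over UTF-8-encoded bytes. Because UTF-8 is
self-synchronizing and every byte of a multi-byte sequence has its high
bit set, an ASCII delimiter byte can never occur inside an encoded
character, so the scan works purely on byte offsets, with no
character-to-byte offset conversion anywhere:

```python
def utf8_split(data: bytes, sep: int = 0x20) -> list:
    """Single-pass split of a UTF-8 byte string on one ASCII delimiter byte.

    Walks the buffer once, tracking only byte offsets, and emits the
    substrings between delimiters -- no index cache or character counting.
    """
    parts = []
    start = 0
    for i, b in enumerate(data):
        if b == sep:           # safe: an ASCII byte never appears mid-character
            parts.append(data[start:i])
            start = i + 1
    parts.append(data[start:])  # trailing segment after the last delimiter
    return parts

# Multi-byte characters pass through untouched:
pieces = utf8_split("héllo wörld".encode("utf-8"))
# each piece is still valid UTF-8 and decodes cleanly
decoded = [p.decode("utf-8") for p in pieces]
```

The same walk-once shape covers concatenation and streaming I/O as well,
which is why those operations get the size benefit of UTF-8 without ever
paying for indexing.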