On 25/08/2012 10:46, Frank Millman wrote:
On 25/08/2012 10:58, Mark Lawrence wrote:
On 25/08/2012 08:27, wxjmfa...@gmail.com wrote:

Unicode design: a flat table of code points, where all code
points are "equal".
As soon as one attempts to escape from this rule, one has to
"pay" for it.
The creator of this machinery (flexible string representation)
can not even benefit from it in his native language (I think
I'm correctly informed).

Hint: Google -> "Das grosse Eszett"

jmf


It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
still baffled as to the point, if any.  Could someone please enlighten me?


Here's what I think he is saying. I am posting this to test the water. I
am also confused, and if I have got it wrong hopefully someone will
correct me.

In Python 3.3, unicode strings are now stored as follows -
   if all characters can be represented by 1 byte, the entire string is
composed of 1-byte characters
   else if all characters can be represented by 1 or 2 bytes, the entire
string is composed of 2-byte characters
   else the entire string is composed of 4-byte characters
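
The three cases above can be checked with a rough sketch like the one below. It assumes CPython 3.3 or later (other implementations may store strings differently), and `bytes_per_char` is a hypothetical helper written for this illustration, not part of any library. It estimates the per-character storage by comparing the sizes of two strings built from the same sample, so the fixed object header cancels out.

```python
import sys


def bytes_per_char(sample: str, n: int = 1000) -> int:
    """Estimate how many bytes each character of `sample` occupies.

    Hypothetical helper: it compares sys.getsizeof() for two repetitions
    of the same sample so that the fixed per-object overhead cancels out.
    Assumes CPython 3.3+ (flexible string representation).
    """
    short = sample * n
    long_ = sample * (2 * n)
    extra_chars = len(long_) - len(short)
    return (sys.getsizeof(long_) - sys.getsizeof(short)) // extra_chars


samples = {
    "ASCII only": "a",
    "Latin-1 (U+00E9)": "\u00e9",
    "BMP (U+20AC, euro sign)": "\u20ac",
    "non-BMP (U+1F600)": "\U0001F600",
    "ASCII plus one euro sign": "abc\u20ac",
}

for label, sample in samples.items():
    print(f"{label:26} -> {bytes_per_char(sample)} byte(s) per character")
```

On a CPython build of that kind I would expect widths of 1, 1, 2, 4 and 2 respectively. The last sample shows that a single wider character makes the whole string use the wider storage.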

There is an overhead in making this choice, to detect the lowest number
of bytes required.

jmfauth believes that this only benefits 'English-speaking' users, as
the rest of the world will tend to have strings where at least one
character requires 2 or 4 bytes. So they incur the overhead, without
getting any benefit.

Therefore, I think he is saying that he would have preferred that Python
standardise on 4-byte characters, on the grounds that the saving in
memory does not justify the performance overhead.

Frank Millman



I thought Terry Reedy had shot down any claims about performance overhead, and that the memory savings in many cases must be substantial and therefore worthwhile. Or have I misread something? Or what?

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list
