On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:
> Or you can give user programs memory indicies, and enjoy the fun as
> the poor developers do things like "pos += 1" which works fine on
> the ASCII data they have lying around, then wonder why they get
> Unicode errors when they take substrings.


a) You seem to be hung up on implementation details of emacs. But yes, positions 
should be stored as a byte offset into the UTF-8 string, NOT as the number of 
codepoints since the beginning of the string. Probably you want the position to 
be somewhat opaque, so that you actually have to specify whether you want to 
advance by one byte, one codepoint, or one grapheme.

b) Those poor developers are *already* screwed if they're using pos += 1 when 
pos is a codepoint index and they then take a substring based on that! They 
will get half a character when the string contains combining characters...
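
To make (b) concrete, here is a minimal sketch using only the standard 
library. The sample string is my own illustration: an "é" spelled as "e" 
followed by U+0301 COMBINING ACUTE ACCENT.

```python
import unicodedata

# Illustrative sample: 'e' + U+0301 COMBINING ACUTE ACCENT, then "tude".
s = "e\u0301tude"

pos = 0
pos += 1          # advance by one *codepoint*, as in the quoted example

head = s[:pos]    # just the base letter "e"; the accent has been cut off
rest = s[pos:]    # begins with the orphaned combining mark

# A nonzero combining class means rest[0] is a combining mark, so the
# substring starts in the middle of a user-perceived character.
starts_mid_character = unicodedata.combining(rest[0]) != 0
```

No exception is raised here, which is arguably worse: the codepoint index 
quietly produces a split character.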

Pretending that "codepoints" are a useful abstraction just lets poor 
developers get by without doing the correct thing (incrementing to the next 
grapheme boundary) for a little bit longer. But once you [the language 
implementor] are providing correct abstractions for grapheme movement, it's 
just as easy to also provide an abstraction for codepoint movement, and to make 
the low-level implementation of the iterator object a byte offset into a 
UTF-8 buffer.
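
A rough sketch of what such a position object might look like. Everything 
here is hypothetical (the class name Utf8Pos and its methods are made up for 
illustration), and the grapheme step is a simplification: it only skips 
trailing combining marks rather than implementing the full Unicode grapheme 
cluster rules.

```python
import unicodedata


class Utf8Pos:
    """Hypothetical opaque position: a byte offset into a UTF-8 buffer."""

    def __init__(self, buf: bytes, offset: int = 0):
        self._buf = buf
        self._offset = offset

    @property
    def offset(self) -> int:
        return self._offset

    def next_byte(self) -> "Utf8Pos":
        # May land inside a multi-byte sequence; callers must ask for it.
        return Utf8Pos(self._buf, min(self._offset + 1, len(self._buf)))

    def next_codepoint(self) -> "Utf8Pos":
        i = self._offset + 1
        # Skip UTF-8 continuation bytes (bit pattern 10xxxxxx).
        while i < len(self._buf) and (self._buf[i] & 0xC0) == 0x80:
            i += 1
        return Utf8Pos(self._buf, min(i, len(self._buf)))

    def _peek(self) -> str:
        # Decode the single codepoint starting at this position.
        # Assumes the position is on a codepoint boundary.
        end = self.next_codepoint().offset
        return self._buf[self._offset:end].decode("utf-8")

    def next_grapheme(self) -> "Utf8Pos":
        # Simplified: one base codepoint plus any following combining marks.
        p = self.next_codepoint()
        while p.offset < len(self._buf) and unicodedata.combining(p._peek()):
            p = p.next_codepoint()
        return p

    def slice_to(self, other: "Utf8Pos") -> str:
        return self._buf[self._offset:other.offset].decode("utf-8")


buf = "e\u0301tude".encode("utf-8")   # 'e', combining acute (2 bytes), "tude"
start = Utf8Pos(buf)
first_grapheme = start.slice_to(start.next_grapheme())
```

The caller never does arithmetic on the offset directly; it has to say which 
kind of step it wants, and the storage underneath stays a plain byte offset.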

James