On 1/15/2014 11:55 AM, Robin Becker wrote:

The fact that unicoders want to take over the meaning of encoding is not
relevant.

I agree with you that 'encoding' should not be limited to 'byte encoding of a (subset of) unicode characters'. For instance, .jpg and .png are byte encodings of images. On the other hand, it is common in human discourse to omit qualifiers in particular contexts. 'Computer virus' gets condensed to 'virus' in computer contexts.

The problem with graphemes is that there is no fixed set of unicode graphemes. Which is to say, the effective set of graphemes is context-specific. Just limiting ourselves to English, 'fi' is usually 2 graphemes when printing to screen, but often just one when printing to paper. This is why the Unicode consortium punted 'graphemes' to 'application' code.
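
A minimal sketch of why the count depends on context, using only the standard unicodedata module. Python's str counts code points, not graphemes, and the sample strings here are illustrations of mine rather than anything from the thread.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one mark on screen.
decomposed = "e\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# LATIN SMALL LIGATURE FI: one code point that stands for 'f' + 'i'.
ligature = "\ufb01"
# NFKC (compatibility normalization) expands it back to two letters.
expanded = unicodedata.normalize("NFKC", ligature)

# The "same" text has different lengths depending on the form chosen.
print(len(decomposed), len(composed))
print(len(ligature), len(expanded))
```

Neither length is 'the' grapheme count; which one is wanted is up to the application.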

I'm not anti unicode, that's just an assignment of identity to some
symbols. Coding the values of the ids is a separate issue. It's my
belief that we don't need more than the byte level encoding to represent
unicode. One of the claims made for python3 unicode is that it somehow
eliminates the problems associated with other encodings eg utf8,

The claim is true for the following problems of the way-too-numerous unicode byte encodings.

Subsetting: only a subset of characters can be encoded.

Shifting: the meaning of a byte depends on a preceding shift character, which might be back at the beginning of the sequence.

Varying size: the number of bytes to encode a character depends on the character.

Both of the last two problems can turn O(1) operations, such as indexing by character position, into O(n) operations. Python 3.3+ strings eliminate all these problems.
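
The three problems can be seen with stock codecs. This is only a sketch: the helper names (can_encode, utf8_sizes) and the sample characters are my own choices for illustration, and iso2022_jp is just one convenient example of a shifting encoding.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def utf8_sizes(text):
    """Number of UTF-8 bytes used by each character of text."""
    return [len(ch.encode("utf-8")) for ch in text]


# Subsetting: ASCII cannot represent U+00E9 (e with acute accent).
ascii_ok = can_encode("\u00e9", "ascii")

# Shifting: ISO-2022-JP emits an escape sequence, and the two bytes that
# follow it only mean HIRAGANA LETTER A because of that earlier escape.
shifted = "\u3042".encode("iso2022_jp")
payload = shifted[3:5]
# Read without the shift state, the same bytes are two ASCII characters.
misread = payload.decode("ascii")

# Varying size: 'a', U+00E9, U+20AC, U+1F600 take 1 to 4 bytes in UTF-8.
sample = "a\u00e9\u20ac\U0001F600"
sizes = utf8_sizes(sample)

# A str is indexed by character position directly. Finding the same
# character in the UTF-8 bytes requires summing the preceding sizes.
last_char = sample[3]
byte_offset = sum(sizes[:3])
same_char = sample.encode("utf-8")[byte_offset:].decode("utf-8")

print(ascii_ok, shifted, misread, sizes, last_char == same_char)
```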

--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list
