> On Jul 16, 2018, at 3:28 PM, Terry Reedy <tjre...@udel.edu> wrote:
> 
>> On 7/16/2018 1:11 PM, Richard Damon wrote:
>> 
>> Many consider that UTF-32 is a variable-width encoding because of the 
>> combining characters. It can take multiple ‘codepoints’ to define what 
>> should be a single ‘character’ for display.
> 
> I hope you realize that this is not the standard meaning of 'variable-width 
> encoding', which is 'variable number of bytes for a codepoint'.  UTF-16 and 
> UTF-8 are variable width.  If one expands the definition enough, Ascii is 
> 'variable width' because 'fi' is two bytes, or more realistically, because <= 
> and >= are two bytes instead of one (as they can be in Unicode!).
> 
> If one is using a broader definition than usual, it is clearer to say so.
> 
> -- 
> Terry Jan Reedy
> 

You are defining variable/fixed width in terms of codepoints. Many others want 
to deal with CHARACTER sets. The Unicode Consortium agrees that a codepoint is 
not necessarily a character (which is one reason they came up with the term). 
When actually trying to do work with text strings, the fact that some 
codepoints are combining codes that need to ‘stick’ to their mate becomes 
important. One of the claimed advantages of fixed-width character set 
encodings is that you aren’t supposed to need to worry about breaking strings 
in two, but that doesn’t work in Unicode: you need to make sure you aren’t 
breaking a combining sequence.
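
As a rough illustration of the splitting problem, here is a minimal Python 
sketch. The helper name safe_split is made up for this example, and it only 
checks the canonical combining class via unicodedata.combining(), so it is an 
approximation and not full grapheme-cluster segmentation (it knows nothing 
about ZWJ emoji sequences, for instance).

```python
import unicodedata

def safe_split(text, index):
    """Split text at or before index so that a combining mark is never
    separated from the base character it follows (hypothetical helper)."""
    # Back up while the codepoint at the split point is a combining mark.
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return text[:index], text[index:]

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one character
s = "cafe\u0301s"

naive = (s[:4], s[4:])   # the accent ends up at the start of the second piece
safe = safe_split(s, 4)  # the split point moves back in front of the 'e'
```

Even with one codepoint per index, the naive slice strands the accent at the 
start of the second piece, which is the point being made above.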

Even worse, Unicode really needs arbitrary look-back to render substrings, 
because it uses stateful control codes (in effect, shift codes) for things 
like left-to-right/right-to-left rendering control.
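
A small Python sketch of why the look-back is needed, assuming only the 
explicit embedding/override controls and their terminator (a subset of what 
the Unicode Bidirectional Algorithm handles; isolates and implicit 
directionality are ignored). The function name open_controls is made up for 
this example.

```python
# Explicit directional formatting characters (only a subset of the real set)
OPENERS = {"\u202A", "\u202B", "\u202D", "\u202E"}  # LRE, RLE, LRO, RLO
POP = "\u202C"  # PDF, POP DIRECTIONAL FORMATTING

def open_controls(text, index):
    """Return the directional controls still in effect just before text[index]."""
    stack = []
    for ch in text[:index]:  # has to scan from the very start of the string
        if ch in OPENERS:
            stack.append(ch)
        elif ch == POP and stack:
            stack.pop()
    return stack

# "def ghi" sits inside a right-to-left override (RLO ... PDF)
s = "abc \u202Edef ghi\u202C jkl"

sub = s[9:12]                 # "ghi": the slice itself carries no control code
state = open_controls(s, 9)   # the override opened back at index 4 is still active
```

The slice "ghi" looks like plain text on its own, yet it should render under 
the override that was opened earlier in the string.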

This doesn’t mean that UTF-32 is an awful system, just that it isn’t the 
magical cure that some were hoping for.
-- 
https://mail.python.org/mailman/listinfo/python-list
