> On Jul 16, 2018, at 3:28 PM, Terry Reedy <tjre...@udel.edu> wrote:
>
>> On 7/16/2018 1:11 PM, Richard Damon wrote:
>>
>> Many consider that UTF-32 is a variable-width encoding because of the
>> combining characters. It can take multiple ‘codepoints’ to define what
>> should be a single ‘character’ for display.
>
> I hope you realize that this is not the standard meaning of 'variable-width
> encoding', which is 'variable number of bytes for a codepoint'. UTF-16 and
> UTF-8 are variable width. If one expands the definition enough, Ascii is
> 'variable width' because 'fi' is two bytes, or more realistically, because <=
> and >= are two bytes instead of one (as they can be in Unicode!).
>
> If one is using a broader definition than usual, it is clearer to say so.
>
> --
> Terry Jan Reedy
>
You are defining a variable/fixed-width code point set. Many others want to deal with CHARACTER sets. The Unicode Consortium agrees that a code point is not necessarily a character (which is one reason they came up with the term).

When actually trying to do work with text strings, the fact that some code points are combining codes that need to ‘stick’ to their mate becomes important. One of the claimed advantages of fixed-width character set encodings is that you aren’t supposed to need to worry about breaking strings in two, but that doesn’t work in Unicode: you need to make sure you aren’t breaking a combining sequence. Even worse, Unicode really needs arbitrary look-back to render substrings, because it uses shift codes for things like left-to-right/right-to-left rendering control.

This doesn’t mean that UTF-32 is an awful system, just that it isn’t the magical cure that some were hoping for.

--
https://mail.python.org/mailman/listinfo/python-list
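
P.S. To make the point about combining sequences concrete, here is a minimal Python sketch. The helper name safe_split is hypothetical, made up for this illustration. It only guards against combining marks with a nonzero canonical combining class, as reported by unicodedata.combining; it is not full grapheme-cluster segmentation, and it does nothing about the directional control issue.

```python
import unicodedata


def safe_split(text, index):
    """Split text at index, backing up so that a combining mark
    is never separated from the base character it attaches to.

    Hypothetical helper for illustration only: it checks the canonical
    combining class and ignores other grapheme-cluster rules.
    """
    while 0 < index < len(text) and unicodedata.combining(text[index]):
        index -= 1
    return text[:index], text[index:]


# 'cafés' spelled with 'e' followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301s"

# Six code points, and exactly four bytes per code point in UTF-32 ...
code_points = len(word)
utf32_bytes = len(word.encode("utf-32-le"))

# ... yet a split at a fixed code point offset still strands the accent.
naive = (word[:4], word[4:])   # accent ends up at the start of the second half
safe = safe_split(word, 4)     # split point moves back before the 'e'

print(code_points, utf32_bytes)
print([ascii(part) for part in naive])
print([ascii(part) for part in safe])
```

The naive slice leaves the accent dangling at the front of the second piece, even though every code point occupies the same number of bytes. Backing up over combining marks avoids that particular break, but a real text-processing library would need considerably more than this.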