On Fri, Jul 21, 2017 at 4:34 PM, Steve D'Aprano <steve+pyt...@pearwood.info> wrote:
> On Fri, 21 Jul 2017 01:43 pm, Chris Angelico wrote:
>
>> Strings with all code
>> points on the BMP and no combining characters are still able to be
>> represented as they are today, again with the empty secondary array.
>
> I presume that since the problem we're trying to solve here is that certain
> characters have two representations, this format will automatically decompose
> strings. Otherwise, it doesn't really solve the problems with diacritics, where
> a single human-readable character like é or ö has two distinct, and non-equal,
> representations.
>
> But if it does, then every string with a diacritic (i.e. most Western European
> text, if not Eastern European as well) will need combining characters.
>
> If this *doesn't* decompose the strings, then what problem is it actually
> solving?

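For concreteness, the "two distinct, and non-equal, representations" of é can be seen with the standard unicodedata module. This is only a minimal sketch of the current str behaviour, not of the proposed format; the variable names are mine.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point (U+00E9)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Both render as the same character, but they compare unequal as str objects.
same_without_normalizing = (composed == decomposed)

# Normalizing to a common form (NFC composes, NFD decomposes) makes them equal.
nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

print(same_without_normalizing)         # the raw strings differ
print(nfd_of_composed == decomposed)    # equal once both are decomposed
print(nfc_of_decomposed == composed)    # equal once both are composed
print(len(composed), len(decomposed))   # code point counts differ
```

If the format always decomposed, every é would take the two-code-point form, which is the cost Steve is pointing at.
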
I'm honestly not sure, though I had been assuming that it was capable
of representing composed OR decomposed strings. If it does decompose
everything, then yeah, a lot more will need the secondary array.

>> Similarly, the secondary array will only VERY rarely need to contain
>> any pointers; most combined characters consist of a base and one
>> combining, or a set of three characters at most.
>
> I don't know if you can make that claim for non-West European languages. I
> don't know enough about (for example) Slavic languages, or Thai, or Arabic, or
> Chinese, to know whether (base + three combining characters) will be rare or
> not.

Not sure, but what I usually see is that one Chinese character gets
one Unicode codepoint. But again, forcible decomposition may change
this.

> But emoji sequences will often require four code points, three of which will
> be in the supplementary planes.
>
> http://unicode.org/emoji/charts/emoji-zwj-sequences.html

"Often"? I doubt that; a lot of emoji don't require that many.

>> There'll be dramatic
>> performance costs for strings where piles of combining characters get
>> loaded on top of a single base, but at least they can be accurately
>> represented.
>
> They can be accurately represented right now. E.g. there is nothing ambiguous
> or inaccurate about U+1F469 U+1F3FD U+200D U+1F52C, "woman scientist with
> medium skin tone".

I may have elided a bit too much here. Let's start with a simpler
representation: a string is represented as a tuple of Python integer
objects, each of which uses the original scheme. Now, that's able to
represent everything, but it's stupidly expensive. The original
multi-tiered scheme gives vast improvements for everything other than
this case, but at least it doesn't make them unrepresentable (cf
UCS-2).

ChrisA
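
To put numbers on the emoji example and on the "base plus combining characters" idea, here is a rough sketch using only the standard library. The clusters() helper is hypothetical and purely illustrative: it groups each base with the combining marks that follow it (non-zero canonical combining class) into a tuple of code points. It is not the multi-tiered scheme discussed in this thread, and it does not treat ZWJ emoji sequences as a single unit.

```python
import unicodedata

# "woman scientist with medium skin tone", as quoted above
seq = "\U0001F469\U0001F3FD\u200D\U0001F52C"
code_points = [ord(ch) for ch in seq]
supplementary = [cp for cp in code_points if cp > 0xFFFF]
# Each supplementary-plane code point needs a surrogate pair in UTF-16.
utf16_units = len(seq.encode("utf-16-le")) // 2


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: returns a tuple of tuples of code points.
    A mark is anything with a non-zero canonical combining class.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1].append(ord(ch))   # attach the mark to the previous base
        else:
            out.append([ord(ch)])     # start a new cluster
    return tuple(tuple(c) for c in out)


print(len(code_points), len(supplementary), utf16_units)
print(clusters("e\u0301"))                 # one cluster: base + one mark
print(clusters("a\u0301\u0323b"))          # two marks on 'a', then plain 'b'
```

The emoji sequence comes out as four separate clusters here, since ZWJ and the skin-tone modifier have combining class 0; handling those properly would need the grapheme cluster rules from UAX #29.
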