On Fri, Nov 2, 2018 at 16:20, Marcel Schneider via Unicode <unicode@unicode.org> wrote:
> That seems to me a regression, after the front has moved in favor of
> recognizing that the Latin script needs preformatted superscripts. The
> use case is clear, as we have ª, º, and n° with degree sign, and so on,
> as already detailed in long e-mails in this thread and elsewhere. There
> is no point in setting up or maintaining a Unicode policy stating
> otherwise, as such a policy would be inconsistent with long-lasting and
> extremely widespread practice.

Using variation selectors is only appropriate for the existing (preencoded) superscript letters ª and º, so that they display the appropriate (underlined or not underlined) glyph. It is not a solution for creating superscripts on arbitrary letters and marking that they should be rendered as superscript. Notably, the base letter to be transformed into a superscript may also carry its own combining diacritics, which must be encoded explicitly; and if you use a variation selector, it would also have to allow variation on the presence or absence of the underline (which must then be encoded explicitly as a combining character).

So with variation selectors what we get is <baseline letter, variation selector, combining diacritic> and <baseline letter precombined with the diacritic, variation selector>, which are NOT canonically equivalent. Using a combining character avoids this caveat: <baseline letter, combining diacritic, combining abbreviation mark> and <baseline letter precombined with the diacritic, combining abbreviation mark> ARE canonically equivalent.

A combining character also states the semantics explicitly, something that is lost if we are forced to use presentational superscripts in a higher-level protocol like HTML/CSS for rich text and someone then extracts the plain text. Using collation will not help at all, unless collators are built with preprocessing that first infers the presence of a <combining abbreviation mark> to insert after each combining sequence of the plain text that was enclosed in a superscript style.

There is little risk: if the <combining abbreviation mark> is not mapped in fonts (or not recognized by text renderers, which could otherwise create synthetic superscript glyphs from existing recognized clusters), it will render as a visible .notdef (tofu). But normally text renderers recognize the basic character properties in the UCD and can see that <combining abbreviation mark> has a combining-mark general category (they also know that it has combining class 0, so canonical equivalences are not broken), and can then render a better symbol than the .notdef tofu: a dotted circle. Even if the tofu or dotted circle is rendered, it still explicitly marks the presence of the abbreviation mark, so there is less confusion about what precedes it (the combining sequence that was supposed to be superscripted).
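For the record, the normalization behavior this argument relies on can be checked with Python's unicodedata module. Since no <combining abbreviation mark> is encoded today, the sketch below substitutes U+20DD COMBINING ENCLOSING CIRCLE, an existing combining mark with combining class 0, purely as a stand-in:

    import unicodedata

    VS1   = "\uFE00"  # VARIATION SELECTOR-1
    ACUTE = "\u0301"  # COMBINING ACUTE ACCENT (ccc=230)
    # Stand-in for the proposed <combining abbreviation mark>: any existing
    # combining mark with combining class 0 shows the same behavior.
    ABBR  = "\u20DD"  # COMBINING ENCLOSING CIRCLE (gc=Me, ccc=0)

    def canonically_equivalent(a: str, b: str) -> bool:
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    # Variation-selector encoding: decomposed vs. precomposed spellings diverge.
    print(canonically_equivalent("e" + VS1 + ACUTE, "\u00E9" + VS1))    # False

    # Combining-mark encoding: the two spellings stay canonically equivalent.
    print(canonically_equivalent("e" + ACUTE + ABBR, "\u00E9" + ABBR))  # True

    # The UCD properties a renderer would consult to fall back to a dotted
    # circle instead of a .notdef tofu:
    print(unicodedata.category(ABBR))   # 'Me' -> a combining mark
    print(unicodedata.combining(ABBR))  # 0    -> does not disturb reordering

Any other ccc=0 combining mark would show the same results; the point is the normalization algorithm, not the particular code point chosen here.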
The <combining abbreviation mark> can also take its own <variation selector> to select other styles where they are optional, such as adding an underline to the superscripted letter, rendering the letter as a subscript instead, or rendering it as a small baseline letter with a dot after it: it is still an explicit abbreviation mark, and the meaning of the plain text is preserved. The variation selector is only suitable for altering the rendering of a cluster when it effectively has several variants and the default rendering is not universal, notably across font styles initially designed for specific markets with their own local preferences: the variation selector still allows the same fonts to map all known variants distinctly, independently of the initially arbitrary choice of the default glyph used when the variation selector is missing.

Even if fonts (or text renderers) map the <combining abbreviation mark> to variable glyphs, this is purely stylistic; the semantics of the plain text are not lost, because the <combining abbreviation mark> is still there. No rich text is needed to encode it (rich-text styles do not explicitly encode that a superscript is actually an abbreviation mark, so they also cannot allow variations such as rendering a subscript, or a baseline small glyph with an added dot). Typically, a <combining abbreviation mark> used in an English style would render the letter (or cluster) before it as a "small" letter without any added dot.

So I really think that <combining abbreviation mark> is far better than:

* using preencoded superscript letters (they do not cover the full repertoire of clusters where the abbreviation mark is needed; today they cover just the Basic Latin letters, the ten digits, the plus and minus signs, the dot or comma, and a few other characters; it is impossible to re-encode the full Unicode repertoire and all its allowed combining sequences or extended default grapheme clusters!),

* using variation selectors to make letters appear as superscripts (this does not work for clusters containing other diacritics such as accents),

* or using rich-text styling, from which you cannot safely infer any semantics: there is no guarantee that M<sup>r</sup> in HTML is actually an abbreviation of "Mister". In HTML the semantics are encoded elsewhere, as <abbr title="Mister">M<sup>r</sup></abbr> or <abbr>M<sup>r</sup></abbr>: the fact of the abbreviation has to be looked up in a possible <abbr> container element, and its meaning in that element's title attribute. So this obviously requires complex preprocessing (see the sketch after this list) before we can infer a plain-text version <M, r, combining abbreviation mark> (suitable, for example, in plain-text searches where you do not want to match a mathematical object M, like a matrix, raised to the power r, or a plain letter M followed by a footnote call noted by the letter "r").

The mark solves all the practical problems: the legacy preencoded superscript Latin letters (aka "modifier letters") should never have been used or needed (not even for IPA usage, which could have used an explicit <combining IPA symbol mark> for its superscripted symbols, or for its distinctive "a" and "g"). We should not have needed to encode the variants for "a" and "g": these were old hacks that have broken the Unicode character encoding model since the beginning.
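To illustrate the kind of preprocessing mentioned in the last bullet, here is a minimal sketch in Python (using the standard html.parser module) that flattens superscripted runs inside <abbr> elements into plain text. The abbreviation mark is again the hypothetical U+20DD stand-in, since the proposed character does not exist, and real code would need proper grapheme-cluster segmentation rather than the per-character marking used here:

    from html.parser import HTMLParser

    ABBR_MARK = "\u20DD"  # stand-in for the hypothetical abbreviation mark

    class AbbrFlattener(HTMLParser):
        """Turn <abbr>M<sup>r</sup></abbr> into plain text where each
        superscripted character inside an <abbr> gets the mark after it."""
        def __init__(self):
            super().__init__()
            self.out = []
            self.in_abbr = 0
            self.in_sup = 0

        def handle_starttag(self, tag, attrs):
            if tag == "abbr":
                self.in_abbr += 1
            elif tag == "sup":
                self.in_sup += 1

        def handle_endtag(self, tag):
            if tag == "abbr":
                self.in_abbr -= 1
            elif tag == "sup":
                self.in_sup -= 1

        def handle_data(self, data):
            if self.in_abbr and self.in_sup:
                # Superscripted run inside an abbreviation: append the mark.
                self.out.append("".join(c + ABBR_MARK for c in data))
            else:
                self.out.append(data)

    p = AbbrFlattener()
    p.feed('<abbr title="Mister">M<sup>r</sup></abbr> Smith')
    print("".join(p.out))  # "M" + "r" + mark + " Smith"

A search engine running this pass would then match the abbreviation without also matching a mathematical M raised to the power r, which is exactly the distinction the plain-text mark is meant to preserve.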
However, only roundtrip compatibility with legacy non-UCS charsets militated for keeping the feminine and masculine ordinal indicators, or the "Numero" cluster (actually made of two letters, the second one followed by an implicit abbreviation mark, but transformed in the legacy charset into a single unbreakable cluster containing only one symbol). Even Unicode treats the abbreviated Numero sign as only "compatibility equivalent" to the letter N followed by the letter o, and the masculine ordinal indicator as only "compatibility equivalent" to the letter o with an implicit superscript (and an optional combining underline in some typographic traditions).

All these superscripts in Unicode (as well as the mathematical "styled" letters, which were also completely unnecessary and will necessarily remain incomplete for their intended usage) should now be treated only as legacy practice; they should be deprecated in favor of the more semantic and logical character encoding model, completely deprecating the legacy visual encoding. Only precombined characters recognized by canonical equivalences are part of the standard and may be kept as "non-legacy": they still fit in the logical encoding. The same goes for the extended default grapheme clusters, which include the precomposed Hangul LV and LVT syllables, CGJ used before combining marks with non-zero combining class, and variation selectors used only after base letters with combining class zero that start the extended default grapheme clusters.

Let's return to the root of the far better logical encoding, which remains the recommended practice. All the rest is legacy: some of it came from decisions taken to preserve roundtrip compatibility with legacy charsets (including prepended letters in Thai), so we have a few compatibility characters (which are not the recommended practice), but the rest was bad decisions made by Unicode and the ISO WG that broke the logical character encoding model.
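The "compatibility only" status of these characters is easy to confirm from the UCD, for instance with Python's unicodedata: canonical normalization (NFD) leaves them intact, and only compatibility normalization (NFKD) unfolds them into plain letters:

    import unicodedata

    # NFD leaves these characters alone; only NFKD unfolds them, confirming
    # they are merely "compatibility equivalent" to letter sequences.
    for ch in ("\u2116", "\u00AA", "\u00BA"):   # NUMERO SIGN, ª, º
        print(f"U+{ord(ch):04X} {ch!r}: "
              f"NFD={unicodedata.normalize('NFD', ch)!r}, "
              f"NFKD={unicodedata.normalize('NFKD', ch)!r}")
    # U+2116 '№': NFD='№', NFKD='No'
    # U+00AA 'ª': NFD='ª', NFKD='a'
    # U+00BA 'º': NFD='º', NFKD='o'

Note that the superscripting itself (the <super> tag on the decompositions of ª and º) is dropped entirely by NFKD, which is exactly the semantic loss the <combining abbreviation mark> is meant to avoid.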