Here is another example of what I call "legacy encoding" (by which I mean an encoding that is only an "approximation", from which no semantics can be clearly inferred except by a non-deterministic heuristic that frequently makes "false guesses").
Consider the legacy Hangul "half-width" jamos. They were kept in Unicode as compatibility characters, and they are not recommended for encoding natural Korean text, because their semantics are unclear when they are used in sequences. It is impossible to know where the semantically significant syllable breaks occur, because these characters do not distinguish "leading" from "trailing" consonants. It is not even possible to infer that a "half-width" vowel jamo is logically attached to the same syllable as the "half-width" consonant (or consonant+vowel) jamo encoded just before it. As a consequence, you cannot safely convert Korean text using these "half-width" jamos into normal jamos: only a heuristic can attempt to determine the syllable breaks and then infer the "leading" or "trailing" semantics of the consonants.
This last distinction ("leading" or "trailing") is exactly like a letter case distinction in Latin. One can say that the Korean alphabet is bicameral for consonants but monocameral for vowels: each Hangul syllable normally starts with an "uppercase-like" consonant, or with a consonant filler that is also "uppercase-like", and all other consonants and all vowels are "lowercase-like". The heuristic that transforms the legacy "half-width" jamos into normal jamos does the same thing as the heuristic used in Latin that attempts to capitalize some leading letters in words: it works frequently, but it also fails, and it is lossy in Latin just as it is lossy in Korean!
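To make the loss concrete, here is a minimal sketch in Python using only the standard unicodedata module. It assumes a reader would intend the three half-width jamos HIEUH, A, NIEUN as the single syllable "han" (U+D55C); that intent is my assumption for the illustration, and nothing in the encoded text carries it. The helper name describe is hypothetical.

```python
import unicodedata

# Half-width jamos HIEUH, A, NIEUN. Nothing in these code points says
# whether NIEUN closes this syllable or opens the next one.
halfwidth = "\uFFBE\uFFC2\uFFA4"

# The automatic compatibility normalization.
nfkc = unicodedata.normalize("NFKC", halfwidth)

# What a human reader would (by assumption) have meant: HANGUL SYLLABLE HAN.
intended = "\uD55C"

def describe(text):
    """Return the Unicode character names in `text`, for inspection."""
    return [unicodedata.name(ch) for ch in text]

print(describe(halfwidth))
print(describe(nfkc))
print("matches intended text:", nfkc == intended)
```

The compatibility mapping always lands on the leading ("uppercase-like") form of the consonant, so the final NIEUN comes out as a leading jamo that cannot compose with the preceding syllable. Recovering the intended syllable needs a guess about where the syllable break is, which is exactly the heuristic described above.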
The same can be said about heuristics that attempt to infer an abbreviation semantic from existing superscript letters (whether encoded in Unicode, or encoded as plain letters with a superscript style in CSS or HTML, or in word processors, for example): without a user to confirm the intended meaning, they guess wrong most of the time. Such confirmation is the job of spell checkers in word processors: they must clearly inform the user and let the user decide. All a spell checker can do is give visual hints to the user editing the document, such as the common wavy red underline, signalling that several interpretations are possible, or that this is not the preferred encoding to convey the correct semantics.
A spell checker may be instructed to do the conversion automatically while the text is typed, but there must be a way for the user to cancel this transform and make their own decision about the real meaning when canceling the automatic transform brings back the "wavy red underline". For example, the user may type "Mr."; the wavy line then appears under these three characters, and the spell checker proposes either to encode it as an abbreviation, "Mr<combining abbreviation mark>", or to leave "Mr." unchanged (and no longer flagged), in which case the dot remains regular punctuation and the "r" is not modified. The user may then choose to style the "r" with superscripting or underlining, and a new wavy red underline appears below the three characters "M<styled r>.", proposing only to transform the <styled r> into <superscript r> or <r,combining underline>. Even when the user accepts one of these suggestions, the text remains "M<superscript r>." or "M<r,combining underline>.", where it is still possible to infer the semantics of an abbreviation (and propose to replace or keep the dot after it), or to do nothing else and cancel these suggestions (hiding the wavy red underline hint added by the spell checker), or to instruct the spell checker that the superscript r is a mathematical exponent or a chemical notation.
In all cases, the user/author has full control over the intended meaning of the text, and an informed decision is made in which all cases are now distinguished. "Legacy" encoding can be kept as is in Unicode, even if it is no longer recommended, just as Unicode documents half-width Hangul as not recommended (it only offers a "compatibility decomposition" used by NFKD and NFKC, but this is lossy and cannot be done reliably without a human decision).
And the user/author can now freely and easily compose any abbreviation in natural languages, without being limited by the reduced "legacy" set of <superscript letters> encoded in Unicode (which should no longer be extended, except for use as distinct plain letters needed in alphabets of actual natural languages, or possibly as new IPA symbols), and without using the styling tricks (of HTML/CSS, or of word processor documents, spreadsheets and presentations allowing "rich text" formats on top of "plain text"), which are best suited to "free styling" of any human text without additional semantics (or as a legacy but insufficient trick for maths and chemical notations).

On Sun, 4 Nov 2018 at 20:51, Philippe Verdy <verd...@wanadoo.fr> wrote:

> Note also that some other scripts have their own dedicated "abbreviation mark" encoded, but as distinctive punctuation or modifier letters: they are NOT combining. I do not advocate changing these scripts at all.
>
> Likewise, I don't propose to instruct authors to use an <Asian abbreviation mark> after Latin/Greek/Arabic/Hebrew letters used in abbreviations.
> This would be nonsense, including visually, even if you could infer some semantics from it (namely that this is effectively an abbreviation, for text processing). It would break the existing segregation of scripts, the delimitation of clusters, line-breaking opportunities, and so on. The approach would also fail because an <Asian abbreviation mark> can legally occur in isolation, without necessarily being attached to the previous cluster to modify it: the previous cluster could be, for example, whitespace or a quotation mark.
>
> I don't propose the <combining abbreviation mark> as suitable for mathematical exponents and chemical notations. They still need something else to allow their superscripts and subscripts to stack on each other, and the variation of the <combining abbreviation mark> that explicitly permits rendering it as a dot or another suitable mark, depending on the base character of the combining sequence, is NOT suitable for these mathematical or chemical notations.
>
> Once again, you need something else for these technical notations: NOT the proposed <combining abbreviation mark>, and NOT EVEN the existing "modifier letters" <superscript letter X>. These were in fact first introduced only for IPA lowercase symbols; some of them then became "plain lowercase letters" in the alphabets of natural languages that were recently romanized by borrowing IPA symbols (notably in Africa, where the initial letters borrowed from IPA, or some new letter variants with additional hooks, openings or strokes, were later followed by the addition of separate capital letters). These letters do NOT convey any abbreviation semantic, and neither does their usage as IPA symbols.
>
> There's NO interoperability at all when **abusively** taking the existing "modifier letters" <superscript letter X> or <superscript digit> for use in abbreviations (or even in technical notations in maths or chemical formulas, where they DON'T work the way they should when used with subscripts, and cannot represent nested levels of superscripts, e.g. for expressions like "2^2^2" in LaTeX for maths). Keep these "modifier letters", <superscript digit> or <superscript punctuation> for use as plain letters, plain digits, plain punctuation or plain symbols (including IPA) in natural languages. Anything else is abusive and should be considered only as "legacy" encoding, not recommended at all in natural languages.
>
> On Sun, 4 Nov 2018 at 20:19, Philippe Verdy <verd...@wanadoo.fr> wrote:
>
>> On Sun, 4 Nov 2018 at 18:34, Marcel Schneider <charupd...@orange.fr> wrote:
>>
>>> On 04/11/2018 17:45, Philippe Verdy wrote:
>>> Marcel
>>> * As already repeatedly stated, I’m taking the one bit where TUS states that all natural languages shall be given a semantically unambiguous (i.e. not introducing new ambiguity) and interoperable digital representation.
>>
>> I also support the semantically unambiguous digital representation of all natural languages.
>> Interoperability is always limited, even for existing scripts (including Latin); that's why text renderers (and fonts) constantly need new developments (but that does not mean that these developments will be deployed).
>> That's why we have to document reasonable fallbacks for rendering on limited platforms, each time this is possible (and in this case it is clearly possible with extremely little effort).
>>
>> Even the mere fallback of rendering the <combining abbreviation mark> as a dotted circle (total absence of support) will not completely block reading the abbreviation:
>> * you'll see "2e◌" (which is still better than only "2e", with minimal impact) instead of
>> * "2◌" (which is worse! This is still what already happens when you use the legacy encoded <superscript e>, which is also semantically ambiguous for text processing), or
>> * "2e." (which is acceptable for rendering but semantically ambiguous for text processing)
>>
>> So compare things fairly: the solution I propose is EVEN MORE INTEROPERABLE than using <superscript Latin letters> (which also cannot note all abbreviations, as it is limited to just a few letters, and most of the time to only the few lowercase IPA symbols). It puts an end to the pressure to encode superscript letters.
>>
>> If you want to support other notations (e.g. chemical or mathematical notations, where both superscript and subscript must be present and stack together, and where the allowed variation using a dot or similar is not suitable), you need another encoding, and the existing legacy <superscript Latin letters> are not suitable either.