Re: Encoding

Philippe Verdy via Unicode Sun, 04 Nov 2018 13:01:48 -0800

I can take another example of what I call "legacy encoding" (by which I
really mean an encoding that is only an "approximation", from which no
semantics can be clearly inferred except through a non-deterministic
heuristic that will frequently make "false guesses").

Consider the case of the legacy Hangul "half-width" jamos: they were kept
in Unicode (as compatibility characters) but are not recommended for
encoding natural Korean text, because their semantics are not clear when
they are used in sequences. It is impossible to know clearly where
semantically significant syllable breaks occur, because they don't
distinguish "leading" and "trailing" consonants, so it is not even possible
to clearly infer that a Hangul "half-width" vowel jamo is logically
attached to the same syllable as the "half-width" consonant (or
consonant+vowel) jamo that is encoded just before it. As a consequence,
you cannot safely convert Korean texts using these "half-width" jamos into
normal jamos: only a heuristic can attempt to determine the syllable breaks
and then infer the "leading" or "trailing" semantics of the consonants.

This last distinction ("leading" or "trailing") is exactly like a letter
case distinction in Latin, so it can be said that the Korean alphabet is
bicameral for consonants but only monocameral for vowels: each Hangul
syllable normally starts with an "uppercase-like" consonant, or with a
consonant filler which is also "uppercase-like", and all other consonants
and all vowels are "lowercase-like". The heuristic that transforms the
legacy "half-width" jamos into normal jamos does the same thing as the
heuristic used in Latin that attempts to capitalize some leading letters in
words: it works frequently, but it also fails, and that heuristic is lossy
in Latin just as it is lossy in Korean!
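
To make the loss concrete, here is a minimal Python sketch using only the
standard unicodedata module. The three half-width jamos and the syllable
they are assumed to spell (GAN) are my own illustrative choice, not a real
corpus sample. It applies NFKC and compares the result with the intended
precomposed syllable.

```python
import unicodedata

# Half-width KIYEOK, A, NIEUN: a reader would take these as one syllable (GAN).
halfwidth = "\uFFA1\uFFC2\uFFA4"

# Compatibility normalization, with no human decision involved.
nfkc = unicodedata.normalize("NFKC", halfwidth)

for ch in nfkc:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The compatibility mappings turn every consonant into a *leading* jamo,
# so the final NIEUN cannot join the syllable as a trailing consonant.
intended = "\uAC04"  # HANGUL SYLLABLE GAN
print("matches intended syllable:", nfkc == intended)

# With an explicit trailing jamo (U+11AB) the composition does succeed.
explicit = unicodedata.normalize("NFC", "\u1100\u1161\u11AB")
print("explicit lead/vowel/trail composes:", explicit == intended)
```

The final consonant comes out as a second "leading" jamo, which is the
kind of wrong guess that only a heuristic or a human can repair.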

The same can be said about the heuristics that attempt to infer an
abbreviation semantic from existing superscript letters (either encoded in
Unicode, or encoded as plain letters modified by superscripting style in
CSS or HTML, or in word processors for example): they fail to give the
correct guess most of the time if there's no user to confirm the actual
intended meaning.

Such confirmation is the job of spell checkers in word processors: they
must clearly inform the user and let them decide. All that spell checkers
can do is provide visual hints to the user editing the document, such as
the common wavy red underline, signaling that several interpretations are
possible or that this is not the preferred encoding to convey the correct
semantics.

A spell checker may be instructed to do the conversion automatically while
text is typed, but there must be a way for the user to cancel this
transform and, if canceling the automatic transform causes the "wavy red
underline" to appear, to make his own decision about the real meaning. For
example, the user may type "Mr.": the wavy line appears under these 3
characters, and the spell checker proposes either to encode it as an
abbreviation "Mr<combining abbreviation mark>" or to leave "Mr." unchanged
(and no longer signaled), in which case the dot remains a regular
punctuation mark and the "r" is not modified.

Then the user may choose to style the "r" with superscripting or
underlining, and a new wavy red underline will appear below the three
characters "M<styled r>.", proposing only to transform the <styled r> into
<superscript r> or <r,combining underline>. Even when the user accepts one
of these suggestions, the text remains "M<superscript r>." or
"M<r,combining underline>.", from which it is still possible to infer the
semantics of an abbreviation (and propose to replace or keep the dot after
it), or to do nothing else and cancel these suggestions (hiding the wavy
red underline hint added by the spell checker), or to instruct the spell
checker that the superscript r means a mathematical exponent or a chemical
notation.
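
The first step of that interaction can be sketched in Python as follows.
This is only an illustration under loud assumptions: the <combining
abbreviation mark> is merely proposed in this thread and has no code
point, so a Private Use Area character (U+E000) stands in for it, and the
names CAM, Suggestion and suggest are hypothetical, not any real spell
checker's API.

```python
from dataclasses import dataclass
from typing import List

# ASSUMPTION: stand-in for the proposed <combining abbreviation mark>,
# which is not encoded; U+E000 is a Private Use Area placeholder.
CAM = "\uE000"


@dataclass
class Suggestion:
    label: str  # what the user is told
    text: str   # the encoding that would be stored if accepted


def suggest(token: str) -> List[Suggestion]:
    """Return the choices offered for a token such as "Mr.", or [] if none."""
    # Only letter sequences followed by a single final dot are flagged.
    if len(token) < 2 or not token.endswith(".") or not token[:-1].isalpha():
        return []
    stem = token[:-1]
    return [
        Suggestion("encode as abbreviation", stem + CAM),
        Suggestion("keep as typed (dot stays punctuation)", token),
    ]


for choice in suggest("Mr."):
    print(choice.label, [f"U+{ord(c):04X}" for c in choice.text])
```

Nothing is rewritten until the user picks one of the returned choices,
which is the whole point: the tool hints, the author decides.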

In all cases, the user/author has full control of the intended meaning of
his text, and an informed decision is made where all cases are now
distinguished. "Legacy" encoding can be kept as is (in Unicode), even if
it's no longer recommended, just as Unicode keeps half-width Hangul only
for compatibility (it just offers a "compatibility decomposition" for NFKD
or NFKC, but this is lossy and cannot be done automatically without a human
decision).

And the user/author can now freely and easily compose any abbreviation he
wishes in natural languages, without being limited by the reduced "legacy"
set of <superscript letters> encoded in Unicode (which should no longer be
extended, except for use as distinct plain letters needed in alphabets of
actual natural languages, or possibly as new IPA symbols), and without
using the styling tricks (of HTML/CSS, or of word processor documents,
spreadsheets and presentation documents allowing "rich text" formats on top
of "plain text"), which are best suited to "free styling" of any human
text without any additional semantics (or as a legacy but insufficient
trick for maths and chemical notations).
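
How reduced that "legacy" set is can be checked with a short Python
sketch. It assumes that "superscript letter" can be approximated by "code
point whose compatibility decomposition is tagged <super> and maps to one
ASCII lowercase letter", and the result depends on the Unicode version
bundled with the Python interpreter, so no fixed output is claimed here.

```python
import sys
import unicodedata
from string import ascii_lowercase

# Map each ASCII lowercase letter to the code points that decompose to it
# with a <super> compatibility tag.
covered = {}
for cp in range(sys.maxunicode + 1):
    decomp = unicodedata.decomposition(chr(cp))
    if not decomp.startswith("<super> "):
        continue
    targets = decomp.split()[1:]
    if len(targets) != 1:
        continue  # ignore multi-character mappings
    base = chr(int(targets[0], 16))
    if base in ascii_lowercase:
        covered.setdefault(base, []).append(f"U+{cp:04X}")

missing = [c for c in ascii_lowercase if c not in covered]

print("Unicode data version:", unicodedata.unidata_version)
print("letters with a superscript form:", "".join(sorted(covered)))
print("letters without one:", "".join(missing) or "(none)")
```

Whatever the version, the check only covers bare letters: an abbreviation
needing an accented or non-Latin superscript letter is not served by this
set.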

On Sun, 4 Nov 2018 at 20:51, Philippe Verdy <[email protected]> wrote:

> Note also that some other scripts have their own dedicated "abbreviation
> mark" encoded, but as distinctive punctuations or modifier letters: they
> are NOT combining. I do not advocate changing these scripts at all.
>
> Likewise, I don't propose to instruct authors to use an <Asian
> abbreviation mark> after Latin/Greek/Letters/Arabic/Hebrew letters used in
> abbreviations. This would be nonsense, including visually, even if you can
> infer some semantics from it, namely that this is effectively an
> abbreviation for text processing. It is still nonsense because it breaks
> the existing segregation of scripts, the delimitation of clusters, line
> breaking opportunities, and so on; and this approach would break because
> these <Asian abbreviation mark> can legally occur in isolation, without
> being necessarily attached to the previous cluster to modify it: the
> cluster before the <Asian abbreviation mark> could be, for example, a
> whitespace or a quotation mark.
>
> I don't propose the <combining abbreviation mark> as being suitable for
> mathematical exponents and chemical notations (they still need something
> else to allow their superscripts and subscripts to stack one below the
> other, and the variation of <combining abbreviation mark> explicitly
> permitting it to be rendered as a dot or another suitable mark, depending
> on the base character of the combining sequence, is NOT suitable for these
> mathematical or chemical notations).
>
> Once again, you need something else for these technical notations, but NOT
> the proposed <combining abbreviation mark>, and NOT EVEN the existing
> "modifier letters" <superscript letter X>, which were in fact first
> introduced only for IPA lowercase symbols, with some of them then being
> turned into "plain lowercase letters" in alphabets of some natural
> languages that have been recently romanized by borrowing IPA symbols
> (notably in Africa, where the initial letters borrowed from IPA, or some
> new specific letter variants with additional hooks, openings or strokes,
> were then followed by the addition of separate capital letters: these
> letters do NOT convey any semantic of an abbreviation, and this is also
> NOT the case for their usage as IPA symbols).
>
> There's NO interoperability at all when **abusively** taking the existing
> "modifier letters" <superscript letter X> or <superscript digit> for use in
> abbreviations (or even in technical notations in maths or chemical
> formulas, where they DON'T work the way they should when used with
> subscripts, and cannot represent multiple layers of superscripts or
> subscripts, e.g. for expressions like "2^{2^2}" in LaTeX for maths). Keep
> these "modifier letters" or <superscript digit> or <superscript
> punctuation> for use as plain letters or plain digits or plain punctuation
> or plain symbols (including IPA) in natural languages. Anything else is
> abusive and should be considered only as "legacy" encoding, not
> recommended at all in natural languages.
>
>
>
> On Sun, 4 Nov 2018 at 20:19, Philippe Verdy <[email protected]> wrote:
>
>>
>>
>> On Sun, 4 Nov 2018 at 18:34, Marcel Schneider <[email protected]> wrote:
>>
>>> On 04/11/2018 17:45, Philippe Verdy wrote:
>>> Marcel
>>> * As already repeatedly stated, I’m taking the one bit where TUS states
>>> that all natural languages shall be given a semantically unambiguous (ie
>>> not introducing new ambiguity) and interoperable digital representation.
>>>
>>
>> I also support the semantically unambiguous digital representation of
>> all natural languages.
>> Interoperability is always limited, even for existing scripts (including
>> Latin); that's why text renderers (and fonts) constantly need new
>> developments (but that does not mean that these developments will be
>> deployed).
>> That's why we have to document reasonable fallbacks for rendering on
>> limited platforms, each time this is possible (and in this case this is
>> clearly possible with extremely low effort).
>>
>> Even the mere fallback to render the <combining abbreviation mark> as a
>> dotted circle (total absence of support) will not block completely reading
>> the abbreviation:
>> * you'll see "2e◌" (which is still better than only "2e", with minimal
>> impact) instead of
>> * "2◌" (which is worse! This is still what already happens when you use
>> the legacy encoded <superscript e> which is also semantically ambiguous for
>> text processing), or
>> * "2e." (which is acceptable for rendering but ambiguous semantically for
>> text processing)
>>
>> So compare things fairly: the solution I propose is EVEN MORE
>> INTEROPERABLE than using <superscript Latin letters> (which also cannot
>> note all abbreviations, as it is limited to just a few letters, and most
>> of the time to only the few lowercase IPA symbols). It puts an end to the
>> pressure to encode superscript letters.
>>
>> If you want to support other notations (e.g. chemical or mathematical
>> notations, where both superscript and subscript must be present and
>> stack together, and where the allowed variation using a dot or similar
>> is not suitable), you need another encoding, and the existing legacy
>> <superscript Latin letters> are not suitable either.
>>
>>
>>
>>
