On 1/10/2017 2:54 PM, Richard Wordingham wrote:
On Tue, 10 Jan 2017 13:12:47 -0800
Asmus Freytag <asm...@ix.netcom.com> wrote:

Unicode clearly doesn't forbid most sequences in complex scripts,
even if they cannot be expected to render properly and otherwise
would stump the native reader.
Is this expectation based on sequence enforcement in the renderer?  The
main problem with getting text to render reasonably (not necessarily as
desired) is now anti-phishing.

You mean anti-spoofing. There are many types of phishing attempts that do not
rely on spoofing identifiers.

There are many different tacks that can be taken to make spoofing more difficult.

Among them, for critical identifiers:
1) allow only a restricted repertoire
2) disallow certain sequences
3) use a registry and
   3a) define sets of labels that overlap (variant sets)
   3b) restrict actual labels to be in disjoint sets
       (one label blocks all others in the same variant set)

The ICANN work on creating label generation rules attempts to implement
these strategies (currently for 28 scripts in the Root Zone of the DNS). The
work on the first half dozen scripts is basically completed.
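To make the first two strategies concrete, here is a toy sketch in Python. The repertoire and the disallowed-sequence rules are invented for illustration; actual Root Zone LGRs are expressed in the XML format of RFC 7940, not in code like this:

    import re
    import unicodedata

    # Strategy 1: a restricted repertoire (hypothetical; real LGRs are per-script).
    ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

    # Strategy 2: disallowed sequences (illustrative LDH-style hyphen rules).
    DISALLOWED_SEQ = re.compile(r"^-|--|-$")

    def label_ok(label: str) -> bool:
        label = unicodedata.normalize("NFC", label)
        if not set(label) <= ALLOWED:        # out-of-repertoire code point
            return False
        if DISALLOWED_SEQ.search(label):     # forbidden sequence
            return False
        return True

    print(label_ok("example"))    # True
    print(label_ok("ex--ample"))  # False: "--" is disallowed in this toy rule set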

The Unicode standard does define what
short sequences of characters mean.  The problem is that, outside
the Apple world, it then seems to be left to Microsoft to decide
which longer sequences they will allow.

MS and Apple are not the only ones writing renderers.

The advantage of the text I brought to your attention is the way it
is formalized and that it was created with local expertise. The
disadvantage, from your perspective, is that its scope does not match
your intended use case.
Perhaps ICANN will be the industry-wide definer.  However, to stay with
Indic rendering, one may have cases where CVC and CCV orthographic
syllables have little to no visible difference.  The Khmer writing
system once made much greater use of CVC syllables.  For reproducing
older texts, one might be forced to encode phonetic CVC as though it
were CCV.

The restrictions on sequences that are appropriate as an anti-spoofing measure are not appropriate for general encoded text! For one, the Root Zone explicitly disallows anything that is not in "widespread everyday" use, which rules out most transcriptions of "historic" texts, as well as religious or technical (phonetic) notations and transcriptions.

But restricting repertoire and sequences only goes so far. You will always have a residual set of labels that overlap to the degree that users cannot reliably distinguish them (actually, many disjoint sets of overlapping labels). The hard core of these are labels that appear (practically) identical. Around them lies a further halo of more or less confusable labels.

Mathematically, these two behave differently: the relation among (practically) identical labels is symmetric and transitive, while mere similarity is symmetric but not transitive. Transitivity means that if A is equivalent to B, and B to C, then A is equivalent to C. For merely similar labels, however, there is a non-zero "similarity distance", if you will. If you try to chain similarity together as if it were transitive, you may exceed the similarity threshold: the end points (A and C above) may each be similar to B, yet not be (sufficiently) similar to each other.
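A toy demonstration of the non-transitivity, using plain edit distance as a stand-in for whatever similarity metric one actually uses (the strings and the threshold are arbitrary):

    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                  # deletion
                               cur[j - 1] + 1,               # insertion
                               prev[j - 1] + (ca != cb)))    # substitution
            prev = cur
        return prev[-1]

    THRESHOLD = 1  # labels within this distance count as "similar"

    def similar(a: str, b: str) -> bool:
        return levenshtein(a, b) <= THRESHOLD  # symmetric, by construction

    A, B, C = "abcd", "abce", "abfe"
    print(similar(A, B), similar(B, C), similar(A, C))  # True True False

A is similar to B, and B to C, but chaining the two steps exceeds the threshold, so A and C fall apart.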

The project I'm involved in tackles only transitive forms of equivalence (whether visual or semantic).
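Transitive equivalence is what makes this tractable: pairwise variant relations close into disjoint equivalence classes, which a standard union-find computes directly. A minimal sketch, with placeholder labels:

    class UnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    uf = UnionFind()
    for a, b in [("A", "B"), ("B", "C")]:   # A equivalent to B, and B to C ...
        uf.union(a, b)
    print(uf.find("A") == uf.find("C"))     # True: ... therefore A to C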

Collisions based on these equivalences can be handled with label generation rulesets defined per RFC 7940, which allow registration policies to be automated.
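A hedged sketch of the blocking policy in 3b) above: fold every label to a representative of its variant set, and let the first registration block the rest of the set. The variant map below is a made-up placeholder, not the RFC 7940 XML format that actual registries would consume:

    # Hypothetical variant folding: code point -> representative of its set.
    VARIANTS = {
        "1": "l",   # illustrative only; real variant sets are per-script
        "I": "l",
    }

    def canonical(label: str) -> str:
        """Fold each code point to its variant-set representative."""
        return "".join(VARIANTS.get(ch, ch) for ch in label)

    registered = set()

    def try_register(label: str) -> bool:
        key = canonical(label)
        if key in registered:       # some variant is already delegated
            return False            # the whole variant set is blocked
        registered.add(key)
        return True

    print(try_register("paypal"))   # True
    print(try_register("paypa1"))   # False: collides with paypal's variant set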

The further "halo" of "merely" similar labels needs to be handled with additional technology that can handle concepts like similarity distance.

From a Unicode perspective, there is a virtue in not over-specifying sequences: you don't want to be caught having to re-encode entire scripts should the conventions for using the elements that make up the script change in an orthography reform!

That does not mean that Unicode (at all times) endorses all permutations of free-form sequences as equally valid.

A./

This is already the case, through error rather than design,
with the Thai script in Tai Tham.  This affects about 30% of the
Northern Thai lexicon*, and, I believe, an even higher proportion when
adjusted for word frequency. Now, to fight phishing, I have always
believed that some brutal folding would be required for Tai Tham, which
is why I suggested that the S.SA ligature be encoded (U+1A54 TAI THAM
LETTER GREAT SA).

*I've sampled the MFL dictionary.  I suspect a bias to untruncated forms
in loans from Pali, such as _kathina_ rather than _kathin_.  If my
suspicion is correct, the proportion would be even higher.

However, I believe there is some advantage in distinguishing CVC and
CCV at the code level, even where there is no visual difference.  To
display small visual differences, perhaps we will be forced to beg for
mark-up to make the distinction visible.

In Tai Tham, there are very few CCV-CVC visual homographs in native
words because of the phonological structure of Northern Thai, and one
can usually guess whether the word is CCV or CVC.

Richard.

