RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

Philippe Verdy Thu, 11 Dec 2003 19:35:45 -0800

Peter Kirk wrote:
> I am sure that some tricks could be found to 
> simplify the indexing if necessary, e.g. using PUA or non-character code 
> points indexed into a special table to replace DGCs which cannot be 
> represented as a single character. (There are plenty of non-characters 
> available as you need to use UTF-32 here to avoid exactly the same 
> problems with surrogates.)


You're quite optimistic here: the total number of DGCs that can be encoded
in Unicode goes far beyond the capacity of PUAs and even of the whole
Unicode range itself.

I did not try to count them, even for the simplest cases, but the set of
possible DGCs is effectively unbounded:
- there's no upper limit on the number of diacritics you can combine with a
base character
- there's no limit on the number of base characters that can be used to
build Hangul syllables.
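
To make the counting argument concrete, here is a minimal Python sketch. The
helper names (simple_clusters, raw_sequences) and the figures of 100 bases
and 50 marks are illustrative assumptions, and the segmenter only
approximates a DGC as "base plus trailing characters with a non-zero
combining class"; it is not the full default grapheme cluster algorithm.

```python
import unicodedata

# Private-use capacity: U+E000..U+F8FF plus planes 15 and 16
# (each plane minus its two trailing noncharacters).
PUA_CAPACITY = 6400 + 2 * 65534
UNICODE_RANGE = 0x110000  # all code points, U+0000..U+10FFFF

def simple_clusters(text):
    """Group each base with its trailing combining marks.

    Approximation only: uses the canonical combining class, so marks
    with class 0 (and Hangul jamo sequences) are not handled.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def raw_sequences(bases, marks, depth):
    """Count raw base + `depth` marks sequences (ignores canonical equivalence)."""
    return bases * marks ** depth

# One base followed by 50 combining acute accents is still a single cluster.
stacked = "a" + "\u0301" * 50
clusters = simple_clusters(stacked + "b")

# Hypothetical repertoire: 100 bases, 50 marks, 3 marks per base.
count = raw_sequences(100, 50, 3)
```

With these assumed figures, count already exceeds both PUA_CAPACITY and
UNICODE_RANGE, and it keeps growing with every additional mark allowed.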

So how will you allocate PUAs? By using an internal lookup table, stored
with the document that uses these PUAs, which maps only the DGCs actually
used in that document to single PUAs? Then how will you implement indexing
with these private PUAs, whose semantics change from one document to the
next? What is the relevant scope for these PUAs?
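
The scoping problem can be sketched as follows. This is only an
illustration under my own assumptions: the class DocumentPuaTable and its
allocation order (first unseen DGC gets U+E000, the next U+E001, and so on)
are hypothetical, not something proposed in the quoted message.

```python
PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area

class DocumentPuaTable:
    """Hypothetical per-document table mapping multi-code-point DGCs to PUAs."""

    def __init__(self):
        self._to_pua = {}
        self._from_pua = {}
        self._next = PUA_START

    def encode(self, cluster):
        """Return a single code point standing for `cluster`."""
        if len(cluster) == 1:
            return cluster  # already a single code point
        if cluster not in self._to_pua:
            if self._next > PUA_END:
                raise OverflowError("no private-use code point left")
            pua = chr(self._next)
            self._next += 1
            self._to_pua[cluster] = pua
            self._from_pua[pua] = cluster
        return self._to_pua[cluster]

    def decode(self, ch):
        """Expand a PUA allocated by this table; pass anything else through."""
        return self._from_pua.get(ch, ch)

doc_a = DocumentPuaTable()
doc_b = DocumentPuaTable()
pua_a = doc_a.encode("e\u0301\u0323")  # e + acute + dot below
pua_b = doc_b.encode("o\u0308\u0304")  # o + diaeresis + macron
```

Both documents end up using U+E000, but for different DGCs, so an index
built over one document's PUAs says nothing useful about the other unless
the lookup table travels with every string.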

To me it seems simpler (and more interoperable, and easier to integrate
within compound documents) to avoid PUAs in all cases where the text can be
encoded using DGCs made of assigned code points.

Use of PUAs is a convenient tool to assign glyph IDs within fonts that
implement contextual forms referenced in the internal font lookup tables,
when these tables can be processed by an external renderer to select
contextual glyphs or to control ligation or kerning. The scope of these PUAs
is strictly limited to the font that uses them, which allows a renderer to
create Unicode strings that will finally be rendered using a basic string
renderer.

Other uses of PUAs have a limited scope, tied to specific standards, APIs
or protocol layers in which they may be used to include some "markup" data
within a stream of Unicode characters. These PUAs are internal to the
process that uses them, and the public external interface will simply
ignore, drop or reject external strings containing any colliding PUA whose
semantics are not certified to match those of the protocol scope. The other
option is to remap external PUAs to REPLACEMENT CHARACTER if they collide
with internal PUAs, and to signal to the external program using that
interface that transmission or usage of these external PUAs is not
supported, or that the interface can cause data loss.
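
A rough sketch of such an interface boundary, in Python. The function name
import_external, the policy names and the returned data-loss flag are
assumptions made for illustration; a real protocol would define its own way
of signalling that external PUAs are unsupported.

```python
REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER

def import_external(text, internal_puas, policy="replace"):
    """Filter an external string at the public interface.

    `internal_puas` is the set of private-use characters reserved by the
    internal process. Returns (cleaned_text, data_lost).
    """
    if not any(ch in internal_puas for ch in text):
        return text, False  # no collision: pass through unchanged
    if policy == "reject":
        raise ValueError("external string uses a reserved private-use code point")
    if policy == "drop":
        cleaned = "".join(ch for ch in text if ch not in internal_puas)
    elif policy == "replace":
        cleaned = "".join(
            REPLACEMENT if ch in internal_puas else ch for ch in text
        )
    else:
        raise ValueError("unknown policy: " + policy)
    # The flag tells the caller that the interface altered its data.
    return cleaned, True

reserved = {"\ue000"}
replaced = import_external("ab\ue000c", reserved)
dropped = import_external("ab\ue000c", reserved, policy="drop")
untouched = import_external("abc\ue001", reserved)
```

Note that external PUAs which do not collide with the reserved set pass
through here unchanged; whether that is acceptable depends on the protocol.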

