Peter Kirk wrote:
> I am sure that some tricks could be found to
> simplify the indexing if necessary, e.g. using PUA or non-character code
> points indexed into a special table to replace DGCs which cannot be
> represented as a single character. (There are plenty of non-characters
> available as you need to use UTF-32 here to avoid exactly the same
> problems with surrogates.)

You're quite optimistic here: the total number of DGCs that can be encoded in Unicode goes far beyond the capacity of the PUAs, and even of the whole Unicode code space itself. I did not try to count them, even for the simplest cases, but the set of possible DGCs is practically unbounded:

- there's no upper limit on the number of diacritics you can combine with a base character;
- there's no limit on the number of base characters that can be used to build Hangul syllables.

So how will you allocate PUAs? With an internal lookup table, stored with the document that uses these PUAs, which maps only the DGCs actually used in that document to single PUA code points? Then how will you implement indexing with these private PUAs, whose semantics change from one document to the next? What is the relevant scope for these PUAs?

To me it seems simpler (and more interoperable, or easier to integrate within compound documents) to avoid PUAs in all cases where the text can be encoded using DGCs made of assigned code points.

PUAs are a convenient tool for assigning glyph IDs within fonts that implement contextual forms referenced in the font's internal lookup tables, when these tables can be processed by an external renderer to select contextual glyphs or to control ligation or kerning. The scope of these PUAs is limited to the font that uses them; they allow a renderer to create Unicode strings that will finally be rendered using a basic string renderer.

Other uses of PUAs have a limited scope tied to specific standards, APIs or protocol layers, in which they may be used to include some "markup" data within a stream of Unicode characters.
These PUAs are internal to the process that uses them, and the public external interface will simply ignore, drop or reject external strings containing any colliding PUA whose semantics are not certified to match those of the protocol scope. The other option is to remap external PUAs to REPLACEMENT CHARACTER if they collide with internal PUAs, and to signal to the external program using that interface that transmission or usage of these external PUAs is not supported, or that the interface can cause data loss.
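
To make the scope problem concrete, here is a rough Python sketch of both points: a per-document table that substitutes PUA code points for multi-code-point clusters, and a boundary filter that remaps incoming PUAs to REPLACEMENT CHARACTER. The function names (clusters, encode_with_pua, sanitize_external) are made up for this illustration, and the clustering is deliberately simplified (a base character plus any following characters with a non-zero combining class), not the full default grapheme cluster rules.

```python
import unicodedata

PUA_START, PUA_END = 0xE000, 0xF8FF   # BMP Private Use Area
REPLACEMENT = "\uFFFD"                # REPLACEMENT CHARACTER


def clusters(text):
    """Group each character with the combining marks that follow it.

    Simplification for illustration: only the canonical combining class
    is checked, so Hangul jamo sequences and the like are not handled.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def encode_with_pua(text):
    """Replace every multi-code-point cluster by a PUA code point.

    Returns the encoded string and the per-document table needed to
    decode it (PUA character -> original cluster).
    """
    table = {}
    encoded = []
    for cluster in clusters(text):
        if len(cluster) == 1:
            encoded.append(cluster)
            continue
        if cluster not in table:
            code = PUA_START + len(table)
            if code > PUA_END:
                raise OverflowError("BMP PUA exhausted for this document")
            table[cluster] = chr(code)
        encoded.append(table[cluster])
    return "".join(encoded), {pua: c for c, pua in table.items()}


def sanitize_external(text):
    """Remap private-use code points in incoming text to U+FFFD.

    Returns the cleaned string and a flag telling the caller whether
    data was lost. General category "Co" covers all private-use areas.
    """
    lost = False
    cleaned = []
    for ch in text:
        if unicodedata.category(ch) == "Co":
            cleaned.append(REPLACEMENT)
            lost = True
        else:
            cleaned.append(ch)
    return "".join(cleaned), lost


if __name__ == "__main__":
    enc_a, table_a = encode_with_pua("q\u0301x")        # q + acute, x
    enc_b, table_b = encode_with_pua("x\u0323\u0302y")  # x + dot below + circumflex, y
    # Both documents use the same PUA code point for different clusters.
    print(hex(ord(enc_a[0])), [hex(ord(c)) for c in table_a[enc_a[0]]])
    print(hex(ord(enc_b[0])), [hex(ord(c)) for c in table_b[enc_b[0]]])
    print(sanitize_external(enc_a))
```

The two documents end up assigning the same first PUA code point to two unrelated clusters, so an index built over the encoded strings is meaningless without the table of the document it came from. That is why the filter at the interface cannot keep foreign PUAs: it can only drop or replace them and report the loss.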