In TUS 4.0 Section 5.3, p.111, the following is stated of default ignorable code points:

These characters are also ignored except with respect to specific, defined processes; for example, ZERO WIDTH NON-JOINER is ignored in collation. ... For more information, see Section 5.20, Default Ignorable Code Points.


But in Section 5.20, although there is a lot about rendering default ignorable code points, there is no further information about any other processing of them. The implication of that section seems to be that these characters are intended to be ignored in rendering but not in other processes such as collation. Which is in fact to be taken as the intention of the standard: this implication, or the summary in Section 5.3? Has the summary simply not been updated for consistency with the fuller details? Or has the fuller description been unintentionally restricted to rendering?

Is it in fact the intention that all default ignorable characters must always be ignored in collation? Or is it possible to tailor collation not to ignore them? The collation algorithm seems to suggest the latter, in that there seems to be no mention of these characters being obligatorily ignored - although I presume they have zero weight by default (in DUCET).
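
To make the question concrete, here is a minimal sketch in Python of the two behaviours I am asking about. It is not the collation algorithm: the key is plain code point order, `sort_key` and `SAMPLE_IGNORABLES` are names I have made up for illustration, and the set holds only a few default ignorable code points (SOFT HYPHEN, ZWNJ, ZWJ and the first block of variation selectors), not the full list from the character database.

```python
# Hand-picked sample of default ignorable code points (an assumption for
# illustration; the full set is the Default_Ignorable_Code_Point property).
SAMPLE_IGNORABLES = {0x00AD, 0x200C, 0x200D} | set(range(0xFE00, 0xFE10))


def sort_key(text, ignore=True):
    """Hypothetical comparison key: code point order, not real collation weights.

    With ignore=True the sample ignorables are dropped from the key, which
    mimics giving them zero weight. With ignore=False they are kept, which
    mimics a tailoring that makes them significant.
    """
    if ignore:
        return [ord(c) for c in text if ord(c) not in SAMPLE_IGNORABLES]
    return [ord(c) for c in text]


plain = "ab"
with_zwnj = "a\u200cb"  # same letters with ZERO WIDTH NON-JOINER between them

# Ignored: the two strings get identical keys and so compare as equal.
same_when_ignored = sort_key(plain) == sort_key(with_zwnj)

# Not ignored: the keys differ, so the strings are distinguished.
same_when_tailored = sort_key(plain, ignore=False) == sort_key(with_zwnj, ignore=False)

print(same_when_ignored, same_when_tailored)
```

If ignoring is obligatory, only the first behaviour would be conformant; if tailoring is permitted, an implementation could choose the second.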

This has some quite serious implications for the processing of texts that include ZW(N)J, variation selectors etc.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




