From: "Doug Ewell" <[EMAIL PROTECTED]>
My impression is that Unicode and ISO/IEC 10646 are two distinct
standards, administered respectively by UTC and ISO/IEC JTC1/SC2/WG2,
which have pledged to work together to keep the standards perfectly
aligned and interoperable, because it would be destructive to both
standards to do otherwise.  I don't think of it at all as the "slave and
master" relationship Philippe describes.

Perhaps not with all the assumptions one attaches to "slave and master", but it is still true that there can be only one standards body for the character repertoire, and one formal process for the addition of new characters, even if two standards bodies are *working* (I don't say *deciding*) in cooperation.


The alternative would have been for the UTC and WG2 each to be allocated some code space in which to make the assignments they want, but with the risk of duplicate allocations. I really prefer to view the system as a "master and slave" relationship, because it gives a simpler picture of how characters come to be assigned in the common repertoire.

For example, Unicode has no more rights than the national standardization bodies involved in ISO/IEC JTC1/SC2/WG2. All of them make proposals, amend proposals, suggest modifications, or negotiate to turn informal drafts into a final specification. All I see in the Unicode standardization process is that it finally approves a proposal; but Unicode cannot declare it standard until there has been formal agreement at WG2, which really rules the effective allocations in the common repertoire, even if most of the preparation work, creating the finalized proposal, has been heavily discussed within the UTC, with Unicode partners or with ISO/IEC members.

At the same time, WG2 also studies proposals made by other standardization bodies, including specifications prepared by other ISO working groups or by national standardization bodies. Unicode is not the only approved source of proposals and specifications for WG2 (and I tend to think that Unicode best represents the interests of private companies, whilst national bodies are most often better represented by their permanent membership at ISO, where they have full rights to vote on or veto proposals according to their national interests...)

The Unicode standard itself agrees to follow the ISO/IEC 10646 allocations in the repertoire (character names, representative glyphs, code points, and code blocks); in exchange, ISO/IEC has agreed with Unicode not to decide character properties or behavior, which are defined either by Unicode, or by national standards based on the ISO/IEC 10646 coded repertoire (for example the Chinese GB 18030 standard), or by other ISO standards like ISO 646 and ISO 8859.

So, even if the UTC decides to veto a proposal submitted by Unicode members, nothing prevents those same members from finding allies among national standards bodies, so that they submit the (modified) proposal directly to ISO/IEC 10646, instead of through Unicode, which refused to transmit it.

Consider a recent example: the UTC voted against the allocation of a new invisible character with the properties of a letter, zero width, and the same break opportunities as letters, considering that the existing NBSP was enough, despite the various complexities caused by the normative properties of NBSP when it is used as a base character for combining diacritics. This proposal (previously under informal discussion) was rejected by the UTC, but that leaves Indian and Israeli standards with complex problems for which Unicode proposes no easy solution.
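The NBSP complexity mentioned here is easy to observe: NBSP carries the normative properties of a space separator, not of a letter, even when it serves as the visible base for an isolated combining mark. A small check with Python's unicodedata:

```python
import unicodedata

nbsp = "\u00a0"  # NO-BREAK SPACE
mark = "\u0901"  # DEVANAGARI SIGN CANDRABINDU

# NBSP is classified as a space separator, not a letter, so its
# normative properties differ from those of a real base letter:
print(unicodedata.category(nbsp))  # Zs (Separator, space)

# Yet a combining mark placed after it still attaches graphically:
isolated = nbsp + mark
print(unicodedata.category(mark))  # Mn (Mark, nonspacing)
```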

So nothing prevents India and Israel from reformulating the proposal at ISO/IEC WG2, which may then accept it even though Unicode previously voted against it. If WG2 accepts the proposal, Unicode will have no choice but to accept it into the repertoire, and so to give the new character correct properties. Such a proposal will be accepted easily by WG2 if India and Israel demonstrate that the allocation allows distinctions which are tricky, computationally difficult, or ambiguous to resolve when using NBSP. With a new distinct character, on the other hand, ISO/IEC 10646 members can demonstrate to Unicode that defining its Unicode properties is not difficult, and that it simplifies the correct representation of complex cases found in large text corpora.

Unicode may consider this a duplicate allocation, because there will be cases where two encodings are possible, but the two do not carry the same difficulties for implementations of applications like full-text search, collation, or determination of break opportunities, notably in the many cases where the current Unicode rules already contradict the normative behavior of existing national standards (like ISCII in India). My opinion is that both encodings will survive, but text encoded with the new preferred character will be easier to process correctly; over time, the legacy encodings using NBSP would be deprecated by usage, making the duplication less of a critical issue for the many applications that are written, for simplicity, with only partial implementations of the Unicode properties... Legacy encodings would still exist, but users of these encoded texts would have the option of recoding them to match the new preferred encoding, without changing their applications.
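The recoding step suggested above could be a simple one-pass filter. The sketch below is purely hypothetical: no such character has been allocated, so it borrows a Private Use code point (U+F8FF) as a stand-in for the proposed invisible base character.

```python
import unicodedata

# Hypothetical stand-in: the proposed character has no allocation, so a
# Private Use code point is used here purely for illustration.
NEW_BASE = "\uf8ff"

def recode(text: str) -> str:
    """Replace NBSP with the new base character, but only where NBSP is
    immediately followed by a combining mark; ordinary no-break spaces
    between words are left untouched."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "\u00a0" and nxt and unicodedata.category(nxt).startswith("M"):
            out.append(NEW_BASE)
        else:
            out.append(ch)
    return "".join(out)

# NBSP + acute accent is recoded; the plain NBSP between words is kept:
print(recode("x\u00a0\u0301 a\u00a0b"))
```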

Unicode already has tons of apparent duplicate encodings. See, for example, the non-canonically-equivalent strings that can be created with multiple diacritics of the same combining class, even though they cannot be made visually distinct (for example with some Indic vowels, or with the presentation of diacritics like the cedilla on some Latin letters); see also the characters that should have been defined as canonically equivalent but now are not, because Unicode has made its string equivalence classes irrevocable, i.e. "stable", under an agreement signed with other standards bodies. Some purists may think that adding new apparent duplicates is a problem, but it is less of a problem when users of the national standards that govern some scripts are exposed to tricky problems or ambiguities with the legacy encoding that simply do not appear with the new, separately allocated encoding.
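Those non-canonically-equivalent duplicates can be demonstrated directly: normalization never reorders two combining marks that share the same combining class, so the two orderings below remain distinct strings even though renderers typically stack them identically. A sketch with Python's unicodedata:

```python
import unicodedata

# U+0300 (grave) and U+0301 (acute) share combining class 230, so
# canonical ordering never swaps them:
print(unicodedata.combining("\u0300"), unicodedata.combining("\u0301"))  # 230 230

s1 = "a\u0301\u0300"  # a + acute + grave
s2 = "a\u0300\u0301"  # a + grave + acute

# The sequences are not canonically equivalent in any normalization form:
print(unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2))  # False
```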

The interests of Unicode and ISO/IEC 10646 diverge: Unicode works so that the common repertoire can be handled by the existing software of its private-sector members, while ISO/IEC 10646 members are concerned first with the correct representation of their national languages, without loss of semantics.

In some cases, this correct representation conflicts with the simplest forms of implementation in Unicode-enabled software, requiring unjustifiably large datasets to handle many exceptions; in the absence of such a dataset, the text is given wrong interpretations, so that text processing loses or changes part of its semantics. (Note that many of the ambiguities come from the Unicode standard itself, as is the case for the normative behavior of NBSP at the beginning of a word, or after a breakable SPACE... sometimes because of omissions in past versions of the standard, or because of unsuspected errors...)

The easiest solution to this problem is to make it simpler to handle, using separate encodings where this resolves the difficult ambiguities (notably where it is ambiguous which Unicode version, or which of its addenda or corrigenda, was in effect when the text was encoded), and then to publish a guide that clearly separates the interpretations (semantics) of texts coded with the legacy character from those coded with the new apparent "duplicate" character.

The complex solution is to modify the Unicode algorithms, and this may be even more difficult when it touches the Unicode core standard, one of its standard annexes, one of the normative character properties (like general categories or combining classes), or the script classification of characters (script-specific versus common).



