You keep harping on that, but we really had no choice in the matter. The definition of normalization in UAX #15 was internally inconsistent. Certain implementations of the UAX algorithm would exhibit unacceptably aberrant behavior, although only in a small number of degenerate cases, none of which occur in ordinary text. The problems are:

1. Broken Idempotency. A non-idempotent implementation by its very nature cannot be stable, because repeated application of a non-idempotent normalization could produce different results. The application of the inconsistent interpretation therefore causes fundamental problems for implementations, as further outlined in PRI #29; briefly, these are comparable to using a comparison function that isn't transitive when sorting.

2. Broken Canonical Equivalence. The inconsistent interpretation of the old UAX version could "normalize" some text to something that is not canonically equivalent to the input -- it changes some text to some completely different text.

3. Broken Canonical Order. Application of NFC[old UAX] or NFKC[old UAX] produces output that is not only different text (not canonically equivalent) but also not in canonical order. As a result, something returned from a normalization function may not even pass the normalization quick check: NFC_quick_check(NFC(string))=NO.

After carefully evaluating the nature and effects of this inconsistency, the UTC reached a decision to address these problems as follows:

The current version of UAX #15 in Unicode 4.1.0 addresses the internal inconsistency. The changes do not affect any versions of UAX #15 prior to Unicode 4.1.0 and therefore do not affect stringprep or IDN. No backwards-compatibility problems will be introduced as a result of the changes.

Stringprep and IDN rely on the Unicode 3.2 version of UAX #15, which is:

http://www.unicode.org/unicode/reports/tr15/tr15-22.html

Implementations that claim conformance to Unicode 3.2 normalization may not produce identical results in all cases, and may not produce *correct* normalizations, because versions of UAX #15 prior to 4.1.0 have been internally inconsistent. While normalization problems only happen in degenerate cases, the inconsistency in the definition is significant enough that the UTC felt compelled to make the change.
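
To make the three failure modes listed earlier concrete, here is a minimal Python sketch that checks each property for a given string. It relies on the standard `unicodedata` module; the function name `check_nfc_invariants` and the sample list are hypothetical, chosen only for illustration. The two degenerate samples are assumed to be representative of the class of strings PRI #29 is about (a starter, then a non-starter, then a second character that could otherwise compose with the first), and the sketch says nothing about how any particular pre-4.1.0 implementation behaved.

```python
import unicodedata


def check_nfc_invariants(text: str) -> dict:
    """Check the three properties discussed above for NFC on one string."""
    once = unicodedata.normalize("NFC", text)
    twice = unicodedata.normalize("NFC", once)
    return {
        # 1. Idempotency: normalizing a second time must change nothing.
        "idempotent": once == twice,
        # 2. Canonical equivalence: input and output must have the same
        #    canonical decomposition (NFD).
        "canonically_equivalent": (
            unicodedata.normalize("NFD", once)
            == unicodedata.normalize("NFD", text)
        ),
        # 3. The output must itself be recognized as normalized.
        #    unicodedata.is_normalized requires Python 3.8 or later.
        "passes_check": unicodedata.is_normalized("NFC", once),
    }


# Hypothetical sample inputs (assumption: typical of the degenerate cases).
SAMPLES = [
    "\u1100\u0300\u1161",  # Hangul L, combining grave, Hangul V
    "\u0B47\u0300\u0B3E",  # Oriya vowel sign E, combining grave, vowel sign AA
    "e\u0301",             # ordinary text: e + combining acute
]

if __name__ == "__main__":
    for sample in SAMPLES:
        escaped = sample.encode("unicode_escape").decode("ascii")
        print(escaped, check_nfc_invariants(sample))
```

An implementation that follows the corrected definition should report all three properties as holding for every input; a `False` in any column would indicate one of the problems enumerated above.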

During deliberations, the UTC did discuss stability policies in the standard, and concluded that this inconsistency itself is unstable; it led to demonstrably divergent implementations, and could not stand without correction.

In addition to the new 4.1.0 version of UAX #15, the UTC decided to issue a corrigendum which can be applied to other versions of Unicode. None of the prior versions of the Unicode Standard or its annexes will be changed in any way. Any implementation that claims conformance to Unicode 3.2 can stay precisely the same. Only if an implementation claims conformance to 3.2 plus the new corrigendum, or to version 4.1.0 or later of Unicode, would it change. So the current stringprep and IDN are not affected.

When it comes time to update stringprep to a new version of Unicode, such as 4.1.0, there are two paths that the IETF can take: (a) simply update to the newer version, or (b) specify a method which takes the previous algorithm and applies it to the new Unicode data.

Option (a) sacrifices some compatibility, although (1) strings that have already been stringprepped *once* with the old version will have the same results under either version, and (2) the UTC does not expect any real data to contain the degenerate cases that trigger the problem.

The UTC strongly recommends against Option (b). While it maintains backwards compatibility, it does not fix the underlying problems: two successive applications of stringprep can still result in different strings.

And if you look carefully at the stability requirements, you see:

"If a string contains only characters from a given version of the Unicode Standard (e.g., Unicode 3.1.1), and it is put into a normalized form in accordance with that version of Unicode, then it will be in normalized form according to any past or future versions of Unicode."

That remains true even after applying PRI #29.

It would also be interesting to me to see the level of stability that is guaranteed by the other organizations.
I know that there are W3C Recommendations that do not maintain perfect stability. How about the IETF? Is there a policy that any RFC that obsoletes another RFC is required to be absolutely -- bug-for-bug -- backwards compatible?

Mark

----- Original Message -----
From: "Simon Josefsson" <[EMAIL PROTECTED]>
To: "Erik van der Poel" <[EMAIL PROTECTED]>
Cc: <[email protected]>
Sent: Saturday, March 12, 2005 03:04
Subject: [idn] Re: stability

> Erik van der Poel <[EMAIL PROTECTED]> writes:
>
> > All,
> >
> > This is probably well known to most of you, but the General Category
> > Value in the Unicode Character Database and the stability of that value
> > are not very relevant to IDNA, which does not depend on the Unicode
> > Categories.
> >
> > IDNA depends on the Unicode Normalization Form KC table, and there have
> > been very few changes indeed in this table:
> >
> > http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt
>
> Don't forget the normalization flaw in Unicode 3.2 NFKC discussed in:
>
> http://www.unicode.org/review/pr-29.html
>
> Apparently the recommendation will be applied to future Unicode
> versions.
>
> PR-29 doesn't merely affect a small set of code points, but rather a
> class of strings. The special strings are all unstable under NFKC3.2.
>
> I think PR-29 is a useful example to consider when deciding how much
> trust you should place in the UTC's stability guarantees. The UTC's
> track record in this area suggest to me that the guarantee is
> worthless in practice. I haven't seen an evaluation of alternative
> solutions to the PR-29 problem. Not even signs that alternative
> approaches were considered. I would have expected both.
>
> > Also, IDNA apps depend on tables for converting from various non-Unicode
> > encodings to Unicode. This is another place where instability could
> > affect lookups, potentially even in dangerous ways. Stringprep and IDNA
> > already mention this issue in their Security Considerations sections.
>
> Right.
>
> Thanks,
> Simon
