FYI. A summary of the Unicode Normalization issue by Addison Phillips.
Begin forwarded message:
From: "ext Phillips, Addison" <addi...@amazon.com>
Date: February 12, 2009 6:47:14 PM EST
Cc: "public-i18n-c...@w3.org" <public-i18n-c...@w3.org>
Subject: normalization issue summary for HCG
Archived-At: <http://www.w3.org/mid/4d25f22093241741bc1d0eebc2dbb1da0181c82...@ex-sea5-d.ant.amazon.com>
All,
Recently, the I18N Core WG raised an issue with Selectors-API and
consequently with CSS Selectors related to Unicode Normalization.
This note is an attempt to summarize the issue and its attendant
history at W3C, as well as to explain the current I18N Core
approach to this topic. I am bringing this issue to HTML-CG so that
all the affected WGs can be aware of the debate. I realize that
this email is very lengthy, but I think it is necessary to fully
summarize the issue. Some have wondered about what, exactly, "those
I18N guys want from us" and I think it is important to provide a
clear position.
At present, I don't have a fixed WG position, since the I18N WG is
carefully reviewing its stance. However, this is an issue of
sufficient importance and potential impact that we felt it
important to reach out to the broader community now. A summary
follows:
---
First, some background about Unicode:
Unicode encodes a number of compatibility characters whose purpose
is to enable interchange with legacy encodings and systems. In
addition, Unicode uses characters called "combining marks" in a
number of scripts to compose glyphs (visual textual units) that
have more than one logical "piece" to them. Both of these cases
mean that a semantically identical "character" can be
encoded by more than one Unicode character sequence.
A trivial example would be the character 'é' (a latin small letter
'e' with an acute accent). It can be encoded as U+00E9 or as the
sequence U+0065 U+0301. Unicode defines a concept called 'canonical
equivalence' in which two strings may be said to be equal
semantically even if they do not use the same Unicode code points
(characters) in the same order to encode the text. It also defines
a canonical decomposition and several normalization forms so that
software can transform any canonically equivalent strings into the
same code point sequences for comparison and processing purposes.
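To make this concrete, Python's standard unicodedata module can demonstrate the equivalence (a minimal sketch using the 'é' example above):

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point, U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The raw code point sequences differ...
assert precomposed != decomposed

# ...but both normalize to the same NFC form (the precomposed character)
assert unicodedata.normalize("NFC", decomposed) == precomposed

# ...and to the same NFD form (the decomposed sequence)
assert unicodedata.normalize("NFD", precomposed) == decomposed
```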
Unicode also defines an additional level of decomposition, called
'compatibility decomposition', for which additional normalization
forms exist. Many compatibility characters, unsurprisingly, have
compatibility decompositions. However, characters that share a
compatibility decomposition are not considered canonically
equivalent. An example of this is the character U+2460 (CIRCLED
DIGIT ONE). It has a compatibility decomposition to U+0031 (the
digit '1'), but it is not considered canonically equivalent to the
number '1' in a string of text.
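A short sketch of the distinction, again using Python's unicodedata: the circled digit survives the canonical forms (NFC/NFD) unchanged, and only the compatibility forms (NFKC/NFKD) map it to the plain digit:

```python
import unicodedata

circled_one = "\u2460"  # CIRCLED DIGIT ONE

# No canonical decomposition: the canonical forms leave it unchanged
assert unicodedata.normalize("NFC", circled_one) == circled_one
assert unicodedata.normalize("NFD", circled_one) == circled_one

# The compatibility forms map it to the plain digit '1'
assert unicodedata.normalize("NFKC", circled_one) == "1"
assert unicodedata.normalize("NFKD", circled_one) == "1"
```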
For the purposes of this document, we are discussing only canonical
equivalence and canonical decompositions.
Unicode has two particular rules about the handling of canonical
equivalence that concern us here: C6 and C7. C6 says that
implementations shall not assume that two canonically equivalent
sequences are distinct [there exist reasons why one might not
conform to C6]. C7 says that *any* process may replace a character
sequence with its canonically equivalent sequence.
Normalization @ W3C:
Which brings us to XML, HTML, CSS, and so forth. I will use XML as
the main example here, since it is the REC that most directly
addresses these issues. The other standards mainly ignore the issue
(they don't require normalization, but they mostly don't address it
either).
XML, like many other W3C RECs, uses the Universal Character Set
(Unicode) as one of its basic foundations. An XML document, in
fact, is a sequence of Unicode code points, even though the actual
bits and bytes of a serialized XML file might use some other
character encoding. Processing is always specified in terms of
Unicode code points (logical characters). Like other W3C RECs, XML
says nothing about canonical equivalence in Unicode. There exist
recommendations to avoid compatibility characters (which includes
some canonical equivalences) and 'Name' prevents starting the
various named tokens with certain characters which include
combining marks.
Most implementations of XML assume that distinct code point
sequences are actually distinct, which is not in keeping with
Unicode C6 [1]. That is, if I define one element <!ELEMENT é
EMPTY> (using U+00E9) and another element <!ELEMENT é EMPTY>
(using U+0065 U+0301), they are
usually considered to be separate elements, even though both define
an element that looks like <é/> in a document and even though any
text process is allowed to convert one sequence into the other--
according to Unicode. For that matter, a transcoder might produce
either of those sequences when converting a file from a non-Unicode
legacy encoding.
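A minimal illustration of that naive behavior: a plain code-point comparison, which is effectively what most XML processors perform, treats the two visually identical names as distinct:

```python
# Two visually identical element names with different code point sequences
name_precomposed = "\u00e9"   # <é/> encoded with U+00E9
name_decomposed = "e\u0301"   # <é/> encoded with U+0065 U+0301

# A naive code-point comparison sees two distinct element names:
assert name_precomposed != name_decomposed

# They do not even have the same length in code points:
assert (len(name_precomposed), len(name_decomposed)) == (1, 2)
```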
One might think that this would be a serious problem. However, most
software systems consistently use a single Unicode representation
to represent most languages/scripts, even though multiple
representations are theoretically possible in Unicode. This form is
typically very similar to Unicode Normalization Form C (or "NFC"),
in which as many combining marks as possible are combined with base
characters to form a single code point (NFC also specifies the
order in which combining marks that cannot be combined appear; no
Unicode normalization form guarantees that *no* combining marks
will appear, as some languages cannot be encoded at all except via
the use of combining characters). As a result, few users encounter
issues with Unicode canonical equivalence.
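For instance, when two combining marks cannot combine with the base character, normalization still imposes a canonical order on them. A small sketch in Python, using dot-above and dot-below on 'q' (a standard Unicode illustration):

```python
import unicodedata

# 'q' with U+0307 DOT ABOVE then U+0323 DOT BELOW, and the reverse order
s1 = "q\u0307\u0323"
s2 = "q\u0323\u0307"

# Different code point order, so a raw comparison fails:
assert s1 != s2

# NFC (and NFD) reorder the marks by their canonical combining classes,
# so both strings normalize to the same sequence:
assert unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
```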
However, some languages and their writing systems have features
that expose or rely on canonical equivalence. For example, some
languages make use of combining marks and the order of the
combining marks can vary. Other languages use multiple accent marks
and their input systems may pre-compose or not compose characters
depending on the keyboard layout, operating system, fonts, or the
software used to edit text. Vietnamese is an example of this. Since
canonically equivalent text is (supposed to be) visually
indistinguishable, users typically don't care that (for example)
their Mac uses a different code point sequence than their
neighbor's Windows computer. These languages are sensitive to
canonical equivalence and rely on consistent normalization in order
to be used with a technology such as XML. Further, many of the
Ur-technologies are now used in combination. For example, a site might
use XML for data interchange, XSLT to extract the data into an HTML
page for presentation, CSS to style that page, and AJAX for the
user to interact with the page.
With this potential problem in mind, eleven (!) years ago the I18N
WG started to work on a Specification to address the use of Unicode
in W3C specs. This work is collectively called the "Character Model
for the World Wide Web" or 'CharMod' for short [2]. Initially, the
WG approach was to recommend what was termed "early uniform
normalization" (EUN). In EUN, virtually all content and markup was
supposed to be in a single normalization form, specifically NFC.
The WG identified the need for both document formats (e.g. a file
called 'something.xml') to be in NFC as well as the parsed contents
of the document (individual elements, attributes, or content within
the document). This was called "fully normalized" content.
For this recommendation to work, tools, keyboards, operating
environments, text editors, and so on would need to provide for
normalizing data either on input (as with most European languages)
or when processing, saving, or interacting with the resulting data.
Specifications were expected to require normalization whenever
possible. In cases where normalization wasn't required at the
document format level, de-normalized documents would be 'valid',
but could cause warnings to be issued in a validator, or
de-normalized content could be normalized by tools at creation time.
The benefit to this approach was that specifications and certain
classes of implementation could mostly assume that users mostly had
avoided the problems with canonical equivalence by always authoring
documents in the same way. Using this approach, there would be less
need, for example, for CSS Selectors to consider normalization,
since both the style sheet (or other source of the selector) and
the document tree being matched would use the same code point
sequence for the same canonically equivalent text. The user-agent
could just compare strings. In the few cases where this wasn't the
case, the user would be responsible for fixing it, but generally
the user was responsible for carefully having constructed the issue
in the first place (since their tools and formats would normally
have used NFC).
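As a sketch of what "just comparing strings" would have to become without EUN, a hypothetical selector-matching helper (the function name and API are illustrative, not taken from any spec) could normalize both sides before comparing:

```python
import unicodedata

def selector_matches(selector_name: str, element_name: str) -> bool:
    """Hypothetical string-identity match that honors canonical
    equivalence by normalizing both sides to NFC before comparing."""
    return (unicodedata.normalize("NFC", selector_name)
            == unicodedata.normalize("NFC", element_name))

# A style sheet authored on one system and a document authored on
# another may use different but canonically equivalent sequences:
assert selector_matches("\u00e9", "e\u0301")
assert not selector_matches("a", "b")
```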
I18N WG worked on this approach for a number of years while
languages with normalization risks began to develop appreciable
computing support and a nascent Web presence. CharMod contained
strong recommendations towards normalization, but not requirements
(with a notable exception that we'll cover in a moment).
Since it didn't matter so long as content remained normalized and
since the most common languages were normalized by default,
specifications generally didn't require any level of normalization
(although they "should" do so), implementations generally ignored
normalization, tools did not implement it, and so forth.
There was one interesting exception to the recommendations in
CharMod. String identity matching *required* (MUST) the use of
normalization (requirement C312). This nod to canonical equivalence
was also ignored by most specs, implementations, and thus content.
It should be noted that CSS Selectors is a string identity matching
case and not merely one of the "SHOULD" cases.
From being a mostly theoretical problem, normalization has become
something that can be demonstrated in real usage scenarios in real
languages. While only a quite small percentage of total content is
affected, it quite directly impacts specific languages [3]. The
I18N WG is engaged in finding out exactly how prevalent this
problem is. It is possible that, despite having become a real
problem, it is still so strictly limited that it can be dealt with
best via other means than spec-level requirements.
In early 2005, the I18N WG decided that EUN as an approach was not
tenable because so many different system components,
technologies, and so forth would be affected by a "requirement" to
normalize; that some technologies (such as keyboarding systems)
were not up to the task; and that, as a result, content would not
be normalized uniformly. The decision was made to change from
focusing on EUN towards a policy something like:
1. Recommend the use of normalized content ("EUN") as a "best
practice" for content authors and implementers.
2. Since content might not be normalized, require specifications
affected by normalization to address normalization explicitly.
Surprisingly, none of the requirements are actually changed by this
difference in focus. Note that this did not mean that normalization
would be required universally; it only meant that WGs would be
asked to consider the impact or, in some cases, to change their
specification.
In 2008, at the TPAC, I18N WG reviewed the current WD with an eye
towards finally completing the normalization portion of the work (a
separate working group in the Internationalization Activity had
been chartered between 2005 and 2008 to do the work; this working
group expired with no progress and "I18N Core" inherited the
unfinished work). I18N's review revealed that the current document
state was not sufficient for advising spec, content, or
implementation authors about when and how to handle the new
"late(r)" normalization. The same review produced general
acknowledgement that there now existed significant need based on
real content for normalization to be handled by W3C Specs.
At the very end of 2008, I18N WG also reviewed the Selectors-API
draft produced by WebApps. In reviewing this document, the WG noted
that Selectors, upon which the API is based, did not address
normalization. Other recent REC-track documents had also been
advised about normalization and had ended up requiring the use of
NFC internally. However, in the case of Selectors-API, the
selectors in question were in CSS3 and were in a late working draft
state. CSS WG responded to this issue and a long thread has
developed on our combined mail lists, in a wiki, and elsewhere.
Over the past two-plus weeks, the I18N WG has solicited advice and
comments from within its own community, from Unicode, and from the
various CSS (style), XML, and HTML communities. We have embarked on
a full-scale review of what position makes the most sense for the
W3C to hold. In our most recent conference call (11 February), we
asked members and our Unicode liaison to gather information on the
overall scope of the problem on the Web today. We also are
gathering information on the impact of different kinds of
normalization recommendation. We had expected to complete our
review at the 11 February concall, but feel we need an additional
week.
There are a few points of emerging consensus within I18N. In
particular, if normalization is required, such a requirement
probably could be limited to identifier tokens, markup, and other
formal parts of document formats. Content itself should not
generally be required to be normalized (a recommendation should
certainly be made and normalization, of course, is always permitted
by users or some process--see Unicode C7), in part because there
exist use cases for de-normalized content.
The other emerging consensus is that canonical equivalence needs to
be dealt with once and for all. WGs should not have the CharMod
sword hanging over them and implementers and content authors should
get clear guidance. During this review, I18N is considering all
possible positions, from "merely" making normalization a best
practice to advocating the retrofitting of normalization to our
core standards (as appropriate, see above).
One of the oft-cited reasons why normalization should not be
introduced is implementation performance. Assuming, for a moment,
that documents are allowed to be canonicalized for processing
purposes, our experience suggests that overall performance impact
can be limited. There exist strategies for checking and normalizing
data that are very efficient, in part owing to the relative rarity
of denormalized data, even in the affected languages. This document
will not attempt to outline the performance cases for or against
normalization, except to note that performance *is* an important
consideration and *must* be addressed.
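One such strategy is a quick check that skips normalization entirely for already-normalized input. A hypothetical sketch (the helper name is illustrative; unicodedata.is_normalized is available from Python 3.8):

```python
import unicodedata

def to_nfc(s: str) -> str:
    """Normalize only when needed: the quick check avoids allocating
    a new string in the (overwhelmingly common) already-NFC case."""
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

assert to_nfc("caf\u00e9") == "caf\u00e9"   # already NFC: returned as-is
assert to_nfc("cafe\u0301") == "caf\u00e9"  # denormalized: converted
```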
I hope this summary is helpful in discussing normalization. I want
to raise this issue now so that all of the affected parts of the
W3C community can consider this issue and how it affects their
specifications/implementations/tests/tools/etc. As a consensus
(hopefully) emerges (not just within I18N), we should be in a
position to finally resolve the normalization conundrum and
proceed to create a global Web that works well for all the world's
users.
Kind Regards,
Addison (for I18N)
[1] http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf Unicode Conformance
[2] http://www.w3.org/TR/charmod-norm/ CharMod-Normalization part
[3] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0128.html Ishida example
[a] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0182.html and another list
Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG
Internationalization is not a feature.
It is an architecture.