FYI. A summary of the Unicode Normalization issue by Addison Phillips.
Begin forwarded message:
From: "ext Phillips, Addison" <addi...@amazon.com>
Date: February 12, 2009 6:47:14 PM EST
Cc: "public-i18n-c...@w3.org" <public-i18n-c...@w3.org>
Subject: normalization issue summary for HCG
Archived-At: <http://www.w3.org/mid/4d25f22093241741bc1d0eebc2dbb1da0181c82...@ex-sea5-d.ant.amazon.com>
All,
Recently, the I18N Core WG raised an issue with Selectors-API and
consequently with CSS Selectors related to Unicode Normalization.
This note is an attempt to summarize the issue and its attendant
history at W3C, as well as to explain the current I18N Core
approach to this topic. I am bringing this issue to HTML-CG so that
all the affected WGs can be aware of the debate. I realize that
this email is very lengthy, but I think it is necessary to fully
summarize the issue. Some have wondered about what, exactly, "those
I18N guys want from us" and I think it is important to provide a
clear position.
At present, I don't have a fixed WG position, since the I18N WG is
carefully reviewing its stance. However, this is an issue of
sufficient importance and potential impact that we felt it
important to reach out to the broader community now. A summary
follows:
---
First, some background about Unicode:
Unicode encodes a number of compatibility characters whose purpose
is to enable interchange with legacy encodings and systems. In
addition, Unicode uses characters called "combining marks" in a
number of scripts to compose glyphs (visual textual units) that
have more than one logical "piece" to them. Both of these cases
mean that a semantically identical "character" can be
encoded by more than one Unicode character sequence.
A trivial example would be the character 'é' (a latin small letter
'e' with an acute accent). It can be encoded as U+00E9 or as the
sequence U+0065 U+0301. Unicode defines a concept called 'canonical
equivalence' in which two strings may be said to be equal
semantically even if they do not use the same Unicode code points
(characters) in the same order to encode the text. It also defines
a canonical decomposition and several normalization forms so that
software can transform any canonically equivalent strings into the
same code point sequences for comparison and processing purposes.
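To make this concrete, Python's standard unicodedata module can demonstrate the equivalence (a minimal sketch using the 'é' example above):

```python
import unicodedata

precomposed = "\u00e9"   # 'é' as a single code point, U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The raw code point sequences differ...
assert precomposed != decomposed

# ...but both normalize to the same NFC form (the precomposed character)
assert unicodedata.normalize("NFC", decomposed) == precomposed

# ...and to the same NFD form (the decomposed sequence)
assert unicodedata.normalize("NFD", precomposed) == decomposed
```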
Unicode also defines an additional level of decomposition, called
'compatibility decomposition', for which additional normalization
forms exist. Many compatibility characters, unsurprisingly, have
compatibility decompositions. However, characters that share a
compatibility decomposition are not considered canonically
equivalent. An example of this is the character U+2460 (CIRCLED
DIGIT ONE). It has a compatibility decomposition to U+0031 (the
digit '1'), but it is not considered canonically equivalent to the
number '1' in a string of text.
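A short sketch of the distinction, again using Python's unicodedata: the circled digit survives the canonical forms (NFC/NFD) unchanged, and only the compatibility forms (NFKC/NFKD) map it to the plain digit:

```python
import unicodedata

circled_one = "\u2460"  # CIRCLED DIGIT ONE

# No canonical decomposition: the canonical forms leave it unchanged
assert unicodedata.normalize("NFC", circled_one) == circled_one
assert unicodedata.normalize("NFD", circled_one) == circled_one

# The compatibility forms map it to the plain digit '1'
assert unicodedata.normalize("NFKC", circled_one) == "1"
assert unicodedata.normalize("NFKD", circled_one) == "1"
```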
For the purposes of this document, we are discussing only canonical
equivalence and canonical decompositions.
Unicode has two particular rules about the handling of canonical
equivalence that concern us here: C6 and C7. C6 says that
implementations shall not assume that two canonically equivalent
sequences are distinct [there exist reasons why one might not
conform to C6]. C7 says that *any* process may replace a character
sequence with its canonically equivalent sequence.
Normalization @ W3C:
Which brings us to XML, HTML, CSS, and so forth. I will use XML as
the main example here, since it is the REC that most directly
addresses these issues. The other standards mainly ignore the issue
(they don't require normalization, but they mostly don't address it
either).
XML, like many other W3C RECs, uses the Universal Character Set
(Unicode) as one of its basic foundations. An XML document, in
fact, is a sequence of Unicode code points, even though the actual
bits and bytes of a serialized XML file might use some other
character encoding. Processing is always specified in terms of
Unicode code points (logical characters). Like other W3C RECs, XML
says nothing about canonical equivalence in Unicode. There exist
recommendations to avoid compatibility characters (which includes
some canonical equivalences) and 'Name' prevents starting the
various named tokens with certain characters which include
combining marks.
Most implementations of XML assume that distinct code point
sequences are actually distinct, which is not in keeping with
Unicode C6 [1]. That is, if I define one element <!ELEMENT é
EMPTY> (using U+00E9) and another element <!ELEMENT é EMPTY>
(using U+0065 U+0301), they are
usually considered to be separate elements, even though both define
an element that looks like <é/> in a document and even though any
text process is allowed to convert one sequence into the other--
according to Unicode. For that matter, a transcoder might produce
either of those sequences when converting a file from a non-Unicode
legacy encoding.
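A minimal illustration of that naive behavior: a plain code-point comparison, which is effectively what most XML processors perform, treats the two visually identical names as distinct:

```python
# Two visually identical element names with different code point sequences
name_precomposed = "\u00e9"   # <é/> encoded with U+00E9
name_decomposed = "e\u0301"   # <é/> encoded with U+0065 U+0301

# A naive code-point comparison sees two distinct element names:
assert name_precomposed != name_decomposed

# They do not even have the same length in code points:
assert (len(name_precomposed), len(name_decomposed)) == (1, 2)
```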
One might think that this would be a serious problem. However, most
software systems consistently use a single Unicode representation
to represent most languages/scripts, even though multiple
representations are theoretically possible in Unicode. This form is
typically very similar to Unicode Normalization Form C (or "NFC"),
in which as many combining marks as possible are combined with base
characters to form a single code point (NFC also specifies the
order in which combining marks that cannot be combined appear; no
Unicode normalization form guarantees that *no* combining marks
will appear, as some languages cannot be encoded at all except via
the use of combining characters). As a result, few users encounter
issues with Unicode canonical equivalence.
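For instance, when two combining marks cannot combine with the base character, normalization still imposes a canonical order on them. A small sketch in Python, using dot-above and dot-below on 'q' (a standard Unicode illustration):

```python
import unicodedata

# 'q' with U+0307 DOT ABOVE then U+0323 DOT BELOW, and the reverse order
s1 = "q\u0307\u0323"
s2 = "q\u0323\u0307"

# Different code point order, so a raw comparison fails:
assert s1 != s2

# NFC (and NFD) reorder the marks by their canonical combining classes,
# so both strings normalize to the same sequence:
assert unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
```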
However, some languages and their writing systems have features
that expose or rely on canonical equivalence. For example, some
languages make use of combining marks and the order of the
combining marks can vary. Other languages use multiple accent marks
and their input systems may pre-compose or not compose characters
depending on the keyboard layout, operating system, fonts, or the
software used to edit text. Vietnamese is an example of this. Since
canonically equivalent text is (supposed to be) visually
indistinguishable, users typically don't care that (for example)
their Mac uses a different code point sequence than their
neighbor's Windows computer. These languages are sensitive to
canonical equivalence and rely on consistent normalization in order
to be used with a technology such as XML. Further, many of the
Ur-technologies are now used in combination. For example, a site might
use XML for data interchange, XSLT to extract the data into an HTML
page for presentation, CSS to style that page, and AJAX for the
user to interact with the page.
With this potential problem in mind, eleven (!) years ago the I18N
WG started to work on a Specification to address the use of Unicode
in W3C specs. This work is collectively called the "Character Model
for the World Wide Web" or 'CharMod' for short [2]. Initially, the
WG approach was to recommend what was termed "early uniform
normalization" (EUN). In EUN, virtually all content and markup was
supposed to be in a single normalization form, specifically NFC.
The WG identified the need for both document formats (e.g. a file
called 'something.xml') to be in NFC as well as the parsed contents
of the document (individual elements, attributes, or content within
the document). This was called "fully normalized" content.
For this recommendation to work, tools, keyboards, operating
environments, text editors, and so on would need to provide for
normalizing data either on input (as with most European languages)
or when processing, saving, or interacting with the resulting data.
Specifications were expected to require normalization whenever
possible. In cases where normalization wasn't required at the
document format level, de-normalized documents would be 'valid',
but could cause warnings to be issued in a validator, or
de-normalized content could be normalized by tools at creation time.
The benefit to this approach was that specifications and certain
classes of implementation could mostly assume that users mostly had
avoided the problems with canonical equivalence by always authoring
documents in the same way. Using this approach, there would be less
need, for example, for CSS Selectors to consider normalization,
since both the style sheet (or other source of the selector) and
the document tree being matched would use the same code point
sequence for the same canonically equivalent text. The user-agent
could just compare strings. In the few cases where this wasn't the
case, the user would be responsible for fixing it, but generally
the user was responsible for carefully having constructed the issue
in the first place (since their tools and formats would normally
have used NFC).
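As a sketch of what "just comparing strings" would have to become without EUN, a hypothetical selector-matching helper (the function name and API are illustrative, not taken from any spec) could normalize both sides before comparing:

```python
import unicodedata

def selector_matches(selector_name: str, element_name: str) -> bool:
    """Hypothetical string-identity match that honors canonical
    equivalence by normalizing both sides to NFC before comparing."""
    return (unicodedata.normalize("NFC", selector_name)
            == unicodedata.normalize("NFC", element_name))

# A style sheet authored on one system and a document authored on
# another may use different but canonically equivalent sequences:
assert selector_matches("\u00e9", "e\u0301")
assert not selector_matches("a", "b")
```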
I18N WG worked on this approach for a number of years while
languages with normalization risks began to develop appreciable
computing support and a nascent Web presence. CharMod contained
strong recommendations towards normalization, but not requirements
(with a notable exception that we'll cover in a moment).
Since it didn't matter so long as content remained normalized and
since the most common languages were normalized by default,
specifications generally didn't require any level of normalization
(although they "should" do so), implementations generally ignored
normalization, tools did not implement it, and so forth.
There was one interesting exception to the recommendations in
CharMod. String identity matching *required* (MUST) the use of
normalization (requirement C312). This nod to canonical equivalence
was also ignored by most specs, implementations, and thus content.
It should be noted that CSS Selectors is a string identity matching
case and not merely one of the "SHOULD" cases.
From being a mostly theoretical problem, normalization has become
something that can be demonstrated in real usage scenarios in real
languages. While only a quite small percentage of total content is
affected, it quite directly impacts specific languages [3]. The
I18N WG is engaged in finding out exactly how prevalent this
problem is. It is possible that, despite having become a real
problem, it is still so strictly limited that it can be dealt with
best via other means than spec-level requirements.
In early 2005, the I18N WG decided that EUN as an approach was not
tenable because so many different system components,
technologies, and so forth would be affected by a "requirement" to
normalize; that some technologies (such as keyboarding systems)
were not up to the task; and that, as a result, content would not
be normalized uniformly. The decision was made to change from
focusing on EUN towards a policy something like:
1. Recommend the use of normalized content ("EUN") as a "best
practice" for content authors and implementers.
2. Since content might not be normalized, require specifications
affected by normalization to address normalization explicitly.
Surprisingly, none of the requirements are actually changed by this
difference in focus. Note that this did not mean that normalization
would be required universally; it only meant that WGs would be
asked to consider the impact or, in some cases, to change their
specification.
In 2008, at the TPAC, I18N WG reviewed the current WD with an eye
towards finally completing the normalization portion of the work (a
separate working group in the Internationalization Activity had
been chartered between 2005 and 2008 to do the work; this working
group expired with no progress and "I18N Core" inherited the
unfinished work). I18N's review revealed that the current document
state was not sufficient for advising spec, content, or
implementation authors about when and how to handle the new
"late(r)" normalization. The same review produced general
acknowledgement that there now existed significant need based on
real content for normalization to be handled by W3C Specs.
At the very end of 2008, I18N WG also reviewed the Selectors-API
draft produced by WebApps. In reviewing this document, the WG noted
that Selectors, upon which the API is based, did not address
normalization. Other recent REC-track documents had also been
advised about normalization and had ended up requiring the use of
NFC internally. However, in the case of Selectors-API, the
selectors in question were in CSS3 and were in a late working draft
state. CSS WG responded to this issue and a long thread has
developed on our combined mail lists, in a wiki, and elsewhere.
Over the past two-plus weeks, the I18N WG has solicited advice and
comments from within its own community, from Unicode, and from the
various CSS (style), XML, and HTML communities. We have embarked on
a full-scale review of what position makes the most sense for the
W3C to hold. In our most recent conference call (11 February), we
asked members and our Unicode liaison to gather information on the
overall scope of the problem on the Web today. We also are
gathering information on the impact of different kinds of
normalization recommendation. We had expected to complete our
review at the 11 February concall, but feel we need an additional
week.
There are a few points of emerging consensus within I18N. In
particular, if normalization is required, such a requirement
probably could be limited to identifier tokens, markup, and other
formal parts of document formats. Content itself should not
generally be required to be normalized (a recommendation should
certainly be made and normalization, of course, is always permitted
by users or some process--see Unicode C7), in part because there
exist use cases for de-normalized content.
The other emerging consensus is that canonical equivalence needs to
be dealt with once and for all. WGs should not have the CharMod
sword hanging over them and implementers and content authors should
get clear guidance. During this review, I18N is considering all
possible positions, from "merely" making normalization a best
practice to advocating the retrofitting of normalization to our
core standards (as appropriate, see above).
One of the oft-cited reasons why normalization should not be
introduced is implementation performance. Assuming, for a moment,
that documents are allowed to be canonicalized for processing
purposes, our experience suggests that overall performance impact
can be limited. There exist strategies for checking and normalizing
data that are very efficient, in part owing to the relative rarity
of denormalized data, even in the affected languages. This document
will not attempt to outline the performance cases for or against
normalization, except to note that performance *is* an important
consideration and *must* be addressed.
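One such strategy is a quick check that skips normalization entirely for already-normalized input. A hypothetical sketch (the helper name is illustrative; unicodedata.is_normalized is available from Python 3.8):

```python
import unicodedata

def to_nfc(s: str) -> str:
    """Normalize only when needed: the quick check avoids allocating
    a new string in the (overwhelmingly common) already-NFC case."""
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)

assert to_nfc("caf\u00e9") == "caf\u00e9"   # already NFC: returned as-is
assert to_nfc("cafe\u0301") == "caf\u00e9"  # denormalized: converted
```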
I hope this summary is helpful in discussing normalization. I want
to raise this issue now so that all of the affected parts of the
W3C community can consider this issue and how it affects their
specifications/implementations/tests/tools/etc. As a consensus
(hopefully) emerges (not just within I18N), we should be in a
position to finally resolve the normalization conundrum and
proceed to create a global Web that works well for all the world's
users.
Kind Regards,
Addison (for I18N)
[1] http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf Unicode Conformance
[2] http://www.w3.org/TR/charmod-norm/ CharMod-Normalization part
[3] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0128.html Ishida example
[a] http://lists.w3.org/Archives/Public/public-i18n-core/2009JanMar/0182.html and another list
Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG
Internationalization is not a feature.
It is an architecture.