What Unicode Is (was RE: Inappropriate Proposals FAQ)

2002-07-12 Thread Kenneth Whistler

Suzanne responded:

  Maybe Unicode is more of a shared set of rules that apply to 
  low level data structures surrounding text and its algorithms 
  than a character set.
 
 Sounds like the start of a philosophical debate. 
 
 If Unicode is described as a set of rules, we'll be in a world of hurt.

 (On a serious note, these exceptions are exactly what make writing some
 sort of "is and isn't" FAQ pretty darned hard.)

Hmm. Since the discussion which started out trying to specify a
few examples of what kinds of entities would be inappropriate to
proffer for encoding as Unicode characters seems to be in danger
of mutating into the recurrent "What is Unicode?" question,
perhaps it's time to start a new thread for the latter.

And now for some ontological ground rules.

When trying to decide what a thing is, it helps not to use
an attribute nominatively, since that encourages people to
privately visualize the noun the attribute is applied to,
but to do so in different ways -- and then to argue past each
other because they are, in the end, talking about different
things.

"Unicode" is used attributively of a number of things, and
if we are going to start arguing/discussing what "it" is, it
would be better to lay out the alternative "it"s a little
more specifically first.

1. The Unicode *Consortium* is a standardization organization.
It started out with a charter to produce a single standard,
but along the way has expanded that charter, in response to
the desire of its membership. In addition to The Unicode
Standard, it now has adopted a terminology that refers to
some of its other publications as Unicode Technical Standards
[UTS], of which two formally exist now: UTS #6 SCSU, and
UTS #10 Unicode Collation Algorithm [UCA].

It is important to keep this straight, because some people,
when they say "Unicode", are talking about the *organization*,
rather than the Unicode Standard per se. And when people talk
about the standard, they are generally referring to The
Unicode Standard, but the Unicode Consortium is actually
responsible for several standards.

2. The Unicode *Standard* itself is a very complex standard, consisting
of many pieces now. To keep track of just what something like
"The Unicode Standard, Version 3.2" means, we now have to
keep web pages enumerating all the parts exactly -- like
components in an assemble-your-own-furniture kit. See:
http://www.unicode.org/unicode/standard/versions/

In any one particular version, the Unicode Standard now consists
of a book publication, some number of web publications
(referred to as Unicode Standard Annexes [UAX]), and a
large number of contributory data files -- some normative and
some informative, some data and some documentation. These
definitions, including the exact list of contributory
data files and their versions, are themselves under tight
control by the Unicode Technical Committee, as they constitute
the very *definition* of the Unicode Standard. It is not
by accident that the version definitions start off now with
the following wording:

  "The Unicode Standard, Version 3.2.0 is defined by the following
  list..."

and so on for earlier versions.

3. The Unicode *Book* is a periodic publication, constituting the
central document for any given version of the Unicode *Standard*,
but is by no means the entire standard. The book, in turn,
is very complex, consisting of many chapters and parts, some
of which constitute tightly controlled, normative specification,
and some of which are informative, editorial content.

The book now also exists in an online version (pdf files):
http://www.unicode.org/unicode/uni2book/u2.html
which is *almost* identical to the published hardcover book,
but not quite. (The Introduction is slightly restructured,
the online glossary is restructured and has been added to,
the charts are constructed slightly differently and have
introductory pages of their own, etc.)

4. The Unicode *CCS* [coded character set] is the mapping of the
set of abstract characters contained in the Unicode repertoire
(at any given version) to a bunch of code points in the
Unicode codespace (0x0000..0x10FFFF). Technically speaking, it
is the Unicode *CCS* which is synchronized closely with
ISO/IEC 10646, rather than the Unicode *Standard*. 10646 and
the Unicode CCS have exactly the same coded characters (at
various key synchronization points in their joint publication
histories), but the *text* of the ISO/IEC 10646 standard doesn't
look anything like the *text* of the Unicode Standard, and the
Unicode Standard [sensum #2 above] contains all kinds of
material, both textual and data, that goes far beyond the scope
of 10646. 
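
As a rough illustration of the coded character set idea (an abstract
character paired with a code point inside a bounded codespace), here is
a minimal Python sketch. It is not part of the original message: the
helper name describe() and the two constants are hypothetical, and the
character names come from whichever version of the Unicode data files
the local Python build ships with, not from any particular version
discussed here.

```python
import unicodedata

# Bounds of the Unicode codespace (illustrative constants, not an official API).
CODESPACE_MIN = 0x0000
CODESPACE_MAX = 0x10FFFF


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a code point in the codespace.

    Code points with no character name in the local data files
    (unassigned, surrogate, control, ...) get a placeholder instead.
    """
    if not CODESPACE_MIN <= code_point <= CODESPACE_MAX:
        raise ValueError(f"{code_point:#x} is outside the Unicode codespace")
    name = unicodedata.name(chr(code_point), "<no character name>")
    return f"U+{code_point:04X} {name}"


print(describe(0x0041))   # U+0041 LATIN CAPITAL LETTER A
print(describe(0x03A9))   # U+03A9 GREEK CAPITAL LETTER OMEGA
print(CODESPACE_MAX - CODESPACE_MIN + 1)   # 1114112 code points in total
```

The point of the sketch is only that the CCS is the pairing of
character and code point; everything else the Unicode Standard adds
(properties, algorithms, annexes) sits on top of that pairing.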

There are other standards produced by some national
bodies that are effectively just translations of 10646
(GB 13000 in China, JIS X 0221 in Japan), but the Unicode Standard
is nothing like those.

Finally, the attribute Unicode ... can be applied to all
kinds of other things characteristic of the Unicode Standard,
including algorithms for the 

Re: What Unicode Is (was RE: Inappropriate Proposals FAQ)

2002-07-12 Thread Barry Caplan

At 03:54 PM 7/12/2002 -0700, Kenneth Whistler wrote:
Suzanne responded:

  Maybe Unicode is more of a shared set of rules that apply to 
  low level data structures surrounding text and its algorithms 
  than a character set.

O.k., so now before asserting or denying that "Unicode ..." is
a shared set of rules, it would be helpful to pin down
first what you are referring to. That might make the ensuing
debate more fruitful.

Actually, it was me, not Suzanne, who called Unicode a shared set of rules. As 
Ferris Bueller once said, "I'll take the heat for this."

I was aware of all of the uses of "Unicode" that you listed. I have no quarrel with any 
of them. They do point to the fact that the word is overloaded with definitions, which 
means that readers have to choose the appropriate one from the context. The context of 
the statement above is that the Unicode referred to is the Standard, and all 
associated documentation. Not Unicode the Consortium, which manages the Standard. Not 
Unicode the way of life :)

I did intend to throw open a debate about the long-term future of Unicode the Standard 
and, by extension, Unicode the Consortium. Since Suzanne is writing a "What is Unicode and 
is not Unicode" FAQ, I think the answer to that is going to be very definitely colored 
by the answer to the related question "What will Unicode become?", e.g. Unicode 6.0, 
7.0, 8.0, etc. 

See my previous msg, subject line: "Hmm, this evolved into an editorial when I wasn't 
looking :)" for some thoughts on that subject.


Barry Caplan
www.i18n.com