Re: Small Latin Letter m with Macron
Christoph Päper asked:

> I recently learned in news:de.etc.sprache.deutsch that there has been a tradition (in handwritten text more than in print) of writing "mm" as only one m with a macron above. I can't find any such character in Unicode, just U+1E3F and U+1E41. You could of course build something similar with m + U+0305 to resemble the look, but that won't become "mm" (just "m" or "m¯") after a conversion to e.g. ISO 8859-1. Should such a character be added to Unicode (or did I miss it)?

Neither. Handwritten forms and arbitrary manuscript abbreviations should not be encoded as characters. The text should just be represented as m + m. Then, if you wish to *render* such text in a font which mimics this style of handwriting and uses such abbreviations, you would need the font to ligate "mm" sequences into a *glyph* showing an m with an overbar.

To do otherwise, representing the plain text content either as m + combining macron or with a newly encoded m-macron character, would just distort the *content* of the text, which is what the character encoding should be about.

If and only if an m-macron became a part of the accepted, general orthography of German would it make sense to start representing textual content in terms of such a character. And in that hypothetical future, you would use m + combining macron, because it already exists in Unicode, and there is no point in encoding another canonically equivalent precomposed character for that sequence.

--Ken
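Ken's last point, that a sequence like m + combining macron would stay decomposed, can be checked with a short Python sketch using the standard unicodedata module (the m-macron example here is illustrative only):

```python
import unicodedata

# m followed by U+0304 COMBINING MACRON, the closest encoded analogue
# of the handwritten overbar (U+0305 COMBINING OVERLINE behaves the same)
s = "m\u0304"

# There is no precomposed "m with macron" in Unicode, so canonical
# composition (NFC) leaves the sequence decomposed, two code points long.
composed = unicodedata.normalize("NFC", s)
assert composed == s and len(composed) == 2

# By contrast, U+1E3F LATIN SMALL LETTER M WITH ACUTE does exist,
# so m + U+0301 COMBINING ACUTE ACCENT composes to a single character.
assert unicodedata.normalize("NFC", "m\u0301") == "\u1e3f"
```

This is exactly why a newly encoded precomposed m-macron would buy nothing: normalization already defines the decomposed sequence as the canonical representation.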
Re: U+2047 double question mark collation
Vadim wrote:

> I have a problem with creating a collation key for U+2047 (double question mark). Explicit collation keys for this symbol are absent in allkeys.txt.

allkeys.txt in the current version of the Unicode Collation Algorithm is based on the Unicode *3.1* repertoire. This can be seen in the references section of UTS #10, where the version is explicitly listed as allkeys-3.1.1.txt. U+2047 is a character added in Unicode Version *3.2*.

> In UnicodeData.txt this symbol has a compatibility decomposition mapping: 2047;...;<compat> 003F 003F;...

True.

> Based on this, and as defined in UTS #10, Unicode Collation Algorithm, this symbol must have these collation keys: 003F [*024E.0020.0004] 003F [*024E.0020.0004]. But CollationTest_NON_IGNORABLE.txt assumes that the symbol has the implicit collation key [FBC0.0020.0002] [A047..].

CollationTest_NON_IGNORABLE.txt is also based on the Unicode 3.1 repertoire. For a Unicode 3.1 implementation of collation, U+2047 is a reserved code point.

This situation, where the allkeys.txt table is slightly out of synch with (behind) the ongoing repertoire additions to the Unicode Standard, is a known problem we are working on. The Unicode Technical Committee has mandated that the repertoire for the allkeys.txt table be updated directly to the Unicode 4.0 repertoire, as soon after the release of Unicode 4.0 as possible. We are trying to do this more or less simultaneously this time, but there may be a small delay, given the scope of the upcoming Unicode 4.0 release.

In the meantime, if you need to deal with the Unicode 3.2 character additions for collation, then you need to handle them in terms of tailorings from the current allkeys.txt table.

--Ken
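The compatibility decomposition that drives this collation behavior is easy to verify; a minimal Python check using the standard unicodedata module:

```python
import unicodedata

# U+2047 DOUBLE QUESTION MARK carries a compatibility (not canonical)
# decomposition to two QUESTION MARKs, as cited from UnicodeData.txt.
assert unicodedata.decomposition("\u2047") == "<compat> 003F 003F"

# NFKD therefore turns it into "??", which is why, per UTS #10,
# its derived collation elements are those of two question marks.
assert unicodedata.normalize("NFKD", "\u2047") == "??"

# The canonical decomposition (NFD) leaves it alone.
assert unicodedata.normalize("NFD", "\u2047") == "\u2047"
```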
RE: h in Greek epigraphy
A correspondent noted:

> BTW, the introductory sentence on page 360 of TUS 3 seems strange. It says that IPA includes "basic Latin letters and a number of Latin letters from other blocks" and then puts four Greek letters in the list! Should this be changed to something like "IPA includes basic Latin letters and a number of other Latin and Greek letters"?

Noted for fix by the editors.

--Ken
Re: h in Greek epigraphy
David wrote:

> My first answer to my correspondent was "just use Roman h".

That would be my suggestion, too. It is available now, it matches current practice, and it requires no further action.

> A program that was sorting text, or trying to determine what script a word was written in, would get confused by hε̄γεμο̄ν.

As for sorting: if you are sorting epigraphical Greek, you likely need customized tables anyway. Just add h and treat it appropriately.

As for determination of script, you need to ask yourself, for what purpose? If this is something like regular expression matching, then again, it doesn't matter so much. You would just attempt to match against strings containing letters of the Greek script plus h, and you'd get what you expect.

> Would this justify a proposal for Greek small letter epigraphical h?

I don't think so. Not unless you can demonstrate that this really is a distinct character, as opposed to a special usage of the already existing Latin h, which is what it seems to be.

--Ken
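Ken's matching point can be sketched in Python: treat the target set as "letters of the Greek script, plus Latin h". The helper below is a hypothetical illustration, not a standard API; it keys off character names from the standard unicodedata module rather than a real script property.

```python
import unicodedata

def is_epigraphic_greek_word(word: str) -> bool:
    """True if every character is Greek, a combining mark, or Latin 'h'."""
    for ch in word:
        if ch == "h":            # the borrowed Latin letter
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK") or name.startswith("COMBINING"):
            continue
        return False
    return True

# h + epsilon + macron + gamma + epsilon + mu + omicron + macron + nu
assert is_epigraphic_greek_word("h\u03b5\u0304\u03b3\u03b5\u03bc\u03bf\u0304\u03bd")
assert not is_epigraphic_greek_word("hegemon")   # plain Latin word
```

The point stands: once the matcher explicitly admits Latin h alongside Greek letters, epigraphical text poses no special problem.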
RE: Precomposed Tibetan
Peter Lofting asked:

> Presumably the present proposal of 900+ stacks is a maturation of the same system. And the claim for universality is based on it being able to typeset everything they have published to date.

It is based on the Founders system software, as Michael mentioned.

> The question is whether that list of texts is representative of the full literary and linguistic corpus

It is not.

> or is only a sub-set?

It is. The Chinese delegation admitted that the collection of stacks was aimed at modern Tibetan use and would not cover literary Tibetan. This means that in practice systems based on the current Founders system technology would be restricted in their coverage, and that Unicode-based systems would have to deal with *both* the precomposed stacks and the rest of Tibetan, leading to Hangul-like normalization nightmares.

> Could the Chinese be asked to provide detailed information on this system and the texts that it has published so we can get an idea of the domain that their stack set covers?

They were asked some questions during the meeting. The correct way to proceed now is to provide national body feedback on their proposal. Such feedback can, of course, contain such questions regarding the intended scope of coverage of the repertoire in the proposal.

--Ken
RE: Precomposed Tibetan
Marco commented:

> Another key point, IMHO, is verifying the following claim contained in the proposal document:
>
> > Tibetan BrdaRten characters are structure-stable characters widely used in education, publication, classics documentation including Tibetan medicine. The electronic data containing BrdaRten characters are estimated beyond billions. Once the Tibetan BrdaRten characters are encoded ^ in BMP, many current systems supporting ISO/IEC 10646 will enable Tibetan processing without major modification. Therefore, the international standard ^^ Tibetan BrdaRten characters will speed up the standardization and digitalization of Tibetan information, keep the consistency of implementation level of Tibetan and other scripts, develop the Tibetan culture and make the Tibetan culture resources shared by the world.
>
> [BTW, billions of what!?]

The Chinese delegation at the WG2 meeting agreed with a restatement of this as "gigabytes of data". Exactly what kind of data, they did not say, but in principle that could consist of a few medium-size databases. It almost certainly does not consist of billions of *documents*.

> I'd propose the following:
>
> 1. Find all the available technical details about this BrdaRten encoding.

One additional detail for people: the BrdaRten stacks are currently implemented, in the Founders System software in Tibet, as an extension to GB 2312.

> 2. Come up with a precise machine-readable mapping file between the BrdaRten encoding and *decomposed* Unicode Tibetan, possibly accompanied by a sample conversion application. Reasons: (a) to make it easy to migrate BrdaRten legacy data to Unicode; (b) to easily update existing BrdaRten applications to export Unicode text; (c) to easily retrofit new Unicode applications to import BrdaRten text.

See the key words "without major modification" above. If the BrdaRten stacks were encoded in Unicode, they would automatically become part of GB 18030 (because of the UTF-like nature of that strange standard).
However, the catch is that the actual code points for Unicode/10646 are not predictable or controllable by the Chinese NB. That means that the final code points in GB 18030 are also not predictable, and almost certainly are not the same as those used by the current GB 2312 extension in Tibet. And *that* means that the current "characters ... estimated beyond billions" will have to be migrated to a new encoding anyway, once the systems are updated to GB 18030. That is the reason for the quibble word "major" in the phrase above. All the data will be reencoded, but the transition from GB 2312 + Tibetan extension to GB 18030 containing the Tibetan extension is viewed as just a mapping and not a "major" system modification.

The alternative (and even scarier) prospect is that the existing GB 2312 Tibetan extension code points would be forced as is into a new version of GB 18030, invalidating the mapping for the existing code points, and creating a completely new version of GB 18030 that would have to be supported as a different code page from the existing GB 18030. This would start us down the road to an indefinite number of distinct GB 18030 mappings, probably not properly labeled in interchange, with large numbers of interoperability problems predictable (and likely to dwarf the JIS yen sign/backslash kinds of problems). The reason this prospect is even thinkable is that any existing implementation of the BrdaRten stacks in a GB 2312 extension would surely be using 2-byte character encodings, and a transition to 4-byte GB 18030 character encodings would likely disrupt the existing implementations significantly.
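The UTF-like nature of GB 18030 is easy to observe: every Unicode code point has a defined GB 18030 byte sequence, and BMP characters outside the legacy GBK repertoire (Tibetan included) map to four-byte sequences. A quick check using Python's bundled gb18030 codec:

```python
# U+0F40 TIBETAN LETTER KA is outside the legacy GBK/GB 2312 two-byte
# repertoire, so GB 18030 encodes it as a four-byte sequence.
ka = "\u0f40"
encoded = ka.encode("gb18030")
assert len(encoded) == 4

# ASCII and GBK-era characters keep their short legacy encodings.
assert len("A".encode("gb18030")) == 1

# The mapping round-trips, which is what makes GB 18030 UTF-like.
assert encoded.decode("gb18030") == ka
```

Note this shows the *standard* GB 18030 mapping for decomposed Tibetan; the point of the passage above is that code points for any newly encoded precomposed stacks would differ from the private GB 2312 extension now deployed.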
The question for Unicoders is whether the introduction of significant normalization problems into Tibetan (for everyone) is a worthwhile tradeoff for this claimed legacy ease of transition for one system, when it is clear that all existing legacy data using these precomposed stacks is going to have to be reencoded anyway (or surrounded by migration filters for new systems).

--Ken
Re: Localized names of character ranges
Doug, seconding a suggestion by Marco, wrote:

> I agree that a multilingual Unicode glossary should be assembled (possibly as a volunteer project) and officially endorsed by the Unicode Consortium, so users and vendors will be on common terminological ground.

In general, I favor such an activity, although at the moment it would have to be something done by outside volunteers, as the UTC editorial committee doesn't have the bandwidth now (in the crunch for Unicode 4.0) to undertake more open-ended responsibilities.

My caution, however, is that the terminology used by the Unicode Standard is still evolving, as witness the ongoing arguments about some of the terminology related to the character encoding model. The glossary in Unicode 4.0 will be substantially revised in some of the key points having a bearing on the Unicode encoding model. And as more content is added to the standard, additional terms keep accumulating in the glossary as well. And it will be some time before the online glossary can be completely synched back up with the Unicode 4.0 glossary.

Once people start maintaining a multilingual glossary based on the online glossary (or supplemented from other sources), the burden of maintenance will escalate rapidly for any change introduced to terminology. These things only work if there is an ongoing institutional commitment to maintenance and updates. Otherwise all the translated versions start to get out of synch quickly, both with the English original and with each other. This can lead to dangerous misunderstandings among people who assume that their own translated version is accurate.

So if anyone wants to undertake such an effort, don't forget to provide for ongoing maintenance, and for the fact that eager volunteers tend to drop like flies when repeatedly forced to update their work at irregular intervals.

--Ken
Re: Default properties for PUA characters???
Christian Wittern asked:

> Leaving aside the red light that flashed in my head on the notion of the W3C recommending PUA (for interchange?), I was wondering about the notion of PUA characters being by Unicode defaults treated as ideographs. Is there a canonical reference for this? Just wondering,

Many Unicode character properties are actually code point properties. They must partition the entire Unicode codespace, so that an API can return a meaningful value for any code point, including PUA and unassigned code points, not just for assigned characters. Because of this, the Unicode Standard now has a concept of a default property value, which applies to code points which are not otherwise given an explicit value for that property.

In the case of PUA characters, the Unicode Character Database gives them all the same properties. Some of the most important of those properties are:

  gc=Co   (general category = Private_Use)
  ccc=0   (combining class = 0, i.e. Not_Reordered)
  bc=L    (bidi class = strong Left_To_Right)
  sc=Zyyy (script = Common)
  lb=XX   (line break = Unknown)
  ea=A    (east asian width = Ambiguous)

For ideographs, which also all have the same properties, the relevant, corresponding properties are:

  gc=Lo   (general category = Other_Letter)
  ccc=0   (combining class = 0, i.e. Not_Reordered)
  bc=L    (bidi class = strong Left_To_Right)
  sc=Hani (script = Han)
  lb=ID   (line break = Ideographic)
  ea=W    (east asian width = Wide)

Thus, while in some respects the PUA characters are, by default, like ideographs (they are all base characters and are treated as left-to-right for bidi purposes), in other respects their properties differ. In particular, with respect to line-breaking, UAX #14 currently states for lb=XX:

> The default behavior for [XX] is identical to class AL. [i.e. alphabetic characters] ... In addition, implementations can override or tailor this default behavior, e.g. by assigning characters the property ID or another class, if that is likely to give the correct default behavior for their users, or use other means to determine the correct behavior. For example, one implementation might treat any private use character in ideographic context as ID, while another implementation might support a method for assigning specific properties to specific definitions of private use characters. The details of such use of private use characters are outside the scope of this standard.

So I'd say that the XML Core WG has got the situation only partially correct for Unicode PUA characters.

--Ken
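Several of the default property values listed above are queryable from Python's standard unicodedata module (script and line-break class are not exposed there, so this sketch covers only general category, combining class, bidi class, and East Asian width):

```python
import unicodedata

pua = "\ue000"   # a Private Use Area code point
han = "\u4e00"   # CJK UNIFIED IDEOGRAPH-4E00

# General category: Private_Use (Co) vs Other_Letter (Lo)
assert unicodedata.category(pua) == "Co"
assert unicodedata.category(han) == "Lo"

# Both have combining class 0 (Not_Reordered) and strong-L bidi class
assert unicodedata.combining(pua) == 0 == unicodedata.combining(han)
assert unicodedata.bidirectional(pua) == "L" == unicodedata.bidirectional(han)

# East Asian width differs: Ambiguous vs Wide
assert unicodedata.east_asian_width(pua) == "A"
assert unicodedata.east_asian_width(han) == "W"
```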
Re: mixed-script writing systems
Dean Snyder asked, regarding my earlier statement:

> > What it comes down to is the fact that for historic scripts in particular, there are no defined criteria that would enable us to simply *discover* the right answer regarding the identity of scripts. To a certain extent, the encoding committees need to make arbitrary partitions of historic alphabets through time and space, reflecting scholarly praxis as far as feasible, and then live with the results. At least this procedure makes it *possible* to represent the texts reliably, once the scripts and their variants have been standardized.
>
> What are the criteria used to make these arbitrary partitions?

I have to return to my statement above. There are no defined criteria, at least not in the sense of some formally defined set of criteria which could be objectively applied by graphologists to come up with the right answer. As for many issues, particularly regarding ancient systems, there are a lot of historical contingencies which intervene: what attestations managed to survive, and what kinds of material they consist of. And equally important may be the particular twists and turns that analysis of the materials took. Writing systems which require long, problematical, and in some cases uncertain decipherments may end up with different encoding needs than systems where the nature of the units may not be at issue. And answers may depend on the nature of the historic *successors* of the attestations as well, since boundaries between systems and the nature of the encoding decided upon may then be influenced by the encoding of the successor systems.

> What is determinative of scholarly praxis?

Consensus among the expert practitioners. The character encoding committees make an effort to ensure that there is some evidence of such consensus, when expert opinion is available. Otherwise there would be little point in attempting to standardize character encoding.
In the case of Sumero-Akkadian, it seems to me that there was, for example, some evident consensus among experts that it made sense to specify that as a script for encoding, leaving open the question of where to draw the boundary for early Sumerian on the one hand, and differentiating later adaptations of cuneiform which were clearly not Sumero-Akkadian per se, such as Ugaritic. But if that is *not* the consensus among Assyriologists, then any determination as to where to draw the boundaries would have to await the emergence of such consensus.

> And would not some or all of the examples I give above be governed by such criteria?

I think your examples were seeking formal logical criteria. But my point is that writing systems and scripts are both holistic systems and fuzzy around the edges. The best way to find them is not to seek formal logical criteria, but instead to find *experts* who know them and ask them to point them out. If I am a novice wandering through a new forest, and need to tell the trees in the forest apart (as opposed to the forest from the trees :-) ), it is much easier *and* more accurate to get an expert to tell me, "That's a madrone, that's a bay laurel, that's a coastal live oak, that's a big leaf maple, ..." than it is to ask the expert (or anyone else) to draw up a foolproof set of taxonomic criteria whereby I can deal with all the edge cases (including the hybrids).

--Ken
Re: ISO 10646, Unicode The FAQ (Bengali Khanda Ta)
Rick investigated, and came up with:

> In a specific case, Andy asked about Khanda Ta, and pointed to a WG2 resolution that contradicts the Unicode FAQ on the same topic. I looked up a paper listing an action item as follows, taken from document http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/M40ActionItems.pdf, which contains the action items from meeting #40 of WG2; the decision was from meeting #39 in October 2000:
>
> > Resolution M39.11 (Request from Bangladesh): In response to the request from Bangladesh Standards and Testing Institution in document N2261 for adding KHANDATA character to 10646, WG2 instructs its convener to communicate to the BSTI: a. that the requested character can be encoded in 10646 using the following combining sequence: Bengali TA (U+09A4) + Bengali Virama (U+09CD) + ZWNJ (U+200C) + Following Character(s), to be able to separate the KHANDATA from forming a conjunct with the Following Character(s). Therefore, their proposal is not accepted. b. our understanding that BDS 1520:2000 completely replaces BDS 1520:1997.
>
> That does indeed give a different answer than the Unicode FAQ. I wonder if anyone else knows whether the text of 10646 contains any mention of Khanda Ta, and if so, what it says.

It does not mention Khanda Ta. And I guess it's time to open that old CBS (character BS) mailbag to track this sucker down.

Resolution M39.11 dates from the WG2 discussion of September 20, 2000 (at the WG2 meeting in Vouliagmeni, Greece). It was agenda item 7.12 at that meeting, "Proposal to synchronize Bengali standard with 10646", during which the question came up about what this KHANDATA thing in the Bengali BDS 1520:2000 standard is anyway, and whether it should be encoded as a separate character, as it was (at code point 0xBA) in BDS 1520:2000. For details of the discussion, see the WG2 meeting minutes, online in WG2 N2253.
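The combining sequence recommended in Resolution M39.11 can be written out explicitly; a minimal Python sketch (U+09AE BENGALI LETTER MA is used here just as an arbitrary following character):

```python
import unicodedata

# Khanda Ta per WG2 Resolution M39.11:
#   TA + VIRAMA + ZWNJ, then the following character(s)
khanda_ta = "\u09a4\u09cd\u200c"

assert [unicodedata.name(c) for c in khanda_ta] == [
    "BENGALI LETTER TA",
    "BENGALI SIGN VIRAMA",
    "ZERO WIDTH NON-JOINER",
]

# With a following MA, the ZWNJ blocks the Ta/Ma conjunct, so a
# renderer shows khanda ta followed by ma; without it, the plain
# TA + VIRAMA + MA sequence forms the conjunct.
with_khanda = khanda_ta + "\u09ae"
conjunct = "\u09a4\u09cd\u09ae"
assert with_khanda != conjunct
```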
The upshot of the initial discussion was that Michael Everson was tasked with an action item, to wit: "Michael Everson to contact BSTI (email id, name etc. are in the cover letter) - a query was sent out to Unicode expert's list also."

The response received to the query to the Unicode list on September 20, from a Mr. Abdul Malik, seemed to answer the question of what the KHANDATA was. Anyone who wants to can dig it out of the Unicode email archives: X-UML-Sequence: 16066 (2000-09-20 16:22:21 GMT). But the relevant portions of the email were:

> ----- Original Message -----
> From: Michael Everson [EMAIL PROTECTED]
> To: Unicode List [EMAIL PROTECTED]
> Sent: Wednesday, September 20, 2000 10:30 AM
> Subject: Request about Bengali/Bangla
>
> > BDS 1520:2000 contains a BANGLA LETTER KHANDATA and it has been proposed for addition to the UCS. I am at the WG2 meetings in Athens where the character is being discussed, but we don't know how to evaluate it.
>
> A representative of the Bangladesh Standards and Testing Institution (the instigator of the proposal) should be better placed to answer these questions than me, anyway...
>
> > What is this character and how is it used?
>
> KhandaTa is a form of the letter Ta. It is the form Ta takes when it has no inherent vowel. It occurs when final and medial, but never as the initial letter of a word. It is equivalent to Ta + virama. Ta with a visible virama is only needed for illustrative purposes, KhandaTa being used in its place in all Bengali words, except when it forms a conjunct form. For example, in a standard without KhandaTa, there are two different forms the sequence Ta Virama Ma needs to take, i.e. KhandaTa_Ma or the Ta/Ma_conjunct_form. As BDS 1520:2000 does not include any ligation control characters other than Virama, it is necessary to include KhandaTa as a separate letter to make the two previously mentioned forms.
>
> > Another question is, does BDS 1520:2000 completely replace BDS 1520:1997, or is the old standard still valid (and being implemented)?
> BDS 1520:1997 is based on a font encoding. It is the standard currently used in the products of Proshika Computer Systems and AdarshaBangla Technologies Inc. It is also the encoding used in many web sites. BDS 1520:2000 is a complete replacement, being based on the ISO/IEC 10646 character encoding model. AFAIK it is yet to receive a real-world implementation.
>
> BDS 1520:2000 seems immature, as it does not include any encoding principles or rendering rules; for example, how is Bengali zophola to be formed? Is it formed from Ya or YYa?
>
> > What are the implications for interoperability between this standard and ISCII standards?
>
> As BDS 1520 does not currently have an encoding model to refer to, one can not say. E.g., to form Ka_halant Ka:
>
>   in Unicode: Ka Virama ZWNJ Ka
>   in ISCII:   Ka Virama Virama Ka
>   in BDS:     ??
>
> Regards
> Abdul

It was on the basis of *this* feedback from a Bengali expert on the Unicode list, reported back by Michael Everson to the WG2 meeting, that WG2 drafted a resolution responding to the request by BSTI expressed in
Re: Lowercase numerals
Doug Ewell answered. Thomas Lotze (thomas dot lotze at uni dash jena dot de) had written:

> Why is it that while there are both uppercase and lowercase roman numerals in the Unicode character set (in the Number Forms range), no lowercase arabic numerals (old-style or text figures) are encoded? If they are regarded as presentation forms of the uppercase numerals (in the Basic Latin range), why is this not the case for their roman counterparts?

Doug's answer:

> Because oldstyle numerals aren't really "lowercase" in the same sense as small letters (though some typographers think of them that way; see [1]). They're just glyph variants of the uniform-height "lining" numerals, so yeah, it's a character-glyph thing.

And to complete the answer for Thomas, the Roman numerals are based on Latin letters, which *do* have upper/lowercase distinctions, unlike digits. The compatibility Roman numerals in the Unicode Standard (U+2160..U+217F) are derived from East Asian standards which separately encoded upper- and lowercase forms, so they would have been required to be separately encoded just for compatibility anyway.

--Ken
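The compatibility status and letter-like casing of the encoded Roman numerals are both visible in the Unicode Character Database; a quick Python check via the standard unicodedata module:

```python
import unicodedata

# U+2160 ROMAN NUMERAL ONE and U+2170 SMALL ROMAN NUMERAL ONE are
# compatibility characters that fold to plain Latin letters under NFKC.
assert unicodedata.normalize("NFKC", "\u2160") == "I"
assert unicodedata.normalize("NFKC", "\u2170") == "i"

# Because they are letter-based, they carry a real case mapping...
assert "\u2160".lower() == "\u2170"

# ...whereas ASCII digits have no case at all.
assert "7".lower() == "7" == "7".upper()
```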
Re: mixed-script writing systems
Andrew West wrote:

> On Mon, 18 Nov 2002 02:34:18 -0800 (PST), Kenneth Whistler wrote:
>
> > In point of fact, people for centuries have been borrowing back and forth between Latin, Greek, and Cyrillic in particular, so that in some respects LGC is a kind of metascript and should be treated as such.
>
> Latin, Greek, Cyrillic and Runic even (cf. Latin letters Thorn and Wynn).

Point taken. And don't forget Old Italic, which is now encoded as well.

> Gothic is a good example of a mixed-script writing system,

Not really -- a good example, that is.

> composed of a mixture of Latin, Greek and Runic letters. There is a Gothicness about the graphic forms of the glyphs of the Gothic alphabet, but IMHO this variation from standard (but what is "standard" in 4th century terms?) Latin, Greek and Runic letters should be dealt with at the font level.

It isn't particularly helpful to go there, since Gothic doesn't fit all that well as merely a font variant of Latin or Greek or Runic. Certainly it *could* be done that way, but for this particular case the committees were convinced that simply laying out Gothic as a distinct script was more practical. As it stands now, the Gothic bible can be correctly and unambiguously represented in Unicode, using the Gothic script as defined. Not to have encoded the Gothic script would have left us still arguing about which letters from which script to use and how Gothic fonts should be encoded.

> Nevertheless, Gothic has been encoded in Unicode, and this may provide an unwelcome precedent for encoding other mixed-script writing systems.

What you are getting at is the complicated problem of sorting out all the historical connections between various related alphabets and trying to sift them into categories which make sense as scripts, and categories which are simply font variants within a script. For modern scripts this is less of a problem, since we have modern practice and typography to rely on to help make the distinctions.
For *historic* scripts, on the other hand, it is murkier. Old Italic is a good case in point. It *could* have been treated as another archaic outlier of Greek. The problem with that is that it would have added a few more archaic letters which never show up in modern Greek fonts, and it would have forced distinct archaic fonts to be able to represent Old Italic text reliably. Old Italic texts don't get rendered with a modern Greek font -- it would look ridiculous. Because of this usage pattern, it made sense to the committees to coalesce the various southern Old Italic alphabets (Oscan, Umbrian, Messapian, etc.) into a script which would incorporate all the required letters for those alphabets, as *opposed* to Latin or to Greek per se. It is likely that a similar decision will be taken in the future to account for the Alpine alphabets of northern Italy, which are intermediate between Italic and Runic alphabets.

What it comes down to is the fact that for historic scripts in particular, there are no defined criteria that would enable us to simply *discover* the right answer regarding the identity of scripts. To a certain extent, the encoding committees need to make arbitrary partitions of historic alphabets through time and space, reflecting scholarly praxis as far as feasible, and then live with the results. At least this procedure makes it *possible* to represent the texts reliably, once the scripts and their variants have been standardized.

> What about the now-defunct Zhuang alphabet (used between 1955 and 1981 in PRC) that was composed of a cumbersome mixture of Latin, Cyrillic and IPA letters? Should the letters of this alphabet be encoded separately in a Zhuang block,

Check the standard:

  U+0185 LATIN SMALL LETTER TONE SIX
  U+019C LATIN CAPITAL LETTER TURNED M
  U+01A8 LATIN SMALL LETTER TONE TWO
  etc.

This issue was decided already in 1989.
> or is it simply the fact that the borrowed letters do not exhibit any distinctive Zhuangness in their graphic form that precludes their being encoded separately in the same way that Gothic is? (Or is it perhaps a Eurocentric bias in Unicode?)

It is getting rather tiresome to have "Eurocentric bias" brandished as a disparagement of an encoding standard, 87% of whose content consists of Han or Hangul characters, and whose maintaining committees are busy finalizing the addition of Limbu, Tai Le, Osmanya, Ugaritic Cuneiform, and Linear B. The UTC met just last week, and voted to start the process of adding the Kharoshthi script. Yeah, definitely a Eurocentric bias detectable there in that collection of additions.

--Ken
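The already-encoded status of the Zhuang tone letters cited above can be confirmed directly from the character names in the UCD; a small Python check:

```python
import unicodedata

# The 1955-1981 Zhuang tone letters were encoded in the Latin
# Extended-B block as Latin letters, not in a separate Zhuang block.
assert unicodedata.name("\u0185") == "LATIN SMALL LETTER TONE SIX"
assert unicodedata.name("\u019c") == "LATIN CAPITAL LETTER TURNED M"
assert unicodedata.name("\u01a8") == "LATIN SMALL LETTER TONE TWO"
```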
Re: The result of the Plane 14 tag characters review
James Kass said:

> How do these differences apply to Unicode plain text and the Plane 14 tags? For example, it was noted that the ideographic full stop is centered in Chinese text but sits on the baseline (and isn't centered) in Japanese text.

This claim about ideographic periods is untrue. Chinese typography uses both conventions. Older, traditional typography (but still already Western-adapted in using horizontal layout) uses the centered ideographic full stops (e.g., a 1971 dictionary published in Taipei). Modern typography uses the baseline, left-set ideographic full stops (e.g., a 1997 simplified Chinese dictionary published in Beijing, or a 2002 simplified Chinese newspaper published in Burlingame, California!). It is a matter of typographic style and historic period, *not* of language.

*Really* traditional classical Chinese text doesn't use an ideographic full stop at all. Typical material might be set vertically, with left sidelining serving the highlighting function that bolding or italics would do in Latin text, and with furigana-style punctuation dropped in annotationally on the right side of the vertical lines of text. [Just to make things difficult, *that* Chinese, while still Chinese, is clearly a distinct language from modern (Mandarin) Chinese, as distinct from it as Chaucer's English is from modern (American) English.]

> Without a plain text method of distinguishing the writing system for a run of text, a plain text file wouldn't be able to be correctly displayed if it had both Japanese and Chinese text.

Of course it would. Go to any Japanese newspaper. There is no required change of typographic style when Chinese names and placenames are mentioned in the context of Japanese articles about China. Go to any Chinese newspaper. There is no required change of typographic style when Japanese names and placenames are mentioned in the context of Chinese articles about Japan.
This is completely comparable to the fact that my local English-language newspaper doesn't need a German language tag to write "Gerhard Schroeder".

--Ken
Re: The result of the Plane 14 tag characters review
Michael Everson asked:

> At 13:37 -0800 2002-11-18, Kenneth Whistler wrote:
>
> > Go to any Japanese newspaper. There is no required change of typographic style when Chinese names and placenames are mentioned in the context of Japanese articles about China. Go to any Chinese newspaper. There is no required change of typographic style when Japanese names and placenames are mentioned in the context of Chinese articles about Japan.
>
> Just to be sure: this means that a Japanese newspaper uses the glyphs its readers prefer for Chinese names, not the glyphs which Chinese readers may prefer?

Yes. For obvious reasons.

> Does this extend to the Simplified/Traditional instance, so that if a Chinese name has the word for "horse" in it, it uses the Japanese glyph for "horse", not either the S or T version of the glyph (assuming for the sake of argument that both occur and that both are different from the preferred Japanese glyph)?

Yes. Example: the one-time president of the ROC, known in English as Chiang Kai-shek, has a surname which shows several variants. Traditional Chinese: U+8523. Simplified Chinese: U+848B. Japanese prefers a different, traditional simplification of the glyph for U+848B. You can see the difference in the Unicode 3.0 book charts if you look up U+848B in the charts (p. 693), and then look up the corresponding 0x8FD3 in the Shift-JIS index (p. 931).

In a Japanese newspaper, the Japanese style of U+848B will be present in the font. If the source is a simplified Chinese rendition of Chiang Kai-shek, then the Japanese presentation will simply be the same character, Japanese style. If the source were a traditional Chinese rendition, then the Japanese presentation would also represent a respelling of the name from U+8523 to U+848B (comparable to Schröder -> Schroeder) to get it to use a character for which the appropriate Japanese presentational form is available.
In any case, once the correct spelling is settled on, there is no *stylistic* variation from the rest of the text for the Chinese name embedded in Japanese text. It is clearly recognized in text as an alien, i.e., non-Japanese name, and no attempt would be made to give it a Japanese name reading, but that is merely by virtue of the reader's recognition that U+848B, U+4ECB, U+77F3 is a famous Chinese person -- and it would be sounded out as Shoo Kaiseki (not *Makomo Sukeishi or some other putative Japanese name). --Ken
Re: The result of the Plane 14 tag characters review
This is completely comparable to the fact that my local English-language newspaper doesn't need a German language tag to write Gerhard Schroeder. How about a multilingual newspaper? What of a multilingual newspaper? Take a hypothetical instance of a German/English newspaper, which presented all the news twice -- once in German, and again in English. So the German side says, for example: Nach einem 19 Monate dauernden Stillstand im Nahost-Friedensprozeß und einem zähen achttägigen Verhandlungsmarathon bei Washington haben sich Israels Ministerpräsident Netanjahu und der Vorsitzende der palästinensischen Autonomiebehörde, Arafat, in einer langen Sitzung in der Nacht zum Freitag auf ein Interimsabkommen über „Land für Sicherheit“ geeinigt... Then the English side would say: After a 19 month pause in the Middle East peace process... etc. In such a case, it would make sense to tag the *entire* German text as German, and the *entire* English text as English (and it would probably be done in terms of markup in any case). But it would make no particular sense to start digging into the material and tagging Washington as English (although it is), Israel and Netanjahu as Hebrew (although they are), and Arafat as Arabic (although it is). Embedded quotations of untranslated material, if they occur, perhaps. Well, Chinese and Japanese work the same way. You do whatever adaptation of the names is required for your local language, and then you present them as expected to the reader of *that* language. So, in the above example, Netanjahu for the German reader, Netanyahu for the English reader -- but in neither case presented in the original Hebrew. (In fact, for German, you will also commonly find it spelled Netanyahu -- but you won't find it in Hebrew.) --Ken
Re: mixed-script writing systems
So, the question is this: Should we say that this writing system is completely Latin (keeping the norm that orthographic writing systems use a single script) and apply the principle of unification -- across languages but not across scripts -- to imply that we need to encode new characters, Latin delta, Latin theta and Latin yeru? Or, do we say that this writing system is only *mostly* Latin-based, and that it mixes in a few characters from other scripts? If everyone can hold off on the Kurdish rhetoric for the moment, it should be clear that such mixed orthographies as Peter has shown in Wakhi are best handled by simply using the characters that are already encoded, rather than cloning more and more characters into Latin, Greek, and Cyrillic to deal with the artificial constraint that would claim that any LGC-based alphabet *must* consist only of a single script. In point of fact, people for centuries have been borrowing back and forth between Latin, Greek, and Cyrillic in particular, so that in some respects LGC is a kind of metascript and should be treated as such. Note that we will run across many other examples of such cross-script LGC letter borrowings in various oddball orthographies. One I happen to know about is the publication by Morris Swadesh of extensive texts of Wakashan languages using Cyrillic che (U+0447) in the midst of otherwise Latin letters for what most Americanists would currently use Latin c-hacek (U+010D) instead. It isn't doing anyone any favors to keep cloning such cross-script borrowings into the character encoding standard, *unless* there is strong evidence of script-specific adaptation of the letters after their borrowing. The handling of Latin Q in the otherwise Cyrillic Kurdish alphabet is what makes it the marginal case it is and argues for encoding of a separate Cyrillic Q. 
I do not, however, believe that such arguments apply to cases such as this Wakhi instance, unless Peter or someone else can demonstrate specific Latin-scriptification of the borrowed letters in the orthography. --Ken
Re: The result of the Plane 14 tag characters review.
William Overington asked: As the Unicode Consortium invited public comments on the possible deprecation of plane 14 tag characters, will the Unicode Consortium be making a prompt public statement of the result of the review as soon as the present meeting of the Unicode Technical Committee is completed, or even earlier if the decision of the Unicode Technical Committee has already been finalized? *the Unicode Consortium spokesman steps up to the press conference podium* *the press surges forward eagerly* *flashbulbs start to pop* Ahem... The Unicode Technical Committee would like to announce that no formal decision has been taken regarding the deprecation of Plane 14 language tag characters. The period for public review of this issue will be extended until February 14, 2003. *hands are waved vigorously* *microphones are shoved forward with loud questions* I'm sorry... No..., No..., I have no further response at this time. *the Unicode Consortium spokesman retires hurriedly, followed closely by two burly bodyguards*
Re: In defense of Plane 14 language tags (long)
David Hopwood said: Note that if deprecation implies no longer treating these characters as ignorables, It would not. The only character *property* implication that deprecation of Plane 14 language tags (or any other characters) would have is the requirement that they gain the Deprecated property. (See PropList.txt in the Unicode Character Database.) then that causes new software that sees existing data using plane 14 tags to break (to some extent; probably not fatally). OTOH, if deprecation does not imply treating plane 14 tags as ignorables, then nothing is gained: the complexity of filtering is still there, but the characters can't actually be used. Deprecation in the Unicode Standard does not mean that characters cannot actually be used. In fact, many generic implementations, such as low-level libraries which report character properties, will continue to implement them, precisely because higher-level processes will need to know that the code points in question *are* deprecated (along with whatever other properties they may have). What deprecation in the Unicode Standard means, basically, is that a particular character or set of characters is noted as a horrible encoding mistake, and that any implementer in their right mind would look to use the suggested alternatives as a better way to approach whatever misguided goal the deprecated characters were originally intended to achieve. As Asmus put it: Since we can't remove them, we would deprecate them, so that countless legions of implementers can forget worrying about a feature deemed desirable but never put into practice. --Ken P.S. I have to agree with John Hudson, Asmus, and others that the issue is not about the usefulness of language tagging per se, but whether Plane 14 language tag characters themselves, as currently defined, are an appropriate mechanism for indicating language tags in Unicode (supposedly) plain text. 
Doug's contribution would be more convincing if it dropped the irrelevancies about whether the *function* of language tagging is useful and focused completely on the appropriateness of this *particular* set of characters on Plane 14, as opposed to any other means of conveying the same distinctions.
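For the record, the Deprecated property lives in a plain semicolon-delimited file in the Unicode Character Database. Here is a small sketch of how a low-level property library might load it; the SAMPLE lines follow the PropList.txt format but are chosen purely for illustration, and nothing below asserts which characters were deprecated when.

```python
# Sketch: load code-point ranges for a given property from lines in
# the PropList.txt format ("XXXX..YYYY ; PropName # comment").
def parse_proplist(lines, prop):
    """Yield (first, last) code point ranges carrying the property."""
    for line in lines:
        line = line.split("#", 1)[0].strip()   # strip trailing comment
        if not line:
            continue
        rng, name = (field.strip() for field in line.split(";"))
        if name != prop:
            continue
        lo, _, hi = rng.partition("..")
        yield int(lo, 16), int(hi or lo, 16)

SAMPLE = [
    "0340..0341    ; Deprecated # Mn   [2] COMBINING GRAVE TONE MARK..",
    "17A3          ; Deprecated # Lo       KHMER INDEPENDENT VOWEL QAQ",
    "0009..000D    ; White_Space # Cc  [5] <control>..<control>",
]
ranges = list(parse_proplist(SAMPLE, "Deprecated"))
# ranges == [(0x0340, 0x0341), (0x17A3, 0x17A3)]
```

A generic library built this way keeps reporting the property for deprecated code points, which is exactly the behavior described above: higher-level processes need to *know* the characters are deprecated, not pretend they do not exist.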
Re: Names for UTF-8 with and without BOM
Perhaps it is time to think of three other words starting with B, O, M that make a better explanation.) Bollixed Operational Muddle ;-) --Ken
RE: New Charakter Proposal
Dominikus Scherkl replied to Markus: My other suggestion (and the main reason to call the proposed character "source failure indicator symbol" (SFIS)) was intended especially for malformed UTF-8 input that contains overlong encodings. This is a special, custom form of error handling - why assign a character for it? Converting from and to UTF-8 is an everyday topic, very important for all applications handling Unicode. So it is a special case, but a very common one. Therefore it would be nice to have a standardized - application-independent - error handling for it. Also, it is a mechanism useful for many other charsets being converted to Unicode. I've got to agree with Markus here. Among other things, encoding a character which means "conversion failure occurred here" and then embedding it in converted text is just a generic and not very informative way of *representing* a conversion failure. The actual error handling would still end up being up to the application, every bit as much as what an application does today with a U+FFFD in Unicode text is application-specific. Adding this kind of character would then also complicate the task of people trying to figure out how to write converters, since they would then be scratching their heads to distinguish between cases which warrant use of U+FFFD and those which warrant this new SFIS instead. Maybe the distinction seems clear to you, but I suspect that in practice people would become confused about the distinctions, and there would be troubling edge cases. In the particular case of UTF-8, I would consider such a mechanism nothing more than an attempted end run around the tightened definition of UTF-8. It provides another path whereby ill-formed UTF-8 could get converted and then end up being interpreted by some process that doesn't know the difference. In other words, it carries the risk of reintroducing the security issue that we've been trying to get legislated away, by finding a way to make it o.k. to interpret non-shortest UTF-8.
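To make the tightened definition concrete, here is a minimal Python sketch (an illustration of the point, not anything posted in the thread) of how a conforming converter treats the classic overlong sequence C0 AF for U+002F SOLIDUS: strict decoding must reject it, and a lenient conversion substitutes U+FFFD rather than quietly interpreting the non-shortest form.

```python
# Sketch: C0 AF is a non-shortest-form (overlong) encoding of "/".
# A conforming UTF-8 decoder must reject it; lenient conversion
# substitutes U+FFFD REPLACEMENT CHARACTER for each ill-formed byte
# instead of ever producing the solidus.
overlong_slash = b"\xc0\xaf"

def convert(data):
    try:
        return data.decode("utf-8")                    # strict: rejects C0 AF
    except UnicodeDecodeError:
        return data.decode("utf-8", errors="replace")  # U+FFFD substitution

result = convert(overlong_slash)
# result == "\ufffd\ufffd"; the "/" never reaches the consumer
```

This is exactly why a conversion-failure character buys nothing over U+FFFD: the substitution already marks the damage, and the security property depends on the "/" never being reconstructed, not on which replacement symbol appears.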
You could just use an existing character or noncharacter for this, e.g., U+303E or U+ or U+FDEF or similar. This is what I do in the meantime. But it's uncomfortable, because most editors display all noncharacters, unassigned characters, or characters not in the font the same way - which hides the INDICATION. The SFIS should be displayed, to remind the reader that only THIS is an SFIS, unlike all the other empty squares in the text. Your suggested encoding U+FFF8 wouldn't work this way, by the way. U+FFF8 is reserved for format control characters -- and those characters display *invisibly* by default -- not as an empty square (or other fallback glyph) like miscellaneous symbols which happen not to be in your fonts. I think Markus's suggestion is correct. If you want to do something like this internally to a process, use a noncharacter code point for it. If you want to have visible display of this kind of error handling for conversion, then simply declare a convention for the use of an already existing character. My suggestion would be: U+2620. ;-) Then get people to share your convention. I'm not intending to be facetious here, by the way. One problem that character encoding runs into is that there are plenty of people with good ideas for encoding meanings or functions, and those ideas can end up turning into requests to encode some invented character just for that meaning or function. For example, I might decide that it was a good idea to have a symbol by which I could mark a following date string as indicating a death date -- that would be handy for bibliographies and other reference works. Now I could come to the Unicode Consortium and ask for encoding of U+ DEATH DATE SYMBOL, or I could instead discover that U+2020 DAGGER is already used in that meaning by some conventions. There are *plenty* of symbol characters available in Unicode -- way more than in any other character encoding standard.
And it is a much lighter-weight process to establish a convention for use of an existing symbol character than it is to encode a new character specifically for that meaning/function and then force everyone to implement it as a new character. Additionally, I think we should have a standardized way to display old UTF-8 text without losing information (overlong UTF-8 was allowed for years). Not really. And in any case, there is nothing to be gained here by displaying old UTF-8 text without losing information. The way to deal with that is to *filter* it into legal UTF-8 text, by means of an explicit process designed to recover what would otherwise be rejected as illegal data. - Glyphing is not a fine way, and simply decoding the overlong forms is not allowed. This is a self-made problem, so Unicode should provide an inherent way to solve it. There are plenty of ways to solve these things -- by API design or by specialized conversions designed to deal with otherwise unrepresentable data. But trying to bake conversion
RE: Character identities
Michael asked: My eyes have glazed over reading this discussion. What am I being asked to agree with? Here's the executive summary for those without the time to plow through the longer exchange: Marco: It is o.k. (in a German-specific context) to display an umlaut as a macron (or a tilde, or a little e above), since that is what Germans do. Kent: It is *not* o.k. -- that constitutes changing a character. [Sorry, guys, if I have ridden roughshod over the nuances... ;-)] Michael, you might have to recuse yourself, however, since when it was suggested that displaying Devanagari characters with snowpeaked glyphs for a Nepali hiking company would be o.k., you misunderstood and suggested private use characters! --Ken
Re: Character identities
Hm, what if I want to make, say, snow-capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points? That's what Private Use code positions are for. -- Michael Everson * * Everson Typography * * http://www.evertype.com Um, Michael, I think António was talking about glyphs in a decorative font, which should -- clearly -- just be mapped to ordinary Unicode characters, via an ordinary Unicode cmap. Or do you think that the yellow, cursive, drop-shadowed, 3-D letters Getaway! at: http://www.trekking-in-nepal.com/ should also be represented by Private Use code positions? ;-) --Ken
Re: Origin of the term i18n
Raymond Mercier asked: Isn't i18n rather off-list ? Neither Sarasvati nor the self-styled list police have objected. While historical origin discussions are OT, they do seem to have an interested following on the Unicode list. Perhaps more to the point, Unicode implementations are all about i18n (or internationalization -- however you want to spell it). And the UTC and L2 committees consider internationalization to be a part of their overall area of concern. And the Unicode conferences definitely cover internationalization issues -- and even some of the details of localization. Is this the same list where people objected to the endless arguments with William Overington ? Yep. But at least nobody on this thread -- to date -- has claimed a new invention, proposed to encode i18n in user space, or proposed lyrics about it to be posted in their family webspace. --Ken ;-)
Re: Origin of the term i18n
Sorry to appear the curmudgeon, but ^^ recte: c8n --K1n
Re: Origin of the term i18n
Mark, Mark, I am curious why you find this term so distasteful? Is it the algorithm itself or just a general objection to acronyms and the like? Or something else entirely? I find this particular way of forming abbreviations particularly ugly and obscure. It is also usually unnecessary; looking at any of the messages brought up by Google, the percentage of 'saved' keystrokes is a very small proportion of the total count. And when it leaks out into the general programmer community, it just looks odd. For me, it is on the same order as using nite for night, or cpy for copy. u shuld just be glad u wont live to see the day when netspeak roolz and ur goofy language is rOxXoRed! --K1n
Re: Historians- what is origin of i18n, l10n, etc.?
W0e n3r u2d t1e g1d-a3l, g3y a1d o5e a10n i18n, h5r! What I don't understand, since these a10n's are in such widespread use among programmers and character encoders, is why they don't use h9l, as in i12n, lan, and gbn? --K1n BTW, these aan's are not only o5e, they are also o4e, but unfortunately, not o6e in use.
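For anyone who missed the trick, the a10n scheme being lampooned above is purely mechanical: first letter, count of interior letters, last letter. A throwaway sketch (mine, not anyone's on the thread):

```python
# Sketch: the numeronym rule behind i18n, l10n, and even c8n.
def numeronym(word):
    if len(word) <= 3:          # too short to compress
        return word
    return word[0] + str(len(word) - 2) + word[-1]

print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
print(numeronym("abbreviation"))          # a10n
print(numeronym("curmudgeon"))            # c8n
```

Which also demonstrates the objection: the mapping is lossy and not invertible, so u still need a chart to read it. ;-)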
Re: ISO 8859-11 (Thai) cross-mapping table
Elliotte Harold asked: The Unicode data files at http://www.unicode.org/Public/MAPPINGS/ISO8859/ do not include a mapping for ISO-8859-11, Thai. Is there any particular reason for this? Just that nobody got around to submitting and posting one. Since there was a lot of discussion about this over the weekend, I took it upon myself to create and post one in the same format as the other ISO8859 tables. Let me know if anybody spots any problems in the table -- but it really is pretty straightforward, as others noted: TIS 620-2533 (1990) with one addition: 0xA0 NO-BREAK SPACE. Doug dug out: These 9 code positions (0xA0, 0xDB..0xDE, 0xFC..0xFF) appear to be undefined in TIS 620.2533. Reference [3] below does show a word separator character at 0xDC, which I interpret as U+200B ZERO WIDTH SPACE, but the other positions are still undefined. Reference [3] is online Tru64 Unix documentation about its Thai support, which claims that: - No-break space. The character code is A0. ... - Word separator. The word separator defined in TIS 620-2533. This despite the fact that the table shown has no no-break space at 0xA0 (and TIS 620-2533 (1990) does not have one), and that 0xDC is undefined in TIS 620-2533, even though the table in the Tru64 Unix documentation shows a word separator there. The table is labelled the TACTIS Codeset for Thai API Consortium/Thai Industrial Standard. I surmise that this is some vendor extension to the actual TIS 620-2533 (1990). The actual standard states clearly (in Thai) that 0x80..0xA0, 0xDB..0xDE, and 0xFC..0xFF are reserved (unassigned), and the tables in the standard match that. So there may be some implementation practice that uses 0xDC for U+200B ZERO WIDTH SPACE in Thai code pages, but that is not part of either TIS 620-2533 (1990) or ISO 8859-11:2001. --Ken
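The mapping itself is simple enough to state as a rule. A sketch (my own restatement of the table just described, not the posted file): bytes below 0xA1 pass through unchanged (which covers ASCII, the C1 range, and the one addition, 0xA0 NO-BREAK SPACE), each assigned Thai byte B maps to U+0E01 + (B - 0xA1), and 0xDB..0xDE and 0xFC..0xFF remain unassigned.

```python
# Sketch: ISO 8859-11 is TIS 620-2533 (1990) plus NO-BREAK SPACE at 0xA0.
UNASSIGNED = set(range(0xDB, 0xDF)) | set(range(0xFC, 0x100))

def iso8859_11_to_unicode(byte):
    """Map one ISO 8859-11 byte to a character, or None if unassigned."""
    if byte in UNASSIGNED:
        return None
    if byte < 0xA1:                     # ASCII, C1 controls, and 0xA0 NBSP
        return chr(byte)
    return chr(0x0E01 + (byte - 0xA1))  # Thai block starts at U+0E01 KO KAI
```

The offset rule holds across the gap: 0xDF lands on U+0E3F THAI CURRENCY SYMBOL BAHT and 0xFB on U+0E5B KHOMUT, the last assigned position.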
Re: Sporadic Unicode revisited
Keld responded: On Wed, Oct 02, 2002 at 02:47:42PM -0400, John Cowan wrote: Mark Davis scripsit: Those mnemonics in (http://www.faqs.org/rfcs/rfc1345.html) are pretty useless in practice, as well as being misnamed. From Webster's: assisting or intended to assist memory. So what about the combination ;S is supposed to aid or assist memory in coming up with U+02BF MODIFIER LETTER LEFT HALF RING? Beats me. ; in many (though not all) mnemonics means ogonek, so its presence here is reasonable, considering that this character (which appears only in ISO-IR-158) is the original High Ogonek. Since ISO-IR-158 is for Saami, perhaps S stands for Saami. Writing S; would erroneously suggest S with ogonek. Well, the S stands for superscript; s here would mean subscript. Or shade, as in:
   .S  2591  LIGHT SHADE
   :S  2592  MEDIUM SHADE
   ?S  2593  DARK SHADE
Or space, as in:
   BS  0008  BACKSPACE (BS)
   SP  0020  SPACE
   IS  3000  IDEOGRAPHIC SPACE
   NS  00a0  NO-BREAK SPACE
(not to be confused with:
   nS  207f  SUPERSCRIPT LATIN SMALL LETTER N)
Or spade, as in:
   cS  2660  BLACK SPADE SUIT
   cS- 2664  WHITE SPADE SUIT
Or Z, as in:
   DS  0405  CYRILLIC CAPITAL LETTER DZE (Macedonian)
Or selected, as in:
   ES  0087  END OF SELECTED AREA (ESA)
   SA  0086  START OF SELECTED AREA (SSA)
Or separator, as in:
   FS  001c  FILE SEPARATOR (IS4)
   GS  001d  GROUP SEPARATOR (IS3)
   RS  001e  RECORD SEPARATOR (IS2)
   US  001f  UNIT SEPARATOR (IS1)
Or square, as in:
   fS  25a0  BLACK SQUARE
   OS  25a1  WHITE SQUARE
   SR  25ac  BLACK RECTANGLE
Or set, as in:
   HS  0088  CHARACTER TABULATION SET (HTS)
   VS  008a  LINE TABULATION SET (VTS)
Or standard, as in:
   KSC 327f  KOREAN STANDARD SYMBOL
Or start, or string, as in:
   SS  0098  START OF STRING (SOS)
   ST  009c  STRING TERMINATOR (ST)
   SX  0002  START OF TEXT (STX)
   SG  0096  START OF GUARDED AREA (SPA)
   SH  0001  START OF HEADING (SOH)
Or substitute, as in:
   SB  001a  SUBSTITUTE (SUB)
Or synchronous, as in:
   SY  0016  SYNCHRONOUS IDLE (SYN)
Or state, as in:
   TS  0093  SET TRANSMIT STATE (STS)
Or shift, as in:
   SI  000f  SHIFT IN (SI)
   SO  000e  SHIFT OUT (SO)
Or single, as in:
   SC  009a  SINGLE CHARACTER INTRODUCER (SCI)
Or sun, as in:
   SU  263c  WHITE SUN WITH RAYS
Or section, as in:
   SE  00a7  SECTION SIGN
Or service, as in:
   SM  2120  SERVICE MARK
Or something-or-other (or spot?), as in:
   Sb  2219  BULLET OPERATOR
   Sn  25d8  INVERSE BULLET
{Excuse me if I tend to confuse those two with antimony and tin, respectively, creating a mnemonic antinomy.} Or, of course, S:
   S   0053  LATIN CAPITAL LETTER S
The wondrous thing about this set of mnemonic symbols is that you need a mnemonic system to remember all the mnemonics. --Ken
Re: Sporadic Unicode revisited
John Cowan responded to Rick: (BTW, I agree with Mark about those ISO 14755 [recte: RFC 1345] abbreviations... They aren't very mnemonic. Many people have the charts available, so there is no great advantage to using mnemonics over simply using numbers or palettes.) They are easy to type, and what is more, easy to proofread. (This is the same argument I just made defending the ISO/SGML named character entities.) I agree that *some* of the ideas behind the mnemonics in RFC 1345 make sense. The idea of typing a' for a-acute, for example, is quite widespread, and useful in some circumstances. But RFC 1345 is so full of flaws as a system that it just falls in on itself. By insisting on only using the portable character set instead of ASCII, it can't do the obvious for grave, circumflex, and tilde accents, for example, so you get:
   a!  00e0  LATIN SMALL LETTER A WITH GRAVE
   a>  00e2  LATIN SMALL LETTER A WITH CIRCUMFLEX
   a?  00e3  LATIN SMALL LETTER A WITH TILDE
instead of the obvious and widely used: a`, a^, a~. Attempting to extend the system to Greek, Cyrillic, Hebrew, and Arabic just (in my opinion) results in mnemonics that are harder to remember than the character names, even. What is the real advantage of s*, s=, S+ and s+ over sigma, es, samekh and seen for occasional usage? You end up having to look up all those mnemonics in a table anyway, if you actually want to use them. And the system gets even sillier when it is expanded to some arbitrarily defined subset of 10646 symbols and other characters, resulting in ample evidence of the inextensibility of a basically two-letter scheme when attempting to represent a large arbitrary set of things. Combinations like '? are not particularly easier to type than ~ or even tilde, and there are many similar examples. But most of all, in my opinion, the RFC 1345 mnemonics fail a fundamental criterion: a very substantial portion of them are just not *memorable*. --Ken
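To see both the appeal and the flaw side by side, here is a toy expander for a handful of the RFC 1345 mnemonics (a tiny excerpt of my own choosing; the full table in the RFC runs to thousands of entries):

```python
# Sketch: two-character mnemonic lookup in the RFC 1345 style.
# '!' stands in for the grave accent and '>' for the circumflex,
# since those ASCII characters fall outside the RFC's working set.
MNEMONICS = {
    "a'": "\u00e1",   # LATIN SMALL LETTER A WITH ACUTE -- memorable
    "a!": "\u00e0",   # ... WITH GRAVE                  -- less so
    "a>": "\u00e2",   # ... WITH CIRCUMFLEX
    "a?": "\u00e3",   # ... WITH TILDE
    "s*": "\u03c3",   # GREEK SMALL LETTER SIGMA        -- chart required
}

def expand(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in MNEMONICS:
            out.append(MNEMONICS[pair]); i += 2
        else:
            out.append(text[i]); i += 1
    return "".join(out)

print(expand("a'a?s*"))   # áãσ
```

The a' entry sells the idea; the a! and s* entries are the counter-argument, since nothing about them assists memory.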
Re Permission to reproduce?
Martin Kochanski asked: I want to post a Cardbox database on our Web site (Cardbox is the database that we sell) that contains a list of all Unicode characters: hexadecimal code, decimal code, character, and character name (e.g., GREEK CAPITAL LETTER OMEGA WITH TONOS). The first three of these elements are in the public domain, but it strikes me that the character names might be considered to be a literary work and therefore copyright. Does anyone know whether I do in fact need to ask permission before listing those names, and if so, whom I need to ask? In case it wasn't clear from the short discussion that followed, let me state for the record: The character names are a normative part of the Unicode Standard, and are also identically defined as a normative part of the International Standard, ISO/IEC 10646 (English version). They are, indeed, a part of those publicly available standard(s), intended for free, unrestricted use by all users of those standard(s). So you don't need to ask anyone's permission to list or otherwise use those character names. You *would* have to ask permission (from the Unicode Consortium) before reproducing the exact *form* of the Unicode code charts, as printed in the Unicode Standard itself, since the form of the charts and associated name lists printed there *are* under copyright. --Ken
Pound and Lira (was: Re: The Currency Symbol of China)
Marco Cimarosti scripsit: The same should be true for the £ sign. But unluckily, for some obscure reason, Unicode thinks that currencies called pound should have one bar and be encoded with U+00A3, while currencies called lira should have two bars and be encoded with U+20A4. Every character has its own story. Can the old farts^W^Wtribal elders shed any light on this one? Not much. The proximate cause of the inclusion of U+20A4 LIRA SIGN in 10646 was: WG2 N708, 1991-06-14, Table of Replies (to the ballot on 10646 DIS, DIS-1). That document contains the U.S. comments asking for all the additions which would synchronize the DIS repertoire with the Unicode 1.0 repertoire, and that included U+20A4 LIRA SIGN. It is a deeper subject to figure out how the LIRA SIGN got into Unicode 1.0 in the first place, and I don't have all the relevant documents to hand to track it down. It was certainly already in the April 1990 pre-publication draft of Unicode 1.0 which was widely circulated. I do recall the issue of one-bar versus two-bar yen/yuan sign being researched in detail and being explicitly decided. I also recall explicit (and tedious) discussions about the various dollar sign glyphs. I do not, however, recall any time spent in discussing the analogous problem of glyph alternates for the pound/lira sign, although it was probably mentioned in passing. So it is possible that the lira sign simply derives from a draft list that was standardized without anyone ever spending time to debate the pound/lira symbol unification first. It was probably in the same lists that distinguished yen/yuan sign before it was determined that distinguishing those two as a *character* was untenable. Those were heady days. It is generally much easier to track down why something was added post-Unicode 1.0 than it is to figure out how something got into Unicode 1.0 in the first place. 
To quote from a particularly memorable email I sent around on April 4, 1991 about an unrelated mistake that was almost made: The High Ogonek is symptomatic of one of the things wrong about the character standardization business, which encourages the blithe perpetuation of mistaken 'characters' from standard to standard, like code viruses. At least, in the past, the epidemic was constrained by the fact that the encoding bodies only had 256 cells which could get infected by such abominations as half-integral signs. Now, however,... the number of cells available for infection is vast, and the temptation to encode everybody else's junk just seems to have become irresistible... ...I don't think I would be telling any tales out of school if I revealed that Unicode almost got a 'High ogonek', too, since Unicode was busy incorporating all the 10646 mistakes in Unicode while 10646 was busy incorporating all the Unicode mistakes in 10646. ... --Ken
RE: The Currency Symbol of China
Barry Caplan wrote [further morphing this thread]: I also think (but I could be wrong) that ye is not one of the characters in the famous Buddhist poem that uses each of the kana once and only once, and establishes a de facto sorting order by virtue of being the only such poem. OTOH, I am pretty sure that poem is either from or post-dates the Heian era, so it wouldn't rule out your point. In a totally different context, I was looking into this recently and found some stuff the list might find amusing. The kana that is usually missing from the poem is -n, i.e. U+3093. <quote myself> P.S. In case you don't have it already, the i-ro-ha order is: i ro ha ni ho he to chi ri nu ru wo wa ka yo ta re so tsu ne na ra mu u wi no o ku ya ma ke fu ko e te [^ that is one e] a sa ki yu me mi shi ye hi mo se su [^ that is the other -- probably should be (w)e] See, e.g., http://ccwww.kek.jp/iad/fink/western/wIJ2.html [Attributed to middle Heian, around A.D. 1000.] It was actually printed in the Unicode 1.0 book, when the circled Katakana characters at U+32D0..U+32FE were in i-ro-ha order. That was changed for Unicode 1.1, to synch up with the preferred a-i-u-e-o order for these characters in 10646. BTW, the translation of Kukai's iroha poem at that link leaves much to be desired, though the various versions shown in hiragana, katakana, and with kanji are interesting. A much, much better translation can be found at: http://www.raincheck.de/html/i-ro-ha___english.html or, in German(!), at: http://www.raincheck.de/html/i-ro-ha.html The English translation is quite literal. The German -- how shall I put it -- takes some poetic license. ;-) </quote myself> Or, for a really challenging version, you can try puzzling out: http://www.miho.or.jp/booth/html/imgbig/3247e.htm which shows a manyoogana version (all kanji, used syllabically), tacking on the epenthetic U+65E0 mu for the -n, which some versions of the poem do, just to be tidy. --Ken
Re: glyph selection for Unicode in browsers
Tex, 3) The language information used to be derived dubiously from code page and is missing with Unicode, and the architecture needs to accommodate a better model for bringing language to font selection. The archetypal situation is for CJK, and in particular J, where language choice correlates closely with typographical preferences, and where character encoding could, in turn, be correlated reliably with language choice. But in general, the connection does not hold, as for data in any of hundreds of different languages written in Code Page 1252, for example. What you are really looking for, I believe, is a way to specify typographical preference, which then can be used to drive auto-selection of fonts. I don't think we should head down the garden path of trying to tie typographical preference too closely to language identity, however we unknot that particular problem. This could get you into contrarian problems, where browsers (or other tools) start paying *too* much attention to language tags, and automatically (and mysteriously) override user preferences about the typographical presentation they expect for characters. What is needed, I believe, is: a. a way to establish typographic preferences b. a way to link typographical preference choices to fonts that would express them correctly c. a way to (optionally) associate a language with a typographical preference And this all should be done, of course, in such a way that default behavior is reasonable and undue burdens of understanding, font acquisition, installation, and such are not placed on end-users who simply want to read and print documents from the web. A tall order, I am sure. But as long as we are blue-skying about architecture for better solutions, I think it is important not to replace one broken model (code page = language) with another broken model (language = font preference). --Ken
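A back-of-the-envelope sketch of the three-part model above (all names, tags, and font choices here are invented placeholders, not a proposal for any actual API):

```python
# Sketch of (a) typographic preferences, (b) preference -> font,
# (c) an optional language -> preference association, with the user's
# own preference always winning over the document's language tag.
PREF_FONTS = {                 # (b): only the preference selects a font
    "ja": "MS Mincho",         # placeholder font names
    "zh-Hans": "SimSun",
    "zh-Hant": "MingLiU",
}
LANG_DEFAULT_PREF = {"ja": "ja", "zh": "zh-Hans"}   # (c): a default only

def pick_font(lang_tag=None, user_pref=None, fallback="ja"):
    # (a): an explicit user preference overrides any language tag
    pref = user_pref or LANG_DEFAULT_PREF.get(lang_tag, fallback)
    return PREF_FONTS[pref]

print(pick_font(lang_tag="zh", user_pref="ja"))   # MS Mincho
```

The design point is the ordering: the language tag only supplies a default (c), the user's typographic preference (a) can override it, and in every case it is the preference, never the tag directly, that selects the font (b). That keeps tags from mysteriously overriding what the reader expects.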
Re: Sequences of combining characters (from Romanization of Cyrillic andByzantine legal codes)
William Overington asked: While on the topic, how would the following sequence be displayed, please? U+0074 U+0361 U+0073 ZWJ U+0307 Just like: U+0074 U+0361 U+0073 U+0307 The sequence U+0073, ZWJ, U+0307 could request a ligature of the s and the dot above, but since it is unlikely that any type designer is going to actually ligate the dot into the s and produce a ligature glyph for it, the sequence is likely to be rendered as if it were just U+0073, U+0307, that is, an s with a dot above. I am not suggesting this for bibliographic work, just wondering: for the bibliographic work I feel that a new character COMBINING DOUBLE INVERTED BREVE WITH DOT ABOVE might be a good solution. Possibly. It is certainly a simple solution. --Ken William Overington 25 September 2002
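One point worth making explicit: the two sequences remain distinct plain text, and canonical decomposition does not remove or reorder the ZWJ, so the ligation request survives interchange even when renderers ignore it. A quick check (my illustration, not part of the exchange):

```python
# Check: the ZWJ-bearing sequence and the plain sequence are distinct
# code point sequences, and NFD leaves both unchanged, so the
# distinction is preserved in plain text regardless of rendering.
import unicodedata

with_zwj    = "\u0074\u0361\u0073\u200d\u0307"  # t, tie, s, ZWJ, dot above
without_zwj = "\u0074\u0361\u0073\u0307"        # t, tie, s, dot above

assert with_zwj != without_zwj
assert unicodedata.normalize("NFD", with_zwj) == with_zwj
assert unicodedata.normalize("NFD", without_zwj) == without_zwj
```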
Re: Keys. (derives from Re: Sequences of combining characters.)
Peter responded: A document would contain a sequence such as follows. U+2604 U+0302 U+20E3 12001 U+2460 London U+2604 U+0302 U+20E2 You could just as easily have used <S C="12001">London</S> or <S C="12001" P1="London"/> or even: <cometcircumflex messageId="12001">London</cometcircumflex> if one likes the ring of comet circumflex for one's tags. These are only slightly more verbose, but they follow a widely-implemented standard, namely XML, which I think effectively gainsays William's earlier comment: XML does not suit my specific need as far as I can tell. And as far as the idea of having parameterized messages, with translation catalogs, I would join the chorus inviting William to investigate the state of the art before attempting to invent something that already exists in many forms. Or, to further mangle Marco's musical metaphor, as you go round and around on this topic, make sure that you don't mix up the apples *for* the horses with the horseapples *from* the horses. --Ken ;-)
Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)
Charles Cox suggested: Might there be a case for defining an invisible combining enclosing mark (ICEM), which is otherwise identical to the enclosing circle? Then, if I've understood the conventions correctly the sequence: U+0074 U+034F U+0073 ICEM U+0311 U+0307 would give ts with a centrally placed inverted breve and a centrally placed dot above the inverted breve. We have talked about that option. It has a certain elegance to it as well. But implementers are getting very leery of continuing to add invisible format control characters of various types into the mix. They often seem to introduce unanticipated problems for rendering systems. My current feeling is that while we have demonstrable cases of visibly ligated digraphs with dots above in print, it isn't clear that we have a significant data representation problem that *requires* the introduction of some new mechanism -- yet. This stuff *can* all be handled with appropriately designed ligations in fonts, so there are options for display: U+0074, U+0361, U+0073, U+0307 == maps via ligation table to: {t-s-tie-ligature-with-dot-above} glyph even though the default rendering would be: {t-s-dot-tie-ligature} glyph --Ken
Re: Sequences of combining characters (from Romanization of Cyrillic andByzantine legal codes)
Peter said: This stuff *can* all be handled with appropriately designed ligations in fonts, so there are options for display: U+0074, U+0361, U+0073, U+0307 == maps via ligation table to: {t-s-tie-ligature-with-dot-above} glyph I would consider this an anomalous rendering. It is counter-exemplified by figure 7-6 in TUS3.0. I'd be concerned about longer-term problems if we decided to say that this was a valid alternate rendering from {t-s-dot-tie-ligature} glyph Well, yes, it would be anomalous, which is why it would require somebody to go to the trouble to make a special ligation table entry for it. But what longer-term problems are you talking about? I didn't say we should put in a formal rendering *rule* in the Unicode Standard that says something different from Figure 7-6, along the lines of converting one form to the other as above. Look, let's consider again what problem we are trying to solve here. We have two funky forms from the ALA-LC transliteration tables, for which we haven't heard back yet from bibliographic sources whether there actually is any *actual* data representation problem in USMARC records. We can try to invent and promulgate a generic rendering solution for these cases (and anything like them) in the Unicode Standard, despite the fact that they are an edge case of an edge case for Latin script rendering... Or, if it turns out that it isn't a general-enough problem to force everyone to deal with it in terms of generic rendering, we could suggest alternatives: a. markup solutions b. specific font ligation solutions for specialized data Now consider again the function of these things in the ALA-LC transliteration. The Cyrillic transliteration recommendations make rather extensive use of ligature ties. Why? Because the ALA-LC transliteration schemes make some effort to be round-trippable. 
In other words, the Cyrillic transliteration they recommend is not merely a useful romanization that might be in more general use, as for a newspaper, but is a romanization from which, in principle, you ought to be able to recover the Cyrillic it was transliterated from. Thus these schemes distinguish t-s from t-s-tie-ligature, since the ligated form might be a transliteration of a tse or similar letter, whereas the t-s would be a transliteration of a te+es, and so on. In other words, the tie-ligatures are being sprinkled in to make ad hoc digraphs for the transliteration, to aid in recovery of the Cyrillic from the romanization. Now the dots above typically represent an articulatory diacritic, as for palatalization, or the like. So the combination of the two is to indicate: we are transliterating a letter with a palatal (say) diacritic, using a digraph. Do we have alternatives in Unicode for that? Well, yes, depending on whether the problem is: a. enabling exact transcoding of the USMARC data records using ALA-LC romanization recommendations and the ANSEL character set, for interoperability with Unicode systems. or b. typesetting the ALA-LC romanization document guide in Unicode, treating all the data therein as plain text and using generic Unicode rendering rules. I contend that the primary problem is a), and that we ought to examine the general usefulness of this dot-above-double-diacritic and related rendering, before we insist it has to be representable in plain text and go looking for an encoding solution and specify a bunch of rendering rules for it. If the essential requirement here is to capture the data functionality of the transliteration: a roundtrippable form, with a palatal diacritic, using a digraph, we could suggest, for instance: U+0074, U+034F, U+0073, U+0307 or U+0074, U+0307, U+034F, U+0073 where we end up with an explicitly indicated digraph, with a dot-above diacritic (pick which letter you want it on), as a grapheme cluster. 
This is distinct from: U+0074, U+0073, U+0307 or U+0074, U+0307, U+0073 so you have your transliteration round-trippability intact. And for your special-purpose application, which is a Unicode system to display USMARC bibliographic records using the ALA-LC romanization presentation conventions, you add ligation entries to your font so that U+0074, U+034F, U+0073, U+0307 and similar forms using a U+034F GRAPHEME JOINER display with a visible tie-ligature, rather than nothing, despite the fact that no U+0361 double diacritic is being used in the data. Problem solved. Of course, that doesn't mean that your converted USMARC data records involving digraphs for Cyrillic transliteration will display with the tie-ligature in a generic web application using off-the-shelf fonts -- but is that the problem we are trying to solve here? I doubt it. The forms would be legible -- perhaps more legible without the obtrusive ties cluttering them up -- and the data distinctions would still be preserved in such contexts. --Ken
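The distinctness claim above is easy to verify with Python's stdlib unicodedata module: U+034F COMBINING GRAPHEME JOINER has no decomposition and is not discarded by normalization, so the CGJ-marked digraph can never collapse into the plain letter sequence under any Unicode normalization form.

```python
import unicodedata

# t, COMBINING GRAPHEME JOINER, s, COMBINING DOT ABOVE (explicit digraph)
digraph = "t\u034fs\u0307"
# t, s, COMBINING DOT ABOVE (plain letter sequence)
plain = "ts\u0307"

# U+034F survives all four normalization forms, so round-trippability
# of the transliteration is preserved even through normalization.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    d = unicodedata.normalize(form, digraph)
    p = unicodedata.normalize(form, plain)
    print(form, "distinct:", d != p)   # distinct in every form
```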
Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)
William Overington asked: In the discussion about romanization of Cyrillic ligatures I asked how one expresses in Unicode the ts ligature with a dot above. Regarding Ken's response to the Byzantine legal codes matter, it would appear possible that the way that the ts ligature with a dot above for romanization of Cyrillic could be represented in Unicode would be by the following sequence. t U+FE20 s U+FE21 U+0307 The ordinary ts ligature for romanization of Cyrillic being expressed as follows. t U+FE20 s U+FE21 As Peter indicated, the preferred way to represent this graphic ligature tie in Unicode is with the double diacritics, i.e.: t U+0361 s U+FE20 and U+FE21 are compatibility characters, for interoperation, in particular, with the USMARC catalog records using the Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL). See: http://lcweb.loc.gov/catdir/cpso/romanization/charsets.pdf It appears to me that the ts ligature with a dot above, and a similar ng ligature with a dot above, are already needed for the Library of Congress romanization of Cyrillic system. The following directory contains a lot of pdf files. http://lcweb.loc.gov/catdir/cpso/romanization The ts ligature with a dot above can be found on page 2 of the nonslav.pdf file. The ng ligature with a dot above can be found on page 13 of the same file. And, in particular, the ts ligature with a dot above is for an Abkhaz romanization, and the ng ligature with a dot above is for an obsolete Mansi (related to Khanty) romanization. I suspect their actual use is pretty limited. Capital letter versions of the two ligatures are needed as well. Well, this is interesting, since these were *added*, systematically, to the 1997 version of the ALA-LC non-Slavic romanization systems. The 1990 version did not have them. That raises the question of whether these were simply editorial extensions, or were actually *needed* for some bibliographical data. 
I consider it unlikely that all of the capital forms were suddenly discovered between 1990 and 1997 and that a whole bunch of USMARC bibliographical records making use of the capital forms were created during that interval. In this regard, one should *read* the ALA-LC document. See charsets.pdf: The transliterations produced by applying ALA-LC Romanization Tables are encoded in machine-readable form into USMARC records. Encoding of the basic Latin alphabet, special characters, and character modifiers listed in this publication is done in USMARC records following two American National Standards; the Code for Information Interchange (ASCII) (ANSI X3.4), and the Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL) (ANSI Z39.47). Each character is assigned a unique hexadecimal (base-16) code which identifies it unambiguously for computer processing. The current version of how that is done is the MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Among other things, that specification spells out how the combining marks are used with base characters in USMARC records. I don't know, however, if any provision was actually made in MARC 21 for these instances of ligature ties with dots above. Perhaps someone familiar with the details of USMARC can answer that. The USMARC records (using ANSEL) *would*, however, be making use of the half ligature characters: 0xEB LIGATURE, FIRST HALF 0xEC LIGATURE, SECOND HALF as well as: 0xE7 SUPERIOD [sic] DOT (s.b. SUPERIOR DOT) It just isn't clear exactly what order these would occur in any hypothetical USMARC record actually using either the Abkhaz or Mansi romanizations in question. I wonder if consideration could please be given as to whether this matter should be left unregulated or whether some level of regulation should be used. 
I think this should depend first on a determination of whether there is a demonstrated need for an actual representation of these sequences -- which ought to be determined by the people responsible for the data stores which might contain them, namely the online bibliographic community. The ALA-LC conventions are not the only alternatives available for representation of Abkhaz and/or Khanty/Mansi data in romanization. In fact, you can find such data on the web using alternative romanizations. So it isn't as if the current gap in figuring out precisely how, in Unicode, to represent a double diacritic with another diacritic applied outside the visible double diacritic on a digraph is preventing anyone from using romanized Abkhaz or Khanty/Mansi data in interchange. --Ken William Overington 18 September 2002
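As a sketch of what a USMARC-to-Unicode conversion for the half-ligature and dot characters mentioned above might look like: ANSEL places combining marks *before* the base letter, while Unicode places them *after*, so a converter has to buffer and reorder. The byte values here (0xEB → U+FE20, 0xEC → U+FE21, 0xE7 → U+0307) follow the published MARC-8 mappings, but, as noted above, the ordering conventions in actual USMARC records for these Abkhaz/Mansi forms are unconfirmed, so treat this as an illustrative assumption rather than a specification.

```python
# Assumed ANSEL/MARC-8 combining-mark byte values (to be confirmed
# against the MARC 21 character set specification):
ANSEL_COMBINING = {
    0xEB: "\ufe20",  # LIGATURE, FIRST HALF  -> COMBINING LIGATURE LEFT HALF
    0xEC: "\ufe21",  # LIGATURE, SECOND HALF -> COMBINING LIGATURE RIGHT HALF
    0xE7: "\u0307",  # SUPERIOR DOT          -> COMBINING DOT ABOVE
}

def ansel_to_unicode(data: bytes) -> str:
    """Convert an ANSEL fragment, reordering prefixed combining marks
    so that they follow their base letter, as Unicode requires."""
    out, pending = [], []
    for b in data:
        if b in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[b])  # mark precedes its base
        else:
            out.append(chr(b))                  # ASCII base letter
            out.extend(pending)                 # marks now follow the base
            pending.clear()
    return "".join(out)

# "ts" with a ligature tie: EB t EC s  ->  t U+FE20 s U+FE21
print(ansel_to_unicode(bytes([0xEB]) + b"t" + bytes([0xEC]) + b"s"))
```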
Re: Sequences of combining characters (from Romanization of Cyrillicand Byzantine legal codes)
The ALA-LC conventions are not the only alternatives available for representation of Abkhaz and/or Khanty/Mansi data in romanization. In fact, you can find such data on the web using alternative romanizations. So it isn't as if the current gap in figuring out precisely how, in Unicode, to represent a double diacritic with another diacritic applied outside the visible double diacritic on a digraph is preventing anyone from using romanized Abkhaz or Khanty/Mansi data in interchange. By the same argument, Unicode might as well stop taking new characters; surely, between the 500 Latin characters and dozens of punctuation marks and combining characters and the other 70,000 characters, you can find a way to communicate whatever language or data you need communicated. Of course. Let them use ASCII, for that matter. But that wasn't my point. There is no particular evidence that the ALA-LC conventions with the dot above the graphic ligature ties is in widespread use for romanizations of these particular languages, that I can see. So the *urgency* of solving this problem isn't there, unless the LC/library/bibliographic community comes to the UTC and indicates that they have a data interchange problem with USMARC records using ANSEL that requires a clear representation solution in Unicode. And before we go there, I'd like to have a clear specification of how it works in USMARC records, so we would know how to do the following conversion: USMARC -- Unicode for the two forms in question. The 1990 version of the LC romanizations for this non-Slavic stuff used all kinds of hand-drawn forms. And even the 1997 version of the ALA-LC document is photo-offset from pages that include various kinds of pasteup from who-knows-what sources, including some hand-drawn, with at least one of these dots above being added by hand. 
So it isn't clear that there is any connection between the ALA-LC document text and the ANSEL character encoding actually used in the USMARC records; this could be arbitrary markup with some system like TeX for publication. BTW, if we are blueskying about this topic, the *elegant* way to resolve this would be to recategorize all the double diacritics as *enclosing* combining marks (Me), rather than Mn, and then rewriting the conventions for their use to match those of the enclosing circle and such. Then they would subtend (or supertend) any grapheme cluster, including arbitrary digraphs indicated with a COMBINING GRAPHEME JOINER character. And a dot above would neatly apply to the entire subtended cluster, as for circled characters, and so on. Of course, that would invalidate anybody's current usage of the characters. Oh well, you can't win 'em all. --Ken
Re: French or German Unicode Names??
Ms. Hughes, ISO/IEC 10646-1:2000, which is exactly correlated with the Unicode Standard, Version 3.0, is available in French. You can purchase a copy from ISO: http://www.iso.ch/ (Go to the ISO Store section of the site and search for the ISO number 10646.) I don't know of any German translation of all the character names. As far as I know, German users of the standard simply make use of the English names of the characters. But you could confirm by contacting the German standards organization, DIN: http://www.din.de [EMAIL PROTECTED] --Ken Whistler - Begin Included Message - -Original Message- Date/Time: Tue Sep 17 01:20:08 EDT 2002 Contact: [EMAIL PROTECTED] Report Type: Other Question, Problem, or Feedback Hi, We are trying to find a set of the Unicode tables with the character labels in French, and one where they are in German. Do you have these available, or can you point us in the direction of where we might find them please? Kind regards, Maryanne Hughes Technical Writer, Pulse Data International, New Zealand (End of Report) - End Included Message -
UTF-8 (was Re: Mercury News: Hawaiian on a Mac)
Markus Scherer responded: Stefan Persson wrote: This links to a different page on the same server: http://www.cl.cam.ac.uk/~mgk25/unicode.html That page contains a strange UTF-8 table: ... The last two byte sequences are invalid. Markus Kuhn's page shows the original ISO 10646 definition. And still the current ISO/IEC 10646 definition: Table D.1 in Annex D, UCS Transformation Format 8 (UTF-8). Note that the definition of the 5- and 6-byte UTF-8 sequences for code positions past U-0010FFFF is essentially harmless, as ISO/IEC 10646 now contains explicit language indicating the non-intention to encode any characters at code positions past U-0010FFFF. So the definition of the 5- and 6-byte sequences is vacuous -- no such sequence will ever be a valid representation of an *encoded character* in 10646. This necessarily includes all codes up to 7FFFFFFF. It also includes D800..DFFF, which is not allowed in Unicode 3.2 and the RFC on UTF-8, and I think implicitly not allowed in ISO 10646. They are *explicitly* not allowed in UTF-8 in ISO/IEC 10646 as well. From Clause D.4 Mapping from UCS-4 form to UTF-8 form: Values of x in the range D800 .. DFFF are reserved for the UTF-16 form and do not occur in UCS-4. The mappings of these code positions in UTF-8 are undefined. --Ken
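Modern UTF-8 decoders reflect exactly this restriction. For example, Python's built-in decoder rejects the old 5- and 6-byte forms as well as encoded surrogates:

```python
# Byte sequences that were well-formed under the original Annex D tables
# but are invalid under the modern definition of UTF-8:
samples = [
    ("5-byte sequence (old U-00200000)", b"\xf8\x88\x80\x80\x80"),
    ("6-byte sequence (old U-7FFFFFFF)", b"\xfd\xbf\xbf\xbf\xbf\xbf"),
    ("encoded surrogate U+D800",         b"\xed\xa0\x80"),
]
for label, data in samples:
    try:
        data.decode("utf-8")
        print(label, "-> accepted")
    except UnicodeDecodeError:
        print(label, "-> rejected")   # all three are rejected
```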
Re: various stroked characters
Peter, Here's my take on your questions. The less clear cases involve b, d and g. 1) Lower case b with a horizontal stroke through the bowl (hereafter b-stroke-bowl) is used in some phonetic traditions for voiced bilabial fricative (beta, in IPA). The annotation for U+0180 (b with a horizontal stroke across the ascender) indicates that one of its intended purposes is for phonetic transcription of the same phone. Of course, U+03B2 (beta) also has this function and is not unified with 0180, but these are clearly distinct characters (e.g. 0180 and 03B2 have other unrelated functions). I can't imagine anyone using b-stroke-bowl contrastively with 0180. Thus, probably the best option is to treat b-stroke-bowl as a typographic variant of 0180. Any opinions confirming this view or to the contrary? I agree. This is what Pullum and Ladusaw called the Barred B, as opposed to the Indo-European Crossed B (i.e. U+0180): By a general convention, barred stop symbols (with a superimposed hyphen or short dash through the body of the letter) are often used to represent those fricatives for which the IPA symbols are not used. The resultant symbols have the advantage of being easy to type on an unmodified typewriter. By the way, there is also the Slashed B, which is another alternative form for the Barred B, used for the same purpose, but instantiated by typing b backspace / instead of b backspace -. For what it is worth, the founders of Unicode considered these three forms to be allographs of an abstract barred-b character, so that is what the current situation is. Trying to separately encode a Barred B distinct from the Crossed B would, at this point, constitute an explicit disunification, rather than simply a discovery of an overlooked character to encode. 2) Next, consider the g. The representative glyph in TUS3.0 for U+01E5 shows a double-bowl g with a horizontal stroke through both sides of the bottom bowl. The annotation indicates that it is used for Skolt Saami. 
Looking at a few fonts, I see some variations: Andale Mono and Code 2000 have a double-bowl g with a horizontal stroke through *the right side only* of the lower bowl; Lucida Sans Unicode and Arial Unicode MS have a single-bowl g with a horizontal stroke through the right side only of the bowl. Pullum and Ladusaw show two other glyphic alternatives: Barred G with an IPA style g and a horizontal stroke through the bowl. Crossed G with an IPA style g and a horizontal stroke through the descender. Now, what I'm concerned with is a g (single-bowl in all instances I'm familiar with) that has a horizontal stroke through both sides of the (upper -- only) bowl, used in some phonetic traditions to represent a voiced velar fricative (IPA gamma). Any opinions on whether to treat this as a new character or as a typographic variant of U+01E5? All allographs of the same underlying character. The same concepts and analogies apply here. The Crossed G was probably explicitly formed by analogy from the more-attested Crossed B and Crossed D. The ones with horizontal strokes through the bowl are all just variants on what happens when you backspace and put a hyphen across your g. 3) Finally, the d. Unicode has three upper-case stroked-d characters for which the representative glyphs are identical, but which have distinct lower-case counterparts (the basis for having three distinct upper-case characters). Of the three pairs, two really aren't relevant to this discussion. The one relevant pair is U+0110 LATIN CAPITAL LETTER D WITH STROKE, and U+0111 LATIN SMALL LETTER D WITH STROKE. Now, in some phonetic traditions, a d with a horizontal stroke through the bowl (both sides) is used for a voiced interdental fricative (IPA U+00F0). Some phonetic traditions represent this using U+0111. 
I've also learned of some African languages that are written with upper and lower stroked d; I've seen samples that show some glyph variation: some samples show a horizontal stroke that crosses both sides (both upper and lower case); other samples show the horizontal stroke on only one side -- through the stem of the upper case (just like U+00D0, U+0110 and U+0189), and through the right side of the bowl of the lower case (not through the ascender, as shown in the charts for U+0111). So, again: any opinions on whether d-stroke-bowl should be unified with U+0111 or considered a new character? Again, all allographs of the same underlying character. And once again, as for b, there are, in addition to the Crossed D and Barred D allographs, also a Slashed D allograph. There is no need to proliferate distinct encodings for these, whether the bars of the Barred D forms go all the way across or just partway across either the lowercase and/or the uppercase forms. Those are just various typographic attempts to do decent design for the letter forms based on the concept of having to apply a horizontal stroke to the d/D.
RE: Double Macrons on gh...
Robert Wheelock asked: Recently, I read some messages saying that there are 3 new double-wide overstruck accents proposed for Unicode: Umm. Well, they aren't double-wide and they aren't overstruck, and their names are not: 035D: double-wide breve 035E: double-wide macron 035F: double-wide underbar (d-w combining low line) but rather: 035D COMBINING DOUBLE BREVE 035E COMBINING DOUBLE MACRON 035F COMBINING DOUBLE LOW LINE Please send me more info (and some documentation) on those accents. These would occur in sequences such as: o, combining double breve, o to give the effect of a breve stretched over a pair of o's, as often seen in Webster-style dictionary pronunciation guides. Technically, the combining double accents combine with the base letter they follow, but their glyphs would be designed so that they would overhang a following base letter as well. In practice, fonts might simply choose to have ligatures for the entire sequence, to avoid complications of calculating the accent positions dynamically. For more examples, just look in dictionary pronunciation guides. --Ken
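In terms of character sequences, here is a minimal illustration with Python's stdlib unicodedata: the double diacritic logically follows the *first* base letter, even though its glyph spans both.

```python
import unicodedata

# "oo" with a breve stretched over the pair, as in dictionary
# pronunciation guides: o, COMBINING DOUBLE BREVE, o
s = "o\u035do"
for ch in s:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+035D  COMBINING DOUBLE BREVE
# U+006F  LATIN SMALL LETTER O
```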
Re: Revised proposal for Missing character glyph
[Resend of a response which got eaten by the Unicode email during the system maintenance last week. Carl already responded to me on this, but others may not have seen what he was responding to. --Ken] Proposed unknown and missing character representation. This would be an alternate to the method currently described in 5.3. The missing or unknown character would be represented as a series of vertical hex digit pairs for each byte of the character. The problem I have with this is that it seems to be an overengineered approach that conflates two issues: a. What does a font do when requested to display a character (or sequence) for which it has no glyph. b. What does a user do to diagnose text content that may be causing a rendering failure. For the first problem, we already have a widespread approach that seems adequate. And other correspondents on this topic have pointed out that the particular approach of displaying hex numbers for characters may pose technical difficulties for at least some font technologies. [snip] This representation would be recognized by untrained people as unrenderable data or garbage. So it would serve the same function as a missing glyph character except that it would be different from normal glyphs so that they would know that something was wrong and the text did not just happen to have funny characters. I don't see any particular problem in training people to recognize when they are seeing their fonts' notdef glyphs. The whole concept of seeing little boxes where the characters should be is not hard to explain to people -- even to people who otherwise have difficulty with a lot of computer abstractions. Things will be better-behaved when applications finally get past the related but worse problem of screwing up the character encodings -- which results in the more typical misdisplay: lots of recognizable glyphs, but randomly arranged into nonsensical junk. (Ah, yes, that must be another piece of Korean spam mail in my mail tray.) 
It would aid people in finding the problem and for people with Unicode books the text would be decipherable. If the information was truly critical they could have the text deciphered. Rather than trying to engineer a questionable solution into the fonts, I'd like to step back and ask what would better serve the user in such circumstances. And an approach which strikes me as a much more useful and extensible way to deal with this would be the concept of a What's This? text accessory. Essentially a small tool that a user could select a piece of text with (think of it like a little magnifying glass, if you will), which will then pop up the contents selected, deconstructed into its character sequence explicitly. Limited versions of such things exist already -- such as the tooltip-like popup windows for Asmus' Unibook program, which give attribute information for characters in the code chart. But I'm thinking of something a little more generic, associated with textedit/richedit type text editing areas (or associated with general word processing programs). The reason why such an approach is more extensible is that it is not merely focussed on the nondisplayable character glyph issue, but rather represents a general ability to query text, whether normally displayable or not. I could query a black box notdef glyph to find out what in the text caused its display; but I could just as well query a properly displayed Telugu glyph, for example, to find out what it was, as well. This is comparable (although more point-oriented) to the concept of giving people a source display for HTML, so they can figure out what in the markup is causing rendering problems for their rich text content. [snip] This proposal would provide a standardized approach that vendors could adopt to clarify missing character rendering and reduce support costs. By including this in the standard we could provide a cross vendor approach. This would provide a consistent solution. 
In my opinion, the standard already provides a description of a cross-vendor approach to the notdef glyph problem, with the advantage that it is the de facto, widely adopted approach as well. As long as font vendors stay away from making {p}'s and {q}'s their notdef glyphs, as I think we can safely presume they will, and instead use variants on the themes of hollowed or filled boxes, then the problem of *recognition* of the notdef glyphs for what they are is a pretty marginal problem. And as for how to provide users better diagnostics for figuring out the content of undisplayable text, I suppose the standard could suggest some implementation guidelines there, but this might be a better area to just leave up to competing implementation practice until certain user interface models catch on and get widespread acceptance. --Ken
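A minimal sketch of the "What's This?" text accessory described above, using Python's stdlib unicodedata (the function name and output format are invented for illustration): given any selected run of text, displayable or not, it deconstructs the run into its explicit character sequence.

```python
import unicodedata

def whats_this(text: str) -> str:
    """Deconstruct a text selection into its explicit character sequence,
    one code point per line, with the formal Unicode character name."""
    lines = []
    for ch in text:
        # unicodedata.name() has no names for controls/unassigned code
        # points, so fall back to a generic label for those.
        name = unicodedata.name(ch, "<no character name>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return "\n".join(lines)

# Query the t-s-tie-ligature-with-dot-above sequence discussed earlier:
print(whats_this("t\u0361s\u0307"))
```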
Re: The Unicode Technical Committee meeting in Redmond, Washington State, USA.
William Overington inquired: As many readers may know, the Unicode Technical Committee was due to start a four day meeting yesterday at the Redmond, Washington State, USA campus of Microsoft, that is, on 20 August 2002. Here in England I am interested to know of what is happening and to learn of news from the meeting. As Sarasvati has indicated, minutes will be publicly posted in a few weeks. See: http://www.unicode.org/unicode/consortium/utc-minutes.html [BTW, the minutes from the February and April/May meetings have actually been approved, although their status has not been updated to Approved yet on the website page.] It is the early hours of the morning in Washington State at present. It is hoped that when delegates get up for breakfast that they might look in their emails and make early morning responses, or perhaps arrange for an official briefing to be posted later in the day. If I were conducting a live interview with the committee chairman or with an official spokesperson I would ask the following questions. Unfortunately, the UTC has not yet arranged its television contract with ESPN, since character encoding has not generally been considered a mass-appeal spectator sport. However, since I did attend the UTC meeting last week, I may be able to provide up-to-date commentary regarding some of the questions which are not better answered by waiting for the official minutes. * What was discussed yesterday (Tuesday) please, and what formal decisions, if any, were taken please? Wait for the minutes. * How many people attended please? 16 on Tuesday. 18 on Wednesday. Back down to 15(?) on Thursday and Friday. * Is it only companies which are full members of the Unicode Consortium who send delegates to the meeting, or are there also representatives of organizations who do not vote in decisions present as well? The latter. * Will there be a press statement at the close of the meeting please, and if so, will it also be posted in the Unicode mailing list please? 
No, there will not be a press statement. Encoding of a VERTICAL LINE EXTENSION character was not considered of such earth-shattering consequence that it would lead to headlines in the technology press. * Has there been, or is there on the agenda, any discussion of the wording in the Unicode specification about the use of the Private Use Area and, if so, are any changes to that wording being implemented? Not discussed by the UTC last week. This is in the purview of the editorial committee. * Has there been, or is there on the agenda, any discussion concerning the status of the code points U+FFF9 through to U+FFFC please? There has been some discussion recently in the Unicode mailing list about these code points, as regards issues of U+FFF9 through to U+FFFB as an issue, the issue of using U+FFFC as a single issue, and the issue of using U+FFF9 through to U+FFFC all together. Is the committee discussing these issues at all and, if so, are they discussing the matter of whether U+FFFC can be used in sending documents from a sender to a receiver please? Is there any discussion of a possible rewording, or changing of meaning, of the wording about the U+FFF9 through to U+FFFC code points in the Unicode specification please? Not discussed by the UTC last week. This is in the purview of the editorial committee. * Are any matters concerning how the Unicode specification interacts with the way that fonts are implemented being discussed please? Yes. In a general way, this ends up being discussed at every meeting. If so, is due care being taken that as font format is not, at present, an international standards matter that therefore the committee must take great care to ensure that Unicode does not become dependent upon a usage, express or implied, of the intellectual property rights or format of any particular font format specification? The UTC always attempts to exercise due care in what it considers, but it is unclear just what clarification you are asking for here. 
The UTC does not standardize font formats. * Is there any discussion of the possibility of adding further noncharacters please, considering either or both adding some more noncharacters in plane 0 and a large block of noncharacters in one of the planes 1 through to 14? No. * Is the committee discussing the issue of interpretation, namely as to how, if various people read the published specification so as to have different meanings, how people may receive a ruling as to the formally correct meaning of the wording of the specification. This recently arose in relation to the U+FFFC character and has previously arisen in relation to what is correct usage of the Private Use Area, so there are at least two areas where the issue of interpretation has arisen. No. The UTC is a standardization committee, not a court of law. If a problem of interpretation of the standard arises, and if the UTC thinks that is a
Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)
An interesting point for consideration is whether the following sequence is permitted in interchanged documents:

U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB

That is, the annotated text is an object replacement character and the annotation is a caption for a graphic.

Yes, permitted. As would also be:

U+FFF9 U+FFFC U+FFFC U+FFFA
    U+FFF9 Temperature U+FFFA a measure of hotness, related to the
        U+FFF9 kinetic energy U+FFFA energy of motion U+FFFB
    of molecules of a substance U+FFFB
    U+FFF9 variation U+FFFA rate of change U+FFFB
    with time U+FFFC .
U+FFFB

Where the first U+FFFC is associated with a URL with a realtime data feed, the second U+FFFC is a jar file for a 3-dimensional dynamic display algorithm, and the third U+FFFC is a banner ad for Swatch watches.

It seems to me that if that is indeed permissible that it could potentially be a useful facility.

Permissible does not imply useful, however, in this case. It is unlikely that you are going to have access to software that would unscramble such layering in purported plain text, even if you had agreements with your receivers. That is what markup and rich text formats are for.

Note that it is also *permissible* in Unicode to spell permissible as purrmisuhbal. That doesn't mean that it would be a good idea to do so, but the standard does not preclude you from doing so. You could even write a rendering algorithm which would display the sequence of Unicode characters p,u,r,r,m,i,s,u,h,b,a,l with the glyphs {permissible} if you so choose.

--Ken
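The nesting rules implied by the example above (every U+FFF9 anchor eventually closed by a matching U+FFFB, with U+FFFA separators legal only inside an open annotation) can be checked mechanically. This is an illustrative sketch, not anything specified by the standard; the function name and the exact well-formedness rules are my own reading of the annotation-character description:

```python
ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

def well_formed(text: str) -> bool:
    """Check that interlinear annotation characters nest properly:
    U+FFF9 opens an annotation, U+FFFB closes the innermost open one,
    and U+FFFA may appear only inside an open annotation."""
    depth = 0
    for ch in text:
        if ch == ANCHOR:
            depth += 1
        elif ch == SEPARATOR:
            if depth == 0:          # separator outside any annotation
                return False
        elif ch == TERMINATOR:
            depth -= 1
            if depth < 0:           # terminator with no open anchor
                return False
    return depth == 0               # every anchor was closed

# Ken's caption example nests one level deep and checks out:
print(well_formed("\ufff9\ufffc\ufffaTemperature variation with time.\ufffb"))
```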
Re: Furigana
Doug (and Michael also):

What if I *want* to design an annotation-aware rendering mechanism? Suppose I read Section 13.6 and decide that, instead of just throwing the annotation characters away, I should attempt to display them directly above (and smaller than) the normal text, the way furigana are displayed above kanji. This would work not only for typical Japanese ruby, but also for Michael's English-or-Swedish-over-Bliss scenario. It might even be useful in assisting beleaguered Azerbaijanis, for example, by annotating Latin-script text with its Cyrillic equivalent. (Just a thought.) Would this be conformant?

Well, technically conformant, but not wise. If commonly available display and rendering mechanisms are not rendering them as interlinear annotations, then you aren't really providing much assistance here by using a mechanism designed for internal anchors and trying to turn it into something it isn't really up to snuff for. Frankly, you would be much better off making use of the Ruby annotation schemes available in markup languages, which will give you better scoping and attribute mechanisms.

Stop worrying a moment about "Why are these characters standardized, and why the hedoublehockeysticks can't I use them?!" and think about the problem that furigana or any other interlinear annotation rendering system has to address:

a. How are the annotations adjusted? Left-adjusted, centered, something else? And what point(s) are they synched on?

b. If the annotated text or the annotation itself consists of multiple units, are there subalignments? E.g.

    note note      note note
    text text textextextext text

or

    note note note note
    text text textextextext text

c. Can an annotation itself be stacked into a multiline form?

    note note
    note nononononote
    text

d. Can the text of the annotation itself in turn be annotated?

e. Can the text have two or more coequal annotations? And if so, how are they aligned?

f. If the annotation is in a distinct style from the text it annotates, how is that indicated and controlled?

g. How is line-break controlled on a line which also has an annotation?

And so on. This is all the kind of stuff that clearly smacks to me of document formatting concerns and rich text. Why anyone would consider such things to be plain text rather escapes me.

--Ken
Re: Scripts in Unicode 4.0
John Hudson mused: Love the HOT BEVERAGE character, but where's the TALL LOWFAT SOYMILK MOCHA FRAPPUCCINO? Come on guys, there's enough blank spaces in that block for the entire Starbucks beverage menu, especially if you treat things like EXTRA FOAM as a combining character. Well, Starbucks is #550 on the Fortune 1000 list, which puts them ahead of many other members of the Unicode Consortium. Perhaps we should just hold out for them to join the consortium before we start worrying about encoding their beverage menu. --Ken
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
William Overington teased us all unmercifully with:

It occurs to me that it is possible to introduce a convention, either as a matter included in the Unicode specification, or as just a known about thing, that if one has a plain text Unicode file with a file name that has some particular extension (any ideas for something like .uof for Unicode object file)

...or to pick an extension, more or less at random, say .html

that accompanies another plain text Unicode file which has a file name extension such as .txt, or indeed other choices except .uof (or whatever is chosen after discussion), then the convention could be that the .uof file has on lines of text, in order, the name of the text file and then the names of the files which contain each object to which a U+FFFC character provides the anchor. For example, a file with a name such as story7.uof might have the following lines of text as its contents:

story7.txt
horse.gif
dog.gif
painting.jpg

This is a shaggy dog story, right?

The file story7.uof could thus be used with a file named story7.txt so as to indicate which objects were intended to be used for three uses of U+FFFC in the file story7.txt, in the order in which they are to be used.

Or we could go even further, and specify that in the story7.html file, the three uses of those objects could be introduced with a very specific syntax that would not only indicate the order that they occur in, but could indicate the *exact* location one could obtain the objects -- either on one's own machine or even anywhere around the world via the Internet! And we could even include a mechanism for specifying the exact size that the object should be displayed. 
For example, we could use something like:

<img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380 height=260 border=1>

or

<img src="http://www.artofeurope.com/velasquez/vel2.jpg">

I can imagine that such a widely used practice might be helpful in bridging the gap between being able to use a plain text file or maybe having to use some expensive wordprocessing package. And maybe someone will write cheaper software -- we could call it a browser -- that could even be distributed for free, so that people could make use of this convention for viewing objects correctly distributed with respect to the text they are embedded in.

Yes, yes, I think this is an idea which could fly.

--Ken
Re: The mystery of Edwin U+1E9A
John Cowan asked: Where does this strange beast come from? Who would need a lower-case letter with a unique diacritic, and no upper-case equivalent? The U+1Exx block is random junk inherited from 10646 DIS 1. Does anyone understand it?

Semitic transliteration practice, if I recall correctly. Its name is LATIN SMALL LETTER A WITH RIGHT HALF RING, and the right half ring is indeed above the a. We don't have a RIGHT HALF RING ABOVE combining mark, so it only gets a compatibility decomposition. It's not really an *above* diacritic, but a little 02BE hamza half ring sitting at the upper right shoulder. The Unicode 3.0 glyph looks odd to me -- the Unicode 2.0 glyph made more sense. It's more akin to U+0149 as an oddball addition to the standard.

--Ken
Re: Discrepancy between Names List and Code Charts?
This is my first posting to this list so please be gentle with me!

*pounces and begins to play with the little furry creature (gently)*

Can someone help me with this confusion as I am unsure how I should structure these WITH CEDILLA characters in fonts I'm working on.

See TUS 3.0, pp. 162-163 for a discussion of these characters with cedillas (or ogoneks) below. The characters whose names are XXX WITH CEDILLA often (but not always) show variation between glyphs with cedillas and glyphs with commas below (or even other hooklike shapes). This variation is conditioned by at least: the shape of the letter itself, where a rounded bottom or a flat line in the center of the bottom of the character lends itself to a cedilla attachment, but a glyph such as that for a k does not; by the particular language being rendered; by different typographical traditions; and by font styles. The characters whose names are XXX WITH COMMA BELOW are intended to be rendered just with commas below -- ordinarily they should never show up with a cedilla in the glyph. For the Latvian letters you are probably best off following the conventions as currently shown in the code charts and as used in Arial Unicode MS, rather than in earlier fonts.

Am I just displaying my ignorance of European writing systems or does the Unicode Names List and/or the Code Charts need updating???!!!

The names list is correct, and cannot be updated -- the character names are fixed and unchangeable. The Code Charts have been updated already, with the Unicode 3.0 (and later) charts showing the glyph conventions recommended in the discussion in the text of the standard, whereas the Unicode 2.0 (and earlier) charts showed cedillas universally for all of the Latvian characters.

--Ken
Re: Double Macrons on gh (was Re: Tildes on Vowels)
James Kass asked:

Please note that both the UTC and WG2 have approved a new set of combining double accents: U+035D COMBINING DOUBLE BREVE, U+035E COMBINING DOUBLE MACRON, U+035F COMBINING DOUBLE LOW LINE [snip] Now, the question is, how long will it take for the fonts and browsers to catch up on those forms, as well?? The other double combiner marks already work fairly well in default position in existing browsers. These ought to work right out-of-the-box, once fonts include glyphs. Is it safe to include glyphs for the above referenced characters now?

Well, none of the Unicode 4.0 extensions will be entirely safe to use until after the December WG2 meeting in Tokyo. But my personal opinion is that these 3 are pretty unlikely to be disturbed by comments in national balloting between now and then.

--Ken
Re: Furigana
I want to be able to send a Blissymbol string with a gloss in English or Swedish attached. Nothing to do with Japanese whatsoever. Basically, as for all things annotational or interlineating, this is an excellent application for markup. --Ken
Re: Furigana
Michael,

At 14:16 -0700 2002-08-13, Kenneth Whistler wrote: I want to be able to send a Blissymbol string with a gloss in English or Swedish attached. Nothing to do with Japanese whatsoever. Basically, as for all things annotational or interlineating, this is an excellent application for markup.

When this was discussed in WG2 in Japan before they went in, I asked specifically, could I use this method to put Anglo-Saxon glosses on Latin text. The answer was positive, so it received my support. Were these always pre-deprecated? Why are they in the standard if no one is going to be allowed to use them?

Read the discussion which has been published in the Unicode Standard ever since these things were available. TUS 3.0, pp. 325-326:

The annotation characters are used in *internal* processing when out-of-band information is associated with a character stream, very similarly to the usage of the U+FFFC OBJECT REPLACEMENT CHARACTER... Usage of the annotation characters in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver, because the content may be misinterpreted otherwise... When an output for plain text usage is desired and when the receiver is *unknown* to the sender, these interlinear annotation characters should be *removed*...

The Japanese national body was very clear about this, and was opposed to these going into the standard unless such clarifications were made, to ensure that these were not intended for plain text interchange of furigana (or other similar annotations).

--Ken
Re: Furigana
Michael Everson (in training as a curmudgeon) harrumpfed ;-) The Japanese national body was very clear about this, and was opposed to these going into the standard unless such clarifications were made, to ensure that these were not intended for plain text interchange of furigana (or other similar annotations). Well then they oughtn't to have been encoded. Yes, we agree that hindsight is a wonderful skill. This function would better be served by noncharacter code points, but nobody had quite figured out how to articulate that yet. But even at the time, as the record of the deliberations would show, if we had a more perfect record, the proponents were clear that the interlinear annotation characters were to solve an internal anchor point representation problem. Nobody (well, maybe somebody) expected them to serve as a substitute for a general markup mechanism for indication of annotation, and in particular, interlinear annotations. I recall at the time I pointed out that as a linguist I had routinely made use of 4-line interlinear annotation formats, and that this simple anchoring scheme couldn't even begin to represent such complexities in a usable fashion. --Ken
Re: Furigana
Tex asked: But does the standard address their removal by receivers (or intermediaries), and does removing them include removing the contained annotation?

Yes and yes. p. 326:

On input, a plain text receiver should either preserve *all* characters or remove the interlinear annotation characters *as well as the annotating text*...

I can imagine an application that doesn't support I.A. deciding the annotation is out of band and can't be preserved in its plain text output, and so justifiably strips it as well. Does the standard say what to do with "for internal use only" characters?

Yes. Unicode 3.1:

D7b: Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged.

C10: A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points, if that process purports not to modify the interpretation of that coded character sequence.

The interlinear annotation characters fall in a gray zone, since they are not noncharacters, but by rights ought to have been. Since they are standard characters, though, the standard has to provide some guidelines -- and it is simply safer, if you encounter and delete them, to also delete the annotation. You would be changing the interpretation of the text, but in a knowing, intended manner.

I would have thought the rule was to ignore and pass along.

In general, yes, as for everything else, including unassigned code points. If your role in life is as a database, for example, or some other kind of data source or data pipe, then minimal meddling with the bytes is safest. But other kinds of processes will do graduated manipulations, depending on what they are aiming for.

--Ken
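The safer of the two options quoted above (remove the annotation characters *and* the annotating text, keeping only the base text) can be sketched in a few lines. This is an illustrative sketch of the p. 326 guidance, not code from the standard; it handles nested annotations with a small stack:

```python
ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

def strip_annotations(text: str) -> str:
    """Remove interlinear annotation characters as well as the
    annotating text, keeping only the annotated (base) text.
    Nested annotations are handled: a character survives only if no
    enclosing annotation has passed its U+FFFA separator."""
    out = []
    stack = []  # one bool per open annotation: past its separator yet?
    for ch in text:
        if ch == ANCHOR:
            stack.append(False)
        elif ch == SEPARATOR and stack:
            stack[-1] = True            # entering annotating text
        elif ch == TERMINATOR and stack:
            stack.pop()
        elif not any(stack):            # not inside any annotating text
            out.append(ch)
    return "".join(out)

print(strip_annotations("a\ufff9b\ufffanote\ufffbc"))  # abc
```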
Re: Is U+0140 (l with middle dot) ever used?
Keld responded:

On Fri, Aug 09, 2002 at 11:44:40PM +0100, Anto'nio Martins-Tuva'lkin wrote: Hm. But middle dot is not only a letter symbol. It's also used as a bullet, a tab filling, even a box-drawing char. Shouldn't Unicode provide a way to separate this duality?

· has traditionally been used, e.g. in word processors, to visually display a blank character. But it was originally intended in ISO 8859-1 and other places for the Catalan language, which uses it in words such as paral·lel.

However, one cannot ignore the rest of the manifest history of this character. It has also long occurred in Code Page 437 and myriad other IBM and Microsoft Code Pages (IBM GCGID SD63) with a long history of ambiguous usage as punctuation and many other things.

I think · is now listed in Unicode as a separator, and not as alphabetical.

It is actually listed with General Category Po (Punctuation, Other), and not as one of the separator classes. But it also has the diacritic property and the extender property, which most punctuation characters do not. Property-based implementations can take advantage of other properties of U+00B7 to distinguish it from most punctuation.

I think that is an error. How can we correct it?

Changing it out of the General Category Po would disturb what by now is already a long legacy practice for many implementations. It would cause way more problems than the putative problem it is supposed to fix for Catalan. (This despite the fact that unlike the Catalan usage, which actually is more reminiscent of the delimiter usage of a middle dot, as in dictionary syl·la·bi·fi·ca·tion, there are actually quite a number of technically-based orthographies, in the Americas, at least, which use a middle dot simply as a long vowel diacritic.) Word delimitation depends on more than merely the General Category value, anyway, so appropriate word boundary determination can be developed for Catalan and other languages regardless of the General Category Po value for U+00B7. 
(See DUTR #29 on this.) And for identifiers, it is up to particular implementations to determine whether inclusion or exclusion of U+00B7 makes sense for their identifier syntax. What is gained for Catalan by including U+00B7 in identifiers may be offset by confusion that can set in against the usage of U+00B7 as a delimiter punctuation, or as a representation of middle dot operators in mathematical expressions. --Ken Kind regards Keld
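The properties Ken describes are easy to inspect. Python's `unicodedata` module exposes the General Category (though not the Diacritic or Extender properties, so only the Po classification is shown in this sketch), and a naive category-based word-breaker demonstrates exactly why Catalan needs the tailoring he mentions:

```python
import re
import unicodedata

# U+00B7 MIDDLE DOT: General Category is Po (Punctuation, Other),
# even though Catalan uses it word-internally, as in "paral·lel".
middle_dot = "\u00b7"
print(unicodedata.name(middle_dot))      # MIDDLE DOT
print(unicodedata.category(middle_dot))  # Po

# A naive "split on non-word-characters" breaker therefore cuts the
# Catalan word in two -- which is why word boundary determination has
# to look beyond the General Category value alone.
print(re.split(r"[^\w]", "paral\u00b7lel"))  # ['paral', 'lel']
```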
Re: Furigana
Michael asked: At 12:11 -0700 2002-08-08, Kenneth Whistler wrote: Ah, but read the caveats carefully. The Unicode interlinear annotation characters are *not* intended for interchange, unlike the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially, internal-use anchor points. What does this mean? That if I have a text all nice and marked up with furigana in Quark I can't export it to Word and reimport it in InDesign and expect my nice marked up text to still be marked up? Yes, among other things. Surely all Unicode/10646 characters are expected to be preserved in interchange. What have I got wrong, Ken? Your expectation that this stuff will actually work that way. Yes, the characters will be preserved in interchange. But the most likely result you will get is: anchor1textanchor2annotationanchor3 where the anchors will just be blorts. You should not expect that the whole annotation *framework* will be implemented, and certainly not that these three characters will suffice for nice[ly] marked up... furigana. These animals are more like U+FFFC -- they are internal anchors that should not be exported, as there is no general expectation that once exported to plain text, a receiver will have sufficient context for making sense of them in the way the originator was dealing with them internally. By rights, this whole problem of synchronizing the internal anchor points for various ruby schemes should have been handled by noncharacters -- but that mechanism was not really understood and expanded sufficiently until after the interlinear annotation characters were standardized. --Ken
Double Macrons on gh (was Re: Tildes on Vowels)
A propos of this long thread about display of combining macrons in Middle English, morphing from tildes on vowels:

In Mozilla 2002072104, Windows XP, I get perfectly good overlines on yogh (now). I'd be interested in seeing how it looked with the combining macra.

Please note that both the UTC and WG2 have approved a new set of combining double accents: U+035D COMBINING DOUBLE BREVE, U+035E COMBINING DOUBLE MACRON, U+035F COMBINING DOUBLE LOW LINE, for various transcriptions, including common English dictionary pronunciation guide usages. Once these become available in Unicode 4.0, I believe the preferred representation to use for the gh-digraph-overlined would be:

g, combining-double-macron, h

Now, the question is, how long will it take for the fonts and browsers to catch up on those forms, as well?? It might make sense to start testing them now with:

n, combining-double-tilde, g

to see how well they do. (U+0360 COMBINING DOUBLE TILDE)

--Ken

P.S. I'm getting fine display of all the combining marks for the St. Erkenwald test page with MSIE 6.0 running on Windows NT 4.0 (!) with Arial Unicode MS -- only the yoghs are missing. So I'm not sure what the problem is that people are having on Windows XP.
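The recommended sequences can be built and checked directly with Python's `unicodedata` (an illustrative sketch; the character names below are those of the released standard):

```python
import unicodedata

# g + COMBINING DOUBLE MACRON + h: the double diacritic logically
# follows the first base character and visually spans to the second.
gh = "g\u035eh"
assert len(gh) == 3
assert unicodedata.name("\u035e") == "COMBINING DOUBLE MACRON"

# The sequence is already in canonical form: there is no precomposed
# equivalent, so NFC and NFD both leave it unchanged.
assert unicodedata.normalize("NFC", gh) == gh
assert unicodedata.normalize("NFD", gh) == gh

# Same pattern for the test case with the already-encoded double tilde:
ng = "n\u0360g"
assert unicodedata.name("\u0360") == "COMBINING DOUBLE TILDE"
```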
Re: Taboo Variants
Lest everyone go scrabbling off the deep end and drown on this particular thread, I would like to point out the following facts: U+2FDF IDEOGRAPHIC TABOO VARIATION INDICATOR was accepted by the UTC on April 30, 2002. However, when the proposal was taken into WG2 it met a wall of opposition led by China. WG2 did *NOT* accept the character, and it is not a part of the FPDAM 2 currently being ballotted for inclusion in 10646. The UTC will have to deal with this mismatch (along with a number of others) in its upcoming meeting this month. China's clear preference is to simply encode all the taboo variants as separate characters. At the WG2 meeting, they pointed out a number of instances already encoded in Extension B, as you have. And with China not wanting an IDEOGRAPHIC TABOO VARIATION INDICATOR encoded, many other members of WG2 will defer to their opinion on the topic. This issue clearly needs to be worked further in the IRG context before a consensus will emerge. At any rate, don't consider it a done deal. What matters is what eventually gets published in the final, approved Amendment 2 for ISO/IEC 10646, which *will* match what we publish in Unicode 4.0. --Ken
Re: Furigana
Stefan wrote:

Many Japanese word processors already have that capability. HTML4 has a ruby tag exactly for that purpose. And Unicode has characters for that purpose, too.

Unicode: U+FFF9 kanji U+FFFA furigana U+FFFB
HTML4: <ruby><rb>kanji</rb><rt>furigana</rt></ruby>
Examples: U+FFF9 漢字 U+FFFA ふりがな U+FFFB / 漢字ふりがな

Ah, but read the caveats carefully. The Unicode interlinear annotation characters are *not* intended for interchange, unlike the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially, internal-use anchor points.

--Ken
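The parallel between the two representations can be made mechanical. This is a hypothetical sketch (not part of either standard) that converts a flat, non-nested U+FFF9...U+FFFB annotation into the corresponding ruby markup:

```python
def annotation_to_ruby(text: str) -> str:
    """Convert flat (non-nested) interlinear annotations to HTML ruby.
    U+FFF9 opens the annotated (base) text, U+FFFA starts the
    annotating text, and U+FFFB terminates the annotation."""
    return (text.replace("\ufff9", "<ruby><rb>")
                .replace("\ufffa", "</rb><rt>")
                .replace("\ufffb", "</rt></ruby>"))

print(annotation_to_ruby("\ufff9\u6f22\u5b57\ufffa\u3075\u308a\u304c\u306a\ufffb"))
# <ruby><rb>漢字</rb><rt>ふりがな</rt></ruby>
```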
Compatibility and Politics (was Re: Digraphs as Distinct Logical Units)
Roozbeh asked: Expecting the compatibility decompositions to serve this purpose effectively is overvaluing what they can actually do. I would love to hear your opinion about what compatibility decompositions *are* for, then. I feel a little confused here. They are helpful annotations to an earlier version of the standard that got swept up first by changing expectations and then were caught in a normative stasis trap by the normalization specification. Originally, they were a shorthand way of saying things like: This character is not really a 'good' Unicode character -- it should be thought of as a font variant of X. This character is not really a 'good' Unicode character -- it should be thought of as effectively representing the sequence of X, Y, and Z. And so on. The terminology of compatibility character confused everyone, including the people writing the standard, since it meant, on the one hand, characters that didn't really fit the Unicode text model, but which were encoded for compatibility with important standards, for ease of round-trip conversions, mostly. On the other hand, it came to mean characters that had compatibility decompositions, once those were officially specified in the Unicode 2.0 publication, since most compatibility characters had compatibility decompositions. This situation was further confused by the abortive early attempt to encode compatibility characters in a compatibility zone, which resulted in people assuming that if a character was in that zone it automatically *was* a compatibility character and (later) that it should also have a compatibility decomposition. However, compatibility decompositions were originally assigned pretty much by a seat-of-the-pants method, without a clear implementation model to guide all of the decisions. 
As the UTC approached the critical milestone of Unicode 3.0 (and normalization), many of the earlier decompositions were refined and further rationalized, but they still retained some of the helter-skelter context of their annotational origins. The intuition was that the compatibility decompositions sort of made sense for such things as fallback, loose comparison (e.g. for collation and searching), normalizing, and such. However, when detailed specifications started to be written for such things, guided by implementation experience, it turned out that the compatibility decompositions were typically in the ballpark, as it were, but not correct in detail for any one purpose, let alone all purposes.

And the publication of UAX #15 Normalization drastically turned things on their head. Instead of being annotational, and fixable, compatibility decompositions became part of the normative definition of NFKD and NFKC, and became unfixable, because of the requirements of normalization stability. So post-Unicode 3.0, the right way to think of the compatibility decomposition mappings is as the normative data used to define NFKD and NFKC. They bear some resemblance to relationships between characters and character sequences that may be useful in other processes, but in *all* cases should not be taken as a sufficiently precise set of classifications and equivalences for other processes -- there are always going to be exceptions, particularly since compatibility decompositions can no longer be fixed as a result of tuning based on implementation experience.

providing backup rendering when they lack the glyph,

This seems unlikely to be particularly helpful in this *particular* case.

Believe me, it really is. I'm implementing char-cell rendering for Arabic terminals, and when it comes to Arabic ligatures, since I don't want to get into a mess of double width things, I just decompose that ligature, and render the equivalent string. 
It's not as genuine as it might be, but it's automatic, simple, clean, and conformant.

For this kind of application, then, you simply add on decompositions for whatever else cannot be conveniently rendered in a char-cell. Arabic terminal applications have often already departed from what the Unicode Standard specifies in the way of compatibility decompositions by doing special handling of character tails in a separate cell, for example. Note that there isn't any compatibility mapping for U+FEB1 (isolated seen) -- U+FEB3 (initial seen) + U+FE73 (tail fragment), even though that might be what an Arabic terminal could do for display. It isn't non-conformant with the Unicode Standard to transform Unicode characters to alternate representations -- such as a glyph stream for terminal rendering -- it would only be nonconformant to *claim* that such a glyph stream is NFKD data when it departs from that specification.

One other point: We like to discourage the usage of Arabic Presentation Forms, don't we?

Of course. They are compatibility characters for working with the existing legacy code pages that encoded Arabic that way. That is mentioned in TUS 3.0 at the end of the chapter about Arabic. All the
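The distinction Ken draws here -- compatibility decompositions as the normative input to NFKD/NFKC, rather than a general-purpose equivalence -- is easy to observe with Python's `unicodedata` (an illustrative sketch using the Arabic presentation form discussed above):

```python
import unicodedata

# U+FEB1 ARABIC LETTER SEEN ISOLATED FORM is a compatibility character:
# its compatibility decomposition maps it to the nominal letter U+0633.
assert unicodedata.normalize("NFKD", "\ufeb1") == "\u0633"

# NFD, by contrast, leaves it alone -- the mapping is compatibility-only,
# not canonical.
assert unicodedata.normalize("NFD", "\ufeb1") == "\ufeb1"

# decomposition() shows the formatting tag recorded in UnicodeData.txt.
print(unicodedata.decomposition("\ufeb1"))  # <isolated> 0633
```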
Re: Digraphs as Distinct Logical Units
At 04:48 PM 02-08-02, Kenneth Whistler wrote: ... and some extreme case orthographies are known that employ up to *hepta*graphs! Ooo, I want one! Do you have any examples, Ken? If I recall correctly, that one was a technical orthography of Nama -- but I can't track down an online reference at the moment. In the meantime, for a sampler of some of the wild multigraphs used in various orthographies for Khoi and San languages, try http://www.african.gu.se/khsnms.html Examples: '//Ng -- there's a pentagraph for you. //Kx', //Kh' and so on. -- //Kh'en P.S. The San peoples are now apparently vigorously objecting to being lumped with the Khoi peoples as Khoisan. See: http://allafrica.com/stories/200104270244.html
Re: Missing character glyph- example
As a clarification, here is a sample web page: http://www.cardbox.com/missing.htm The requirement is to be able to display the first paragraph of the page in such a way that it makes sense in its reference to the text on the rest of the page. The character after the word this: in the first paragraph cannot be reliably represented by any existing Unicode character. Nevertheless, I believe it is legitimate to want to say what the first paragraph says.

Well, I would put it differently, if it were my web page. Rather than:

    If any of the following text contains characters such as this: {blort} then please change to a different font, or download a more recent version of your current font.

I would suggest something more along the line of:

    If you have trouble displaying any of the characters in the text on this page, please consult <a href="xxx.html">Troubleshooting Display Problems</a>.

Then the troubleshooting page could provide a nice explanation of the problem, show several neatly formatted *graphics* of the kind of nondisplayable glyph issues (with alternate forms picked from various fonts) that a user might run into, and then give helpful links to actual font resources that would help, or in the case of specialized data, actually provide a usable font directly. Such an approach:

A. Avoids font-specific circularity in your attempt to explain to a user what is going on when the display is broken.

B. Provides much more useful information that will actually have a better chance of helping the user get by the problem. Also, since the problem(s) may not only be some nondisplayable glyphs, the approach is extensible for whatever display help is needed.

C. Doesn't depend on dubious assignments of a code point in Unicode for a confusing (non-)use.

But if you insist on having a code point to stick directly in a sentence like that above, I'd take the cue from James Kass: The missing glyph is the first glyph in any font. 
This is mapped to U+0000 and the system correctly substitutes the glyph mapped to U+0000 any time a font being used lacks an outline for a called character. Thus, you have a reasonably good chance that if you try to purposefully display the character U+0000, you will get the missing glyph for the font in use. (Unless the application is filtering out NULL characters.)

--Ken
Re: Missing character glyph
Asmus wrote: At 08:40 PM 7/30/02 -0700, Doug Ewell wrote: a code-point that has no character assigned to it (and is not likely to get one), e. g. U+03A2 No code point is safe. True enough. But then I figure Plane 13 characters like U+DEAD1 are pretty unlikely to be assigned to a character in our lifetimes (or our children's lifetimes). That one is *reasonably* safe to use as an example. ;-) --Ken *remembers when he used to use 0xdeadbeef as a magic number in tests because it was easy to spot in hex displays* A./
Re: REALLY *not* Tamil - changing scripts (long)
It's *much* easier -- and, in the long term, safer -- for them to select from the extensive inventory of characters available in Unicode and to avoid using ASCII punctuation characters with redefined word-building semantics.

I don't get what you are saying here, why should people be limited to ASCII punctuation characters?

That isn't what Peter was saying. You are confused here by your misinterpretation of what he was saying. The recommendation that Peter was making is that people devising orthographies for languages should stick to Unicode letters for the letters of their orthography. (If the script in question is Latin, as most new orthographies are, then there are *hundreds* of Latin letters to choose from in the standard.) What orthography developers should avoid is using characters like 7 ! $ ' and so on as letters of their orthography, since those are certain to cause all kinds of havoc with word-break and other processes for standard software -- or even lead to the kind of absurdities as people wanting illegal constructs like: 'jo'Abr@cd@br.com, which locales can*not* fix.

Just as choices about rational orthographies used to have to take ease of use on typewriters as a major factor involved (to fail to do so would be to condemn legions of people to wretched inefficiency) -- so choices about new rational orthographies should now be taking ease of use on computers as a major factor involved. That is just a realistic approach that any *serious* deviser of an orthography should be taking into account.

With GNU libc you can declare your own set of punctuation characters in the locale, and they can be any 10646 character.

Peter was talking about the opposite case. But you should examine carefully what the implications are of your suggestion here. 
If I were to make the absurd choice of picking 18 Chinese characters to serve as my punctuation characters, and then went through the exercise of declaring my own locale with GNU libc, I would only be guaranteeing that my locale (and all my text data) would only function correctly in a microscopic environment that I defined (or could browbeat a few others to share). The reason for sticking to the Universal Character Set and for sticking to standardized properties for the characters in that set is to guarantee widespread interoperability and to guarantee that my text, in my language, works correctly in all off-the-shelf software -- not merely in my own hacked-up locale. Serious orthography designers should not allow themselves to get stuck in such dead-end traps. --Ken Or are you referring to the specific locale syntax from POSIX/TR 14652? Kind regards Keld
Re: REALLY *not* Tamil - changing scripts (long)
Keld wrote: In Linux, *Which* Linux? :-) Caldera OpenLinux, Corel Linux, Debian GNU/Linux, Elfstone Linux, Libranet Linux, Linux-Mandrake, Phat Linux, Red Hat Linux, Slackware Linux, Stampede GNU/Linux, Storm Linux, SuSE Linux, or TurboLinux? Or for that matter another dozen international distribution Linuxes, or a half-dozen on the Macintosh? for a specific locale, it is relatively easy to get the new locale to work on all off-the-shelf software: you need to write the locale, and submit it to the glibc people, but then - in about 6 months or so, it would be available on all mainstream new Linux distributions, off the shelf. While most of the Linuxes do make use of GNU/C, they don't all do so at the same levels or with the same versions of glibc, and certainly not all at the same times. And all applications would adhere to it, given Linux' advanced i18n technology. I think this is talking through your hat a bit. Do you think that Adobe Acrobat Reader 4.0 PDF viewer on Linux-Mandrake is going to just automatically pick up an Ethiopic locale setting because I happened to submit a locale proposal to the glibc people 6 months earlier? I don't think so. --Ken
Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
One that occurs to me might be the Khoisan languages of Africa, which I believe commonly use ! (U+0021) for a click sound. This is almost exactly the same problem you are describing for Tongva. U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was encoded precisely for this. It is to be *distinguished* from U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems which would attend having a punctuation mark as part of your letter orthography. A Khoisan orthography keyboard should distinguish the two characters (if, indeed, it makes any use at all of the exclamation mark per se), so that users can tell them apart and enter them correctly. --Ken
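The distinction is visible directly in the standard's character properties. The following is a small illustrative sketch (using Python's `unicodedata` and `re` modules, not anything specific to a Khoisan orthography): the click letter carries General Category Lo and so behaves as a word character, while the exclamation mark is Po and breaks words.

```python
import re
import unicodedata

# U+01C3 LATIN LETTER RETROFLEX CLICK is a letter (General Category Lo);
# U+0021 EXCLAMATION MARK is punctuation (Po).
print(unicodedata.category("\u01C3"))  # Lo
print(unicodedata.category("!"))       # Po

# Generic word matching keeps the click letter inside the word, but
# splits the word at the exclamation mark.
print(re.findall(r"\w+", "a\u01C3a a!a"))  # ['aǃa', 'a', 'a']
```

This is exactly the kind of off-the-shelf behavior an orthography inherits for free by using the letter, and forfeits by overloading the punctuation mark.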
God's and devil's details (was: Re: Unicode certification - quote correction and attribution)
[Tex Texin] Actually, (or so I have heard) it is God dwells in the details of our work; I have seen it attributed to Einstein, more generally to shakers, and others. So Ludwig might have been quoting others. [Ken Whistler] And the devil is in the details. Looking a bit at your suggestions, [James Agenbroad] No, God is in the details, Ludwig Mies van der Rohe (1886-1969) said. And the Word Court rules: http://www.theatlantic.com/issues/2000/01/001wordcourt.htm And since I'd rather be associated with the likes of Einstein, Flaubert, and van der Rohe than Nitze, Reagan, and Perot, maybe I'll shift back to God is in the details. --Ken Der liebe Gott lebt im Detail. Le bon Dieu est dans le detail. And that's the beauty of Unicode IMHO.
Re: God's and devil's details (was: Re: Unicode certification - quote correction and attribution)
The correct Einsteinian German appears to be: Der liebe Gott steckt im Detail (cf. http://www.benecke.com/einsteinprogramm.html) (and there are German alternatives such as Gott lebt im Detail) and the satanic alternate is: Der Teufel liegt im Detail (very common, actually, but maybe just calqued from English) Who knows, maybe the concepts were borrowed from Latin to begin with, anyway. And as we can see from this thread God and the Devil do seem to be in the details! --Ken
Re: Abstract character?
Following up on several responses on this thread. Mark Davis said: A small correction to Ken's message: The Unicode scalar value definitionally excludes D800..DFFF, which are only code unit values used in UTF-16, and which are not code points associated with any well-formed UTF code unit sequences. The UTC has decided to make scalar value mean unambiguously the code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate code points. Correct. While surrogate code points cannot be represented in UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate code points are illegal in all UTFs; notably, they are legal in UTF-16. Not to pick nits here... oh well, o.k., I'll pick nits. I stated that D800..DFFF ... are not code points associated with any well-formed UTF code unit sequence. I believe, as stated, that that is correct. An isolated surrogate in UTF-16 is *not* a well-formed UTF code unit sequence. Even by the disputed text of Unicode 3.0, an isolated surrogate code unit in UTF-16 would be an irregular code value sequence. And with the updated relevant text in Unicode 3.2, I think there is even less wiggle-room. The last vestige of irregular code unit sequence vanished in Unicode 3.2 when the loophole for UTF-8 was closed. The Unicode 3.2 standard now reads: Terminology to distinguish ill-formed, illegal, and irregular code unit sequences is no longer needed. There are no irregular code unit sequences, and thus all ill-formed code unit sequences are illegal. It is illegal to emit or interpret any ill-formed code unit sequence. Unicode 4.0 will revise the terminology and conformance clauses in light of this. Ken is pushing for this change; I believe it would be a very bad idea. I believe it is a worse idea to carry forward the claim that (isolated) surrogate code points cannot be represented in UTF-8 (as is definitely the case for Unicode 3.2) while they can be represented in UTF-16.
(I think the reasons have already appeared on this list, so I am not trying to reopen the discussion; just state the current situation.) Doug Ewell followed up: UTF-16 does not allow the representation of an unpaired surrogate 0xD800 followed by another, coincidental unpaired surrogate 0xDC00. (It maps the two to U+10000.) Among the standard UTFs, only UTF-32 allows the two to be treated as unpaired surrogates. Actually, not that, either. In fact, before UTF-8 was tightened up in 3.2, the only UTF that DID NOT permit these two coincidental unpaired surrogates was UTF-16. UTF-8: D800 DC00 == ED A0 80 ED B0 80 (no longer legal) UTF-32: D800 DC00 == D800 DC00 This is ill-formed in UTF-32, and thereby, illegal. - but - UTF-16: D800 DC00 == D800 DC00 == 10000 David Hopwood responded: I think it would be a mistake for the standard to refer to surrogate code points. I think this was already definitely decided by the UTC. The term code point is used for other CCS's where there may also be gaps in the code space; in that case, the gaps are not considered valid code points. I am sympathetic with this point of view, but it isn't easy to draw such a line in practice. Look at the various Asian DBCS sets -- they often had ranges of byte values that were considered invalid as parts of encoded characters, and if you mapped them out to an integral space, you would end up with ranges of integers that were invalid as code points. But when push came to shove, various of these encodings just appropriated some of these ranges to extend themselves, and filled them with more characters. What was an invalid code point became a valid (and assigned) code point. When 0xD800..0xDFFF are used in UTF-16, they are used as code units, not code points. As Unicode code points, 0xD800..0xDFFF are (or at least should be) invalid in the same sense that 0x110000 is. As Unicode code points they are invalid in a different sense than 0x110000 is, actually.
0x110000 could, by the integral transforms involved, be represented by UTF-8 or by UTF-32, but not by UTF-16. 0xD800 could, in principle, be represented by UTF-16, if you allowed the range, but is ruled to be ill-formed in all three UTF's, to avoid the kinds of irregular sequences that the UTC was just at pains to eliminate. I.e. IMHO Unicode scalar value and Unicode code point should be synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF. I think the distinction in ranges is a useful one, since it allows for a bijective definition of the UTF's, based on the Unicode scalar value, but it also gives a meaning to the complete integral range for the code points, as demanded by some of the implementers. code point should be defined as an integer corresponding to an encoded character in any CCS, not just Unicode. This doesn't really work, since it doesn't account for the unassigned (reserved) code points, nor the noncharacters. The Unicode architecture for its codespace is
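As a present-day illustration of the tightened rules discussed in this thread (a Python sketch; the language's standard codecs happen to implement the post-3.2 behavior): a lone surrogate is rejected by every UTF encoder, while the code unit pair D800 DC00 in UTF-16 is well-formed and denotes the single code point U+10000.

```python
# A lone surrogate code point is ill-formed in every encoding form.
lone = "\ud800"
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    try:
        lone.encode(codec)
    except UnicodeEncodeError:
        print(codec, "rejects a lone surrogate")

# In raw UTF-16 bytes, the code units D800 DC00 are a well-formed pair:
# they decode to the single code point U+10000, not to two surrogates.
pair = b"\xd8\x00\xdc\x00"
print(hex(ord(pair.decode("utf-16-be"))))  # 0x10000
```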
Re: Abstract character?
Lars Marius Garshol asked: I'm trying to find out what an abstract character is. I've been looking at chapter 3 of Unicode 3.0, without really achieving enlightenment. The term Unicode scalar value (apparently synonymous with code point) seems clear. It is the identifying number assigned to assigned Unicode characters. Here is one of my attempts at a more rigorous term rectification: Abstract character that which is encoded; an element of the repertoire (existing independent of the character encoding standard, and often identifiable in other character encoding standards, as well as the Unicode Standard); the implicit basis of transcodings. Note that while in some sense abstract characters exist a priori by virtue of the nature of the units of various writing systems, their exact nature is only pinned down at the point that an actual encoding is done. They are not always obvious, and many new abstract characters may arise as the result of particular textual processing needs that can be addressed by characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER, etc., etc.) Code point A number from 0..10FFFF; a point in the codespace 0..10FFFF. Encoded character An *association* of an abstract character with a code point. Unicode scalar value A number from 0..D7FF, E000..10FFFF; the domain of the functions which define UTF's. The Unicode scalar value definitionally excludes D800..DFFF, which are only code unit values used in UTF-16, and which are not code points associated with any well-formed UTF code unit sequences. Assignment (of code points) Refers to the process of associating abstract characters with code points. Mathematically a code point is assigned to an abstract character and an abstract character is mapped to a code point. This is distinguished from the vaguer sense of assigned in general parlance as meaning a code point given some designated function by the standard, which would include noncharacters and surrogates. So far, so good.
Some questions: - are all assigned Unicode characters also abstract characters? Yes. Or rather: all encoded characters are assigned to abstract characters. (See above for my distinction between assigned and designated, which would apply to noncharacters and surrogate code points -- neither of which classes of code points get assigned to abstract characters.) - it seems that not all abstract characters have code points (since abstract characters can be formed using combining characters). Is that correct? Yes. (Note above -- abstract characters are also a concept which applies to other character encodings besides the Unicode Standard, and not all encoded characters in other character encodings automatically make it into the Unicode Standard, for various architectural reasons.) - do U+00C5 (Å) and U+0041, U+030A (A followed by combining ring above) represent the same abstract character? Yes. That is the implicit claim behind a specification of canonical equivalence. --Ken Would be good if someone could clear this up. -- Lars Marius Garshol, Ontopian URL: http://www.ontopia.net ISO SC34/WG3, OASIS GeoLang TC URL: http://www.garshol.priv.no
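That implicit claim can be checked mechanically; a minimal Python sketch using the standard `unicodedata` module:

```python
import unicodedata

precomposed = "\u00C5"        # Å, LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"   # A + COMBINING RING ABOVE

# Canonical equivalence: both spellings normalize to the same sequences.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
print("U+00C5 and U+0041 U+030A are canonically equivalent")
```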
Re: ISO/IEC 10646 versus Unicode
Marion Gunn wrote: The immediate attraction and great advantage of Unicode's vision was its simplicity/focus: after an unsteady and argumentative start, its founders committed Unicode to the IMPLEMENTATION of 10646, and became very specific (loud) about not calling it a STANDARD (note to newcomers - check out the archives of the relevant lists). Well, I'm one of the founders, and I don't recall this particular dichotomy, certainly not LOUDLY stated. I dug around for awhile in my own collection of 1989 - 1993 email, and didn't find any obvious such claims, although I could well have missed someone's assertion. Perhaps you can cite an example of what you are talking about. In any case, the existence of the Unicode Standard, published as a *standard* in 1991, with Volume 2 in 1992, clearly self-proclaiming its status as a standard, would seem to belie your claim. Read the text -- even in Volume 2 of Unicode 1.0, published while the merger was underway, and containing a number of pages devoted to the details of how the repertoire of the Unicode Standard was synched with the then to-be-published ISO/IEC 10646-1:1993, the Unicode Standard didn't proclaim that it was merely an implementation of 10646. Sample of that text: ... These additional elements do not create incompatibility between the Unicode standard and ISO DIS 10646. They are summarized here in order to clarify the relationship between the two standards... While ISO 10646 contains no means of explicitly identifying or 'declaring' Unicode values as such, the Unicode standard may be considered as encompassing the entire repertoire of 10646 and having the following profile values: ...-- p. 3 I expected the ad hoc Unicode consortium itself to voluntarily disband in 3-5 years (wrong again) having successfully fulfilled its brief of producing implementations of 10646 with flying colours (again wrong, as it has yet to do that). I think this is a misunderstanding of the self-understood brief of the Unicode Consortium.
It was ad hoc, certainly, but its purpose was not producing implementations of 10646. The original Purpose of the Unicode Consortium, as stated in the Bylaws filed in the Articles of Incorporation of the corporation on January 3, 1991 was: This Corporation's purpose shall be to standardize, maintain and promote a standard fixed-width, 16-bit character encoding that provides an allocation for more than 60,000 graphics characters. That was two years *before* ISO/IEC 10646-1:1993 was published. To reflect changing reality, following the publication of the Unicode Standard and the introduction of encoding forms (UTF-*), the Bylaws have subsequently been amended to: This Corporation's purpose shall be to extend, maintain and promote the Unicode Standard. This was and is quite clear. The Unicode Consortium is a standardization organization, and its activities revolve around the care and support of the Unicode Standard. It never has been a group just dedicated to figuring out how to implement 10646. but that does not mean any withdrawal of EGTs initial and longstanding support of Unicode, in principle (although it seems to have produced only one thing to date, viz., a book called The Unicode Standard (where I expected to read Implementation). See above. --Ken
Re: Basic question: types of diacritics marks
Adam asked: I have a very basic question. What would be the implementation differences of diacritic marks in a font? For example, we'd consider: U+00B4 acute accent U+02CA modifier letter acute accent U+0301 combining acute accent What are the common recommendations regarding the glyphs in a font (TrueType), especially with respect to the metrics? Should I support all three above codepoints? If so, can I do this with one glyph? Or should I provide separate glyphs? To elaborate on what Michael Everson said, I think the answer here is that you should probably provide separate glyphs. U+00B4 would typically have the spacing width of an en, or thereabouts, since it is the spacing clone of a combining mark acute, and on average, you would expect it to have an en character width. It also gets used for fallback displays, as for Latin-1 `curly´ quotes using grave and acute instead of real quotation marks (an extension of ASCII `curly' quotes using grave and apostrophe), for primes in character sets that don't really have one (also as an alternate to apostrophe) [cf. U+2032], for email-type indication of accents on l´et´t´ers´ that you don't have actual codes for, and the like. So you need to make it look appropriate for such uses. U+02CA should typically be a little narrower (I think). It really is a modifier letter intended to precede or follow a regular letter, usually indicating a tone or stress for a syllable (as an alternative to the acute actually placed over a letter in the same function). And U+0301 needs to be rendered over letters. Its exact placement will depend on the width and height of the letter it is placed over. Of course, your mileage may vary, depending on what you are trying to do with your font design. And John Hudson provided the technical details regarding what happens inside the font. And, briefly, what are the principal differences between the three types of marks? Michael Everson answered this one in terms of functionality. --Ken
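The functional split between the three is also recorded in their General Categories (Sk spacing symbol, Lm modifier letter, Mn nonspacing mark), which a layout engine can key off of; a quick Python check, illustrative only and nothing font-specific:

```python
import unicodedata

# The three acute-accent characters carry three different General
# Categories: Sk (symbol), Lm (modifier letter), Mn (nonspacing mark).
for cp in (0x00B4, 0x02CA, 0x0301):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```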
What Unicode Is (was RE: Inappropriate Proposals FAQ)
Suzanne responded: Maybe Unicode is more of a shared set of rules that apply to low level data structures surrounding text and its algorithms than a character set. Sounds like the start of a philosophical debate. If Unicode is described as a set of rules, we'll be in a world of hurt. (On a serious note, these exceptions are exactly what make writing some sort of is and isn't FAQ pretty darned hard.) Hmm. Since the discussion which started out trying to specify a few examples of what kinds of entities would be inappropriate to proffer for encoding as Unicode characters seems to be in danger of mutating into the recurrent What is Unicode? question, perhaps it's time to start a new thread for the latter. And now for some ontological ground rules. When trying to decide what a thing is, it helps not to use an attribute nominatively, since that encourages people to privately visualize the noun the attribute is applied to, but to do so in different ways -- and then to argue past each other because they are, in the end, talking about different things. Unicode is used attributively of a number of things, and if we are going to start arguing/discussing what it is, it would be better to lay out the alternatives a little more specifically first. 1. The Unicode *Consortium* is a standardization organization. It started out with a charter to produce a single standard, but along the way has expanded that charter, in response to the desire of its membership. In addition to The Unicode Standard, it now has adopted a terminology that refers to some of its other publications as Unicode Technical Standards [UTS], of which two formally exist now: UTS #6 SCSU, and UTS #10 Unicode Collation Algorithm [UCA]. It is important to keep this straight, because some people, when they say Unicode are talking about the *organization*, rather than the Unicode Standard per se.
And when people talk about the standard, they are generally referring to The Unicode Standard, but the Unicode Consortium is actually responsible for several standards. 2. The Unicode *Standard* itself is a very complex standard, consisting of many pieces now. To keep track of just what something like The Unicode Standard, Version 3.2 means, we now have to keep web pages enumerating all the parts exactly -- like components in an assemble-your-own-furniture kit. See: http://www.unicode.org/unicode/standard/versions/ In any one particular version, the Unicode Standard now consists of a book publication, some number of web publications (referred to as Unicode Standard Annexes [UAX]), and a large number of contributory data files -- some normative and some informative, some data and some documentation. These definitions, including the exact list of contributory data files and their versions, are themselves under tight control by the Unicode Technical Committee, as they constitute the very *definition* of the Unicode Standard. It is not by accident that the version definitions start off now with the following wording: The Unicode Standard, Version 3.2.0 is defined by the following list... and so on for earlier versions. 3. The Unicode *Book* is a periodic publication, constituting the central document for any given version of the Unicode *Standard*, but is by no means the entire standard. The book, in turn, is very complex, consisting of many chapters and parts, some of which constitute tightly controlled, normative specification, and some of which is informative, editorial content. The book now also exists in an online version (pdf files): http://www.unicode.org/unicode/uni2book/u2.html which is *almost* identical to the published hardcover book, but not quite. (The Introduction is slightly restructured, the online glossary is restructured and has been added to, the charts are constructed slightly differently and have introductory pages of their own, etc.) 4. 
The Unicode *CCS* [coded character set] is the mapping of the set of abstract characters contained in the Unicode repertoire (at any given version) to a bunch of code points in the Unicode codespace (0x0..0x10FFFF). Technically speaking, it is the Unicode *CCS* which is synchronized closely with ISO/IEC 10646, rather than the Unicode *Standard*. 10646 and the Unicode CCS have exactly the same coded characters (at various key synchronization points in their joint publication histories), but the *text* of the ISO/IEC 10646 standard doesn't look anything like the *text* of the Unicode Standard, and the Unicode Standard [sensum #2 above] contains all kinds of material, both textual and data, that goes far beyond the scope of 10646. There are other standards produced by some national bodies that are effectively just translations of 10646 (GB 13000 in China, JIS X 0221 in Japan), but the Unicode Standard is nothing like those. Finally, the attribute Unicode ... can be applied to all kinds of other things characteristic of the Unicode Standard, including algorithms for the
Re: Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ
Barry Caplan wrote: At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote: Unicode is a character set. Period. Each character has numerous properties in Unicode, whereas they generally don't in legacy character sets. Each character, or some characters? For all intents and purposes, each character. So, each character has at least one attribute. Yes. The implications of the Unicode Character Database include the determination that the UTC has normatively assigned properties (multiple) to all Unicode encoded characters. Actually, it is a little more subtle than that. There are some properties which accrue to code points. The General Category and the Bidirectional Category are good examples, since they constitute enumerated partitions of the entire codespace, and API's need to return meaningful values for any code point, including unassigned ones. Other properties accrue more directly to characters, per se. They attach to the abstract character, and get associated with a code point more indirectly by virtue of the encoding of that character. The numeric value of a character would be a good example of this. No one expects an unassigned code point or an assigned dingbat character or a left bracket to have a numeric value property (except perhaps a future generation of Unicabbalists). There are no corresponding features in other character sets usually. Correct. Before the development of the Unicode Standard, character encoding committees tended to leave such property assignments either up to implementations (considering them obvious) or up to standardization committees whose charter was character processing -- e.g. SC22/WG15 POSIX in the ISO context. The development of a Universal character encoding necessitated changing that, bringing character property development and standardization under the same roof as character encoding. Note that not everyone agrees about that, however.
We are still having some rather vigorous disagreements in SC22 about who owns the problem of standardization of character properties. A common definition of character set is a list of characters you are interested in, assigned to code points. That fits most legacy character sets pretty well, but Unicode is sooo much more than that. Roughly the distinction I was drawing between the Unicode CCS and the Unicode Standard. But what if we took a look at it from a different point of view, that the standard is an agreed-upon set of rules and building blocks for text oriented algorithms? Would people start to publish algorithms that extend on the base data provided so we don't have to reinvent wheels all the time? Well the Unicode Standard isn't that, although it contains both formal and informal algorithms for accomplishing various tasks with text, and even more general guidelines for how to do things. The members of the Unicode Technical Committee are always casting about for areas of Unicode implementation behavior where commonly defined, public algorithms would be mutually beneficial for everyone's implementations and would assist general interoperability with Unicode data. To date, it seems to me that the members, as well as other participants in the larger effort of implementing the Unicode Standard, have been rather generous in contributing time and brainpower to this development of public algorithms. The fact that ICU is an Open Source development effort is enormously helpful in this regard. If I were to stand in front of a college comp sci class, where the future is all ahead of the students, what proportion of time would I want to invest in how much they knew about legacy encodings versus how much I could inspire them to build from and extend what Unicode provides them? This problem, of Unicode in the computer science curriculum, intrigues me -- and I don't think it has received enough attention on this list.
One of my concerns is that even now it seems to be that CS curricula not only don't teach enough about Unicode -- they basically don't teach much about characters, or text handling, or anything in the field of internationalization. It just isn't an area that people get Ph.D.'s in or do research in, and it tends to get overlooked in people's education until they go out, get a job in industry and discover that in the *real* world of software development, they have to learn about that stuff to make software work in real products. (Just like they have to do a lot of seat-of-the-pants learning about a lot of other topics: building, maintaining, and bug-fixing for large, legacy systems; software life cycle; large team cooperative development process; backwards compatibility -- almost nothing is really built from scratch!) The major work ahead is no longer in the context of building a character standard. Time is fast approaching to decide to keep it small and apply a bit of polish, or focus on the use and usage of what is already there in Unicode by those who
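Returning to the earlier point about properties that accrue to code points versus properties that accrue to characters, the difference is easy to observe with Python's `unicodedata` module (U+0378 is chosen only because it is unassigned as of current Unicode versions):

```python
import unicodedata

# General Category partitions the whole codespace: an unassigned code
# point still gets a meaningful answer, 'Cn'.
print(unicodedata.category("\u0378"))  # Cn

# Numeric value, by contrast, attaches to particular characters; a left
# bracket simply has none, and asking for it is an error.
print(unicodedata.numeric("5"))        # 5.0
try:
    unicodedata.numeric("[")
except ValueError:
    print("'[' has no numeric value")
```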
RE: Saying characters out loud (derives from hash, pound,octothorpe?)
Joe sent around a classic version of Waka waka bang splat, but my favorite is a slightly pared-down version set to music for a four-part round, lyrics by Fred Bremmer and Steve Kroese, music by Melissa D. Binde: http://www.roundsing.org/music/waka-waka.html where you can listen to it in its multipart beauty. roundsing.org has other classics such as: I eat my peas with honey, I've done it all my life. It makes the peas taste funny But it keeps them on my knife. --Ken
Re: *Why* are precomposed characters required for backward compatibility?
Dan Oscarsson said: NFD should not be an extension of ASCII. There are several spacing accents in ASCII that should be decomposed just like the spacing accents in ISO 8859-1 are decomposed. All or none spacing accents should be decomposed. In addition to the usage clarifications made by John Cowan and David Hopwood, I should point out a little history here. As of Unicode 2.0, some compatibility decompositions were still provided for U+005E CIRCUMFLEX ACCENT, U+005F LOW LINE, and U+0060 GRAVE ACCENT, along the lines suggested by Dan. However, when normalization forms were being established and standardized in the Unicode 3.0 time frame, it became obvious that these particular compatibility decompositions would lead to trouble. Any Unicode normalization form that would not leave ASCII values unchanged would have been DOA (dead on arrival), because of its potential impact on widely used syntax characters in countless formal syntaxes. The equating of U+005F LOW LINE with a combining low line applied to a SPACE was particularly problematical, since LOW LINE is so widely accepted as an element of identifiers. Because of these complications, the 3 compatibility decompositions were withdrawn by the UTC (unanimously, if I recall correctly), *before* the normalization forms were finally standardized. Consistency in treatment would be nice, but consistency in treatment of the multiply ambiguous ASCII characters of this ilk is impossible at this point. And it would have been very, very, very bad for normalization to have allowed these three, in particular, to have decompositions. --Ken
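The resulting ASCII stability is easy to verify mechanically; a small Python sketch over all 128 ASCII code points:

```python
import unicodedata

# Every ASCII character must come through every normalization form
# unchanged -- the property the withdrawn decompositions for U+005E,
# U+005F, and U+0060 would have broken.
ascii_chars = [chr(c) for c in range(128)]
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert all(unicodedata.normalize(form, ch) == ch for ch in ascii_chars)
print("ASCII is invariant under all four normalization forms")
```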
Re: Variant selectors in Mongolian
Martin Heijdra asked: The statement For example, in languages employing the Mongolian script, sometimes a specific variant range of glyphs is needed for a specific textual purpose for which the range of generic glyphs is considered inappropriate could be taken to mean this solution. Correct. However, the Mongolian table is very glyph-based, and says The valid combinations are exhaustively listed and described in the following table. It seems to imply that medial dotted n is ALWAYS denoted by n-/ (as is undotted initial n). That is, regular ana (dotted) would be a-n-/-a, regular anda would be a-n-d-a (undotted), irregular aNa would be encoded a-n-a (undotted), and irregular aNda (dotted) would be a-n-/-d-a. That is, there would be regular formations marked with the variant selector, and irregular ones unmarked. No, I don't think that is the intent for Mongolian. Which of the two cases is meant by Unicode? Mongolian variants *are* very confusing, and I'm not sure what the best way to describe them is. Part of the problem is that there is still some tension in the UTC regarding just how to define the effect of the variation selectors. Position A: A variation selector selects a particular, defined glyph. That position would, for Mongolian, tend to support your second interpretation. However, ... Position B: A variation selector selects a variant form of a character, which has a distinct rendering from that specified for the character without a variant specification. When applied to Mongolian (or in principle any script like Mongolian), where a character is subject to positional shaping rules, you have to consider that character X is associated with, for example, a *set* of glyphs X - {G1, G2, G3, G4} depending on positional contexts. A variant of character X might be associated with a variant *set* of glyphs, some of which could overlap, e.g. X-/ -- {G1, G2', G3', G4}, so that the glyphs for the variant might not contrast in all positional (or other) contexts.
The reason the variation selectors were encoded in the first place for Mongolian, I believe, was to try to preserve an Arabic-like model, where the base character would get a character encoding, and it would then be mapped to positionally determined glyphs. But exceptional patterns of that positional determination required some method of marking. The alternative which people saw would have been to just encode all the glyphs: G1, G2, G2', G3, G3', G4, in the above example -- and that approach would have radically departed from the model of how Unicode should encode text. It also would have significantly further complicated Mongolian text processing, it seems to me, since distinct letters, in some positions, have glyphic neutralizations. (Not that it is easy, anyhow!) --Ken
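For reference, the machinery itself is ordinary character coding: the free variation selectors are encoded characters in their own right (nonspacing marks), and a variant request is just the base letter followed by one of them. A small Python sketch (the choice of U+1822 MONGOLIAN LETTER I is arbitrary, purely for illustration):

```python
import unicodedata

# The three Mongolian free variation selectors, U+180B..U+180D.
for cp in (0x180B, 0x180C, 0x180D):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")

# A variant form is requested in plain text simply by following the base
# letter with an FVS; the font's shaping rules do the rest.
variant = "\u1822\u180B"   # MONGOLIAN LETTER I + FVS1
print([f"U+{ord(c):04X}" for c in variant])  # ['U+1822', 'U+180B']
```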
Re: Definition of character: Exegesis of SC2 nomenclature
Martin Kochanski waxed exuberantly: I mention this because Unicode is the opposite of Procrustean. There is no finer antidote to gloom and cynicism than leafing through the Unicode Standard. In what other computing book could you find a phrase such as In good Latvian typography? Or: The king's primary purpose was to bring Buddhism from India to Tibet ? and Character Most Resembling a Frog (this is left as an exercise for the reader). Telugu U+0C0A. But then, perhaps I had an unfair start. ;-) --Ken
Re: Variant selectors in Mongolian
John Hudson wrote: Mongolian variants *are* very confusing, and I'm not sure what the best way to describe them is. Part of the problem is that there is still some tension in the UTC regarding just how to define the effect of the variation selectors. Position A: A variation selector selects a particular, defined glyph. That position would, for Mongolian, tend to support your second interpretation. However, ... Position B: A variation selector selects a variant form of a character, which has a distinct rendering from that specified for the character without a variant specification. The inclusion of variant selectors in Unicode uncomfortably blurs the line between character processing and glyph processing. True enough. But they are an attempt to keep a finger in the dike of outright glyph encoding. If you think of the problem with Han variants, you can see that allowing those dike leaks to crumble the dike could result in a veritable inundation of the character encoding with essentially useless alternate forms that would only serve to further blur the line. Or to extend the metaphor, the ground beneath our feet would be so softened, we'd always be trudging around hip-deep in the mud for CJK. The only excuse I can think of for including glyph substitution triggers in plain text is if there are normative stylistic substitutions to be identified by an author as a regular aspect of the writing of a given script, i.e. Ken's Position A. If you are not going to specify what the variant is, what point is there to including the glyph substitution trigger in plain text, since you have no idea what the outcome is going to be in any given font? Actually, I think Position B is a coherent one for Mongolian. The outcome *is* specified -- it is just specified for particular positional contexts, rather than for a single glyph per se. X - {G1, G2, G3, G4}, where Gn is determined by positional (or other) context. 
X-/ - {G1, G2', G3', G4}, where Gn is determined by positional (or other) context. is still determinate, and not contingent on fonts. (Although, of course, if you use fonts that don't have the glyphs G1, G2, G2', G3, G3', G4, or software that can't do the mapping correctly, you are hosed.) It is just more complicated, but fully as determinate as: X - G X-/ - G' The value of the variant selector to the user is in knowing what the result is going to be, and this means that the variant form *must* be specified. It is. See above. How else can the variant selector be used to *select* a particular form? Selection implies a deliberate choice, not a willingness to accept any substitution a font might provide. I agree. Although variation selectors also imply willingness to accept fallback to default glyphs as legible alternatives, if not the desired alternatives. --Ken
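Ken's Position B can be made concrete with a small sketch. The glyph names, position labels, and tables below are invented for illustration (they are not real Mongolian shaping data); the point is only that a variation selector can select a variant glyph *set*, determinate per positional context, without naming any single glyph:

```python
# Hypothetical sketch: a character X maps to a *set* of positionally
# determined glyphs; a variation selector swaps in a variant set that
# overlaps the default set in some positions (Position B above).
DEFAULT = {"isolated": "G1", "initial": "G2", "medial": "G3", "final": "G4"}
VARIANT = {"isolated": "G1", "initial": "G2'", "medial": "G3'", "final": "G4"}

def select_glyph(position, variant_selected=False):
    """Return the glyph for character X in a given positional context."""
    table = VARIANT if variant_selected else DEFAULT
    return table[position]

# The variant contrasts only in some positions...
assert select_glyph("medial") == "G3"
assert select_glyph("medial", variant_selected=True) == "G3'"
# ...and neutralizes with the default glyph in others.
assert select_glyph("final") == select_glyph("final", variant_selected=True)
```

The outcome is fully determinate, just conditioned on position rather than being a single glyph substitution.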
Strange resemblances and weird sisters
Then there is the oft-cited Character Most Resembling a Line Break: MALAYALAM LETTER UU (U+0D0A) Then in Extension B there are many, many weird and wonderful candidates for strangest CJK characters. Some of my personal favorites include: U+26B99 U+20137 U+20572 U+2069C U+2696E With such genetic defects, one would have expected such characters to die out long ago, but Unicode has brought them back to life. And of course, there is always the miraculous proliferation of turtles... ;-) --Ken
RE: Phaistos in ConScript
Michael, Ken. Thanks for your response. Hmm. I think I detect the invisible ironic smiley there. Thanks for broadcasting my private, poke-in-the-ribs response to you and Marco back to the public list. ;-) As I said, the original might (assuming a syllabic structure and assigning random syllable values) well be LABUGIDANO, but when reversed it might read NODAGIBULA which could be a valid linguistic sequence. OK, so reading the whole text you would come up with readings which wouldn't make sense, so you would have to start over with a different directionality. Given the practice of the other scripts in the region, I consider this unlikely given its impracticality. True enough. The people who used scripts with multiple directionalities did reverse the glyphs when reversing the directionality. The inherent directionality of Phoenician BETH or of PLUMED-HEAD or of Egyptian WN (the bunny rabbit) lends itself to the use of such glyph-indicated directionality for text in general. I would not assume, additionally, that the Phaistos script would always be written on disks in spiral formatting. That too would be unlikely and impractical, would it not? Indeed. But what seems to be missing here is even demonstration that we are dealing with a general use script that might be written in other contexts. With only one instance -- and that written on a disk in spiral formatting -- how do you know? I think you may be sticking your neck out rather far (to the left) on this one. I am inclined to agree with Marco about the issue for presentation. Why should you innovate over Godart here in this *particular* instance, based on so little evidence. Because I suspect that Godart might well agree with me -- I don't imagine that he ever considered this aspect of text presentation. And because it makes sense given the context of other scripts in the region. You could be right, but then you could be wrong, too. So could Godart! 
He was describing the disk, not thinking about encoding and presenting it! I'm not saying what you are doing is unreasonable -- but it is not demonstrably uncontroversial. Well that's my opinion anyway. I suppose we could try to contact Godart and ask his opinion. Sounds like a good idea to me. It's not as though the CSUR is normative. True enough. And if you want to get into the fray with all the various and sundry decipherers of the disk, and teach them all to use mirrored glyphs in LTR representations of Phaistos material, then who's to stop you? And after all, there must be several orders of magnitude more instances of Phaistos characters in the secondary literature by now than there are in the primary corpus! --Ken
Re: Phaistos Disk
Michael, At 10:58 -0400 2002-07-05, Patrick Rourke wrote: There is also the question of what kind of text it represents: is it a prose text, is it a catalogue of items (the other Aegean scripts tempt one to suspect this), each item represented by an ideograph, etc.? Well if you look at it you find patterns and repetition in the phrases divided up. It is most likely an actual text. The script is probably syllabic, as syllabic scripts were common back then, So were epic oral storytellers. the repertoire is large enough, and the repetitive markers could well represent grammatical prefixes or suffixes. One guesses, but that's not a bad guess. I'd consider it an equally good guess that the disk was a one-off story-telling memorization aid, sketching out an epic tale and its episodes mnemonically. The plumed head prefixes could equally well represent major protagonists in the tale. If the obviously recognizable pattern of PLUMED-HEAD SHIELD is a prefix (or suffix) in a set of words, then its distribution is fishy on the document -- it is ubiquitous on Side A, then starts the first word of Side B, but then vanishes. That defective distribution casts doubt on it as a common language affix, but does suggest a major actor in a long tale, who dies mid-story as the tale continues. One guesses, but that's not a bad guess. ;-) --Ken
Definition of character: Exegesis of SC2 nomenclature
One possibly interesting thing derived from the threads from hell is the notion that the definition of character offered in the various ISO JTC1/SC2 character encoding standards and TR's such as the Character-Glyph Model (TR 15285) may be leading people astray about what is appropriate to encode as a character. Here is an attempt at an exegesis. The standard SC2 definition of a character is: A member of a set of elements used for the organization, control, or representation of data. [Quoted from ISO/IEC 10646, Clause 4 Terms and definitions, but you can find the same definition in other SC2 standards, including each part of ISO/IEC 8859, and in ISO/IEC 2022.] The *reason* why SC2 chose such a strange and seemingly open-ended definition was *not* to invite arbitrarily strange collections of data control elements to be encoded as characters, but rather an attempt, in a procrustean way, to get the definition to fit the reality. In the ISO 2022 architectural framework for character encodings, specific character set definitions are declared as consisting of one or more sets of graphic characters (G0 and G1 sets) and one or more sets of control functions (C0 and C1 sets), where the graphic characters come from registered (graphic) character encodings and where the control functions come from registered control function sets. The graphic character encodings are the typical character encodings we are familiar with, of which ISO/IEC 8859-1 (Latin-1) is a prototypical example -- a bunch of visible letters, digits, punctuation, and symbols. The control function sets are small sets of functions designed for the manipulation and control of characters in various device contexts (mostly terminal hardware), and consist of things like line advance, moving the cursor back and forwards, indicating start and end of transmission context, marking string delimitations, and the like. 
The best known of these control function sets is defined in ISO 6429, and its C0 set is also grandfathered into ASCII as the familiar ASCII control codes -- the same codes that are listed in Unicode as aliases for U+0000..U+001F (U+0000 null, U+0001 start of heading, ... U+0008 backspace, U+0009 tab, ... etc.) Note that the control functions are not just any imaginable set of functions -- they are functions designed by people interested in controlling characters on existing classes of output display devices (terminals and teletypes, primarily). And not all terminal control functions were defined as control functions in these sets, either. Large classes of such functions were left up to vendor implementation, and made use of ESC(ape) sequences for their initiation. In the context of SC2 character encoding standards, a cover term for character was needed which was broad enough to deal with the existing, on the ground implementation fact that systems included graphic characters *and* control characters mixed in character data streams. The graphic characters were conceived of as representing the content of text, primarily. And the then-existing usage of control characters was primarily to organize and control the representation of such data, by establishing line breaks, page breaks, string or other text unit delimitations, backspacing, and the like. Hence the committee compromise definition of character quoted above. That definition should be understood in the context of this history, however. It is not legal license for intentional or unintentional misunderstandings of the appropriate scope of character encodings, which should be focussed on textual content, together with the minimal additional format control specification required for text organization. 
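The special status of the grandfathered C0 range is visible in the Unicode Character Database itself; as a small illustration using Python's unicodedata module:

```python
import unicodedata

# The C0 controls grandfathered from ASCII/ISO 6429: U+0000..U+001F.
c0 = [chr(cp) for cp in range(0x00, 0x20)]

# All carry General_Category Cc (control); they are not graphic characters.
assert all(unicodedata.category(ch) == "Cc" for ch in c0)

# None has a formal Unicode character name of its own (the old ASCII names
# such as "backspace" survive only as aliases), so name() raises ValueError.
try:
    unicodedata.name("\x08")
except ValueError:
    pass
else:
    raise AssertionError("control codes have no character name")
```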
Modern text representational practice, in a world that has mostly abandoned character terminal display to niche and legacy uses, and which instead uses graphic displays and image models, combined with rasterizing of outline fonts for textual display, has essentially made most of the ISO 6429 control functions obsolete. The Unicode Standard only specifies the few control functions that have survived into modern plain text handling conventions: CR, LF, FF, and tabs, among them. On the other hand, the Unicode plain text model has necessitated the addition of new format control characters that were not envisioned in the terminal control function sets, or which were organized differently for them. A good case in point are the various Unicode bidi control format characters, which are used for the bidirectional algorithm to override default implicit bidi ordering for various edge cases. Those differ from the bidirectional formatting control functions which were earlier designed for use on designated character terminals, with fixed-size cells and fixed line widths, for laying out visual order bidi text legibly via control of cursor position and direction when fed a serial byte stream to be laid out. Note that in any case, the old control functions (aimed at serial output devices) and the new
Re: *Why* are precomposed characters required for backward compatibility?
David Hopwood wrote: Marco Cimarosti wrote: BTW, they always sold me that precomposed accented letters exist in Unicode only because of backward compatibility with existing standards. I don't get that argument. It is not difficult to round-trip convert between NFD and a non-Unicode standard that uses precomposed characters. Round-trip convertibility of strings does not imply round-trip convertibility of individual characters, and I don't see why the latter would be necessary. Because while it is conceptually not difficult to roundtrip convert between legacy accented Latin characters and Unicode NFD combining character sequences, in practice many Unicode implementations would never have gotten off the ground if they had had to start with combining character sequences for all Latin letters, including, in particular, the 8859 repertoires. And the character mapping tables are considerably more complex, in practice, if they must map 1-n, n-1, rather than 1-1. Right now, a Latin-1 to Unicode mapping table is trivial, but if Latin-1 had not been covered with a set of precomposed characters, the mapping would *not* have been trivial, and that would have been a significant barrier to early Unicode adoption. And people would *still* be complaining -- vigorously -- about the performance hit and maintenance complexity of interoperating with 8859 and common PC code pages. The only difficulty would have been if a pre-existing standard had supported both precomposed and decomposed encodings of the same combining character. I don't think there are any such standards (other than Unicode as it is now), are there? Not to my knowledge. (Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1. That wouldn't have been much of a loss; it would still have been an extension of US-ASCII.) If this compatibility issue didn't exist, Unicode would be like NFD. And would have been much simpler and better for it, IMHO. 
It would have been better, in some respects, to treat Latin like the complex script it is, and to end up with the same kind of clean, by-the-principles encoding that Unicode has for Devanagari, essentially free of equivalences and normalization difficulties. But it took years for major platforms to get up to speed on complex script rendering, including the relatively simple but elusive prospect of dynamic application of diacritics to Latin letters (and/or mapping of combining character sequences to preformed complex glyphs). And despite the vigorous advocacy by some factions of early Unicoders to have a consistent, decomposed Latin representation in Unicode, there were some rather hard-headed decisions made early on (1989) that that approach would cripple what was then an experimental encoding. The inclusion of large numbers of precomposed Latin letters as encoded characters was the price for the participation of IBM, Microsoft, and the Unix vendors, and was also the price for the possibility of alignment of Unicode with an ISO international standard. Without paying those prices, Unicode would not exist today, in my opinion. --Ken
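The round-trip argument is easy to demonstrate with Python's unicodedata module (a sketch, using A-grave as a stand-in for the whole Latin-1 repertoire):

```python
import unicodedata

# A Latin-1 precomposed letter and its NFD combining character sequence.
a_grave = "\u00C0"                      # LATIN CAPITAL LETTER A WITH GRAVE
decomposed = unicodedata.normalize("NFD", a_grave)
assert decomposed == "\u0041\u0300"     # A + COMBINING GRAVE ACCENT

# NFC recomposes the sequence back to the single precomposed character,
# which is what keeps the Latin-1 <-> Unicode mapping a trivial 1-1 table.
assert unicodedata.normalize("NFC", decomposed) == a_grave
assert a_grave.encode("latin-1") == b"\xC0"
```

String-level round-tripping works either way; it is the 1-1 versus 1-n table complexity, not convertibility per se, that made precomposed characters the pragmatic choice.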
Re: Ending the Overington [debate]
David Hopwood responded to Michael Everson: people just keep saying that markup exists, as if the very existence of XML in some way precludes single code point colour codes and single code point formatting codes and so on. Yes, that is right. That is entirely right. No it isn't. Duplicating functionality between character encoding and markup is just a Bad Thing (usually). Agreed. And that is part of the reason for the existence of a Unicode Technical Report (and W3C Note) which tries to set guidelines on what is and is not appropriate to use in the context of markup. For those who haven't seen it, UTR #20: Unicode in XML and other Markup Languages: http://www.unicode.org/unicode/reports/tr20/ It is certainly not excluded a priori - as demonstrated by the interlinear annotation markers, stateful BiDi controls, and plane 14 language tags. Correct. But just because the line between plain text content and the kind of formatting or other presentational and/or annotational material is often difficult to firmly draw doesn't mean that we have open season to simply dump anything we want into character codes. On color, for example, there is clear consensus that encoding color by characters is way, way over the line into the kind of stuff which should be handled by markup (as for setting the text color on hyperlinks) or even by out-and-out graphics (as for display text elements). The existence of XML (or other markup languages) does not, ipso facto, preclude the character encoding committees from encoding single code point colour codes. Rather, the consensus among character encoding committees that text color is better handled by other layers of text (and non-text) presentation and is inappropriate for encoding as characters precludes them from making what would be utterly controversial and nonconsensual encoding decisions. 
I see that no-one in this thread has even attempted to explain why duplication of functionality across layers is a bad idea, or to discuss what alternative models would have been possible besides plain text + {HTML,SGML,XML,TeX}-style markup languages. I'll try to do that in another post. I'm looking forward to it. --Ken
Re: Multiple encodings for 1 character
Theodore wrote: What is going to be done about the confusion generated from having multiple ways to encode the same character? For example, for filenames, OSX will encode an accented Roman letter one way, while for filenames Windows will encode it the other way. These kind of confusions are totally expected, if Unicode will allow more than one way to encode the same character. Perhaps a stray newsfeed routed via Alpha Centauri? This is *very* old news, indeed. This means that matching algorithms won't work, because the characters are different! Will there be some kind of recommendation of which to avoid? Will the Unicode consortium make a standard to say that one of these encodings is strongly not recommended, and in fact depreciated? UAX #15: Unicode Normalization Forms http://www.unicode.org/unicode/reports/tr15/ And it is up to an implementation to specify which normalization form it uses. By the way, we don't depreciate Unicode encodings -- we appreciate them. ;-) And what about the OS that uses this encoding? How will the Unicode consortium make the newly-offending OS change its ways? It isn't offending, and the Unicode Consortium won't. --Ken
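The matching problem and its normalization cure can be shown in a few lines (a sketch; the filename is invented, and real macOS uses a variant of NFD for HFS+ filenames):

```python
import unicodedata

# The same filename as a decomposed (macOS-style) and a precomposed
# (Windows-style) string. Byte-for-byte comparison fails...
mac_style = "re\u0301sume\u0301.txt"    # e + COMBINING ACUTE ACCENT
win_style = "r\u00E9sum\u00E9.txt"      # precomposed e-acute
assert mac_style != win_style

# ...but comparing both under a single normalization form succeeds,
# which is exactly what UAX #15 is for.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(mac_style) == nfc(win_style)
```

Which form an implementation picks matters less than picking one consistently before comparing.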
Re: What's the difference between a composite and a combining sequence?
Theodore, http://www.unicode.org/unicode/reports/tr15/ mentions both composites and combining sequences. But it doesn't tell us the difference. I know what a combining sequence is. If I didn't know what a composite was, I'd guess it was the same thing as a combining sequence. See TUS 3.0, Chapter 3, pp. 43-44 D17 Combining character sequence: a character sequence consisting of either a base character followed by a sequence of one or more combining characters, or a sequence of one or more combining characters. [e.g. A + combining-grave U+0041, U+0300] D18 Decomposable character: a character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the names list... It may also be known as a precomposed character or composite character. [e.g. A-grave, U+00C0] --Ken
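Definitions D17 and D18 correspond directly to properties queryable from the Unicode Character Database; a small illustration with Python's unicodedata:

```python
import unicodedata

# D18: a decomposable ("precomposed", "composite") character carries a
# decomposition mapping in UnicodeData.txt.
assert unicodedata.decomposition("\u00C0") == "0041 0300"   # A-grave -> A + grave

# D17: a combining character sequence is a base character followed by
# combining characters; combining marks have a non-zero combining class.
assert unicodedata.combining("\u0300") != 0   # COMBINING GRAVE ACCENT
assert unicodedata.combining("A") == 0        # base character
```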
Re: FW: Inappropriate Proposals FAQ
Suzanne, Can people from the review committee give me some hard and fast rules for when something is thrown out? As Michael Everson indicated, the answer to this is probably not. However, perhaps the most important thing for serious script proposers to do, to see if what they are concerned about might be acceptable, is to consult the Roadmap: http://www.unicode.org/roadmaps/ If a script is listed there in the Roadmap for the BMP or for Plane 1, then people can be assured that interested members of the encoding committees have *already* made a tentative determination that the script is suitable for encoding, although a proposal may not actually exist yet, and of course, there are no guarantees until the committees actually do the work on fully filled-out formal proposals. But if a script, like the MIIB BurgerKing cipher mentioned today, or chess diagram notation, is missing from the Roadmap, there is probably a *good* reason for it not to be there, and people should think twice (and then again) before they start proposing it for encoding in Unicode. --Ken Another missing example: The voice which shook the earth, from Chapter IV, verse 44 of LIBER LIBERI vel LAPIDIS LAZULI ADUMBRATIO KABBALÆ ÆGYPTIORUM, one of the Holy Books of Thelema: http://www.nuit.org/thelema/Library/HolyBooks/LibVII.html Disclaimer: The UTC New Scripts committee does not discriminate among script applicants on the basis of race, color, gender, religion, sexual orientation, national or ethnic origin, age, disability, or veteran status. However, if they are risible, we reserve the right to laugh. ;-)
Re: (long) Re: Chromatic font research
[*groans in the audience*] I know, I know -- another contribution in the endless thread... In re: The Respectfully Experiment I used it as evidence that ideas about what should not be included in Unicode can change over a period of time as new scientific evidence is discovered. Having been intimately involved in nearly all the decisions made about what was included in Unicode over the last 13 years, and also being formally trained as a scientist, I think I may be qualified to dispute this conclusion. Most of the change in ideas about what can be included in Unicode has been the result of two types of influence: A. The encountering of legacy practice in preexisting character encodings which had to be accommodated for interoperability reasons. This accounts for many, if not all of the hinky little edge cases where Unicode appears to depart from its general principles for how to encode characters. B. The development of new processing requirements that required special kinds of encoded characters. This accounted for strange animals such as the bidi format controls, the BOM, the object replacement character, and the like. There is a very narrow window of opportunity for *scientific* evidence contributing to this -- namely, the result of graphological analysis of previously poorly studied ancient or minority scripts, which conceivably could turn up some obscure new principle of writing systems that would require Unicode to consider adding a new type of character to accommodate it. But at this point, with Unicode having managed to encode everything from Arabic to Mongolian to Han to Khmer..., I consider it rather unlikely that scientific graphological study is going to turn up many new fundamental principles here. 
As a scientific *hypothesis* I think this surmise is proving to hold up rather well, as our premier encoder of historic and minority scripts, Michael Everson, has managed to successfully pull together encoding proposals, based on current principles in Unicode, for dozens more scripts, with little difficulty except for that inherent in extracting information about rather poorly documented writing systems. it just seems to me that some extra ligature characters in the U+FB.. block would be useful. Best practice, and near unanimous consensus in the Unicode Technical Committee and among the correspondents on this list, would be aligned with exactly the opposite opinion. In the light of this new evidence, I am wondering whether the decision not to encode any new ligatures in regular Unicode could possibly be looked at again. As others have pointed out, The Respectfully Experiment did not constitute new *evidence* of anything in this regard. In any case, the UTC is quite unlikely to look at that decision again. The exception that the UTC *has* considered recently was the Arabic bismillah ligature, and the reason for doing so again was the result of considering legacy practice. This thing exists in implemented character encodings as a single encoded character. And furthermore, it is used as a unitary symbol, in such a way that substituting out an actual (long) string of Arabic letters and expecting the software to ligate it correctly precisely in the contexts where it was being used as a symbol, would place an unnecessary burden on both users and on software implementations. That is *quite* different from the position that claims that one, two, or dozens more Latin ligatures of two letters need to be given standard Unicode encodings. if it cannot be done or would cause great anguish and arguments, well, that is that, forget it. Good idea. --Ken
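The existing ligature characters in the U+FB00 block illustrate why new ones are not wanted: they exist only for legacy round-tripping, and normalization folds them back to plain letter pairs. A short demonstration with Python's unicodedata:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI has a compatibility decomposition to
# the plain letter pair; NFKC folds it back to f + i, which is why
# ligation is treated as a font/rendering matter, not a content one.
fi = "\uFB01"
assert unicodedata.decomposition(fi) == "<compat> 0066 0069"
assert unicodedata.normalize("NFKC", fi) == "fi"
```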
Re: ZWJ and Latin Ligatures (was Re: (long) Re: Chromatic font research)
James Kass said: One problem with TR28 is that it is worded so that it appears to be in addition to earlier guidelines. It is. The way this works is as follows: The original decision about the ZWJ as request for ligation was documented in the Unicode 3.0.1 update notice. That documentation was rolled forward into UAX #27 (Unicode 3.1), where it was explicitly cast as text to replace the Unicode 3.0 text on p. 318 re Controlling Ligatures, including an update of the example table. The additional text in UAX #28 is just that -- an *addition* to the Unicode 3.1 text, not a replacement for it. This will all become more apparent when we can finally publish Unicode 4.0, which will roll all of the textual additions, once again, into a single published document. This implies that the examples used in TR27, for one, are still valid. They are. In TR27, font developers are urged to add things like f+ZWJ+i to existing tables where f+i is already present. That recommendation still stands -- and, as John pointed out, is being implemented by vendors. Another problem with TR28 is that its date is earlier than the date on TR27. This suggests that TR27 is more current. I don't understand this claim. The date on UAX #27 is: 2001-05-16 The date on UAX #28 is: 2002-03-07 Please check that you are referring to the most recent (and only valid) versions of each. Otherwise, regarding the substance of this thread, I find myself in violent agreement with John, who it seems to me is quite ably stating the case for the current treatment as decided by the UTC. --Ken
Re: Chromatic font research
Philipp said: The most obvious and simple example for glyph colours with semantic meaning that I can think of appears to be encoding characters for national flags (something that might even be considered proposable). As *characters*? Why? What is this bug that people catch, which induces them to consider all things semiological to be, ipso facto, abstract characters suitable for encoding in Unicode? There are signs that are not characters. There are symbols that are not characters. There are icons that are not characters. There are significant gestures that are not characters. There are meaningful looks that are not characters. There are color significances that are not characters. There are pregnant pauses that are not characters... And I'm quite positive that Aztec can safely be considered writing... Aztec is clearly a language. Whether or not the Aztec codices are appropriate to represent in plain text remains to be seen. As yet, we have no proposal, let alone one which addresses the potential problems in detail. --Ken
Re: Hexadecimal characters.
At 03:03 AM 6/20/02 -0400, Tom Finch wrote: I wish to propose sixteen consecutive digits for the purpose of displaying hexadecimal values. [...] Has this been considered? [David Starner] I seem to recall that it has. The problem is, they're just new copies of old characters. An A used in hexadecimal notation is just an A. Besides the problem with normalization, you have the problem with all look-alike characters - people won't use them consistently. Even if this got adopted, 99% of time you looked at hexadecimal numbers, they would be in plain old ASCII, so you don't really gain anything but confusion. It's a no-go. [Tom Finch] I looked at the code chart and there are many 16 character sequences empty. That is true enough -- but the more appropriate place to look is the BMP roadmap: http://www.unicode.org/roadmaps/bmp-3-6.html where you can see that many of those empty columns are already accounted for by roadmapped allocations for living minority scripts. The BMP is rather tight now for allocation, and it is unlikely that the committees are going to look kindly on miscellaneous collections of dubious stuff for encoding there. Of course there is plenty of space in Plane 1 for just about everything, but... That said, David Starner has this one right. There really is no good reason to create clones of 0..9, A..F to represent hexadecimal digits. The existing characters do that just fine, and represent an overwhelming legacy data representation precedent that any proposal such as Tom Finch's would have to cope with. Introducing new characters for these would just introduce confusion and would be unlikely to be implemented in any useful way. --Ken
Re: Chess symbols, ZWJ, Opentype and holly type ornaments.
In view of the fact that some people are unwilling to let my ideas be discussed in this forum upon their academic merit but simply use an ad hominem attack almost every time I post (before many people can have the chance to sit down and, if they wish, have a serious read of my ideas), when it seems that their objection is really about the Unicode Consortium having included the word published in section 13.5 of chapter 13 of the Unicode specification, ... Speaking here as an editor of the Unicode Standard, I do not find the word published in section 13.5 of the book. Perhaps William was thinking of the subheader Promotion of Private-Use Characters. Since -- despite the explicit text that follows in that section -- some people seem to be getting the wrong idea about private-use character assignments as a step towards standardization, it is quite likely that the editorial committee will be rewriting that section for Unicode 4.0, to provide further clarification for users. I feel that the fact that I am trying to use the Unicode specification as it exists rather than on some nudge nudge wink wink understanding of how some people feel that it should be interpreted is at the root of the problem. If parts of the Unicode Standard are unclear and are leading to misinterpretations or incompatible interpretations of how characters should be used -- including private-use agreements for private-use characters, then airing those issues is certainly germane to this discussion list. I think what a number of people on the list have been hinting -- or openly stating -- is that prolixity is not a virtue on an email list when trying to convey one's ideas. --Ken
Re: Hexadecimal characters.
Tom Finch said: Hmm, so representing Devanagari digits is more important than hexadecimal, which is used almost more than decimal on the web?

I think you may be misconstruing the purpose of the character encoding here. If I want to represent the hexadecimal numbers 0x60DB 0x618A in email, or in HTML hexadecimal NCRs, or whatever, guess what -- I can use ASCII (or Latin-1 or Unicode) characters: 6 0 D B 6 1 8 A -- and that is what everyone does. It is also what is *required* by the HTML and XML standards for the representation of hexadecimal NCRs on the web, by the way. If I want to represent Devanagari digits, on the other hand, I don't have an ASCII representation to hand -- those *require* separate encoding, since Devanagari characters are not the same as Latin characters or Arabic digits. So Devanagari digits were encoded in Unicode. Simple.

I know inertia is a law of the universe, but this is ridiculous. Hexadecimal is very important and deserves to be in Plane 0.

Umm. It *is* in Plane 0: U+0030..U+0039, U+0041..U+0046 (and U+0061..U+0066), to be exact.

I see a good spot in misc technical (23D -- oh look, hexadecimal again).

Nobody has any quarrel with the notion that hexadecimal notation is very important in computer science -- and vital for character encoding discussions. The issue is whether we need any separate characters to represent hexadecimal digits, when the digits everybody has been using for decades are already encoded.

--Ken
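[For reference, a short Python sketch of the point about hexadecimal NCRs, using the two code points mentioned in the message (illustrative, not from the original exchange): the NCRs are themselves pure ASCII.]

```python
# HTML/XML numeric character references in hexadecimal form are written
# entirely with ASCII characters -- e.g. for U+60DB and U+618A:
for cp in (0x60DB, 0x618A):
    ncr = "&#x{:X};".format(cp)
    assert ncr.isascii()            # nothing beyond plain ASCII is needed
    print(ncr)                      # &#x60DB;  then  &#x618A;
```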
Re: Chess symbols, ZWJ, Opentype and holly type ornaments.
IOW, brevity's wit's soul.

Well-spoken, dear Polonius. But better to
Adorn the soul of wit so briefly put to us.

My liege, and madam, to expostulate
What majesty should be, what duty is,
Why day is day, night is night, and time is time,
Were nothing but to waste night, day, and time.
Therefore, since brevity is the soul of wit,
And tediousness the limbs and outward flourishes,
I will be brief. Your noble son is mad.

--the Bard
Re: Q: How many enumerated characters in Unicode?
Adam asked: How many characters does the current version of the Unicode Standard enumerate?

95,156.

BTW: I think this information would be useful if it were always included in the summary of each revision.

Agreed. The total was listed in Unicode 3.1 (94,140), and you could get the number for Unicode 3.2 by adding the 1,016 additions to that, but it was an oversight not to actually list the total in the text of Unicode 3.2.

--Ken
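[The arithmetic Ken describes, as a one-line check (illustrative only):]

```python
# Unicode 3.1 enumerated 94,140 characters; Unicode 3.2 added 1,016 more.
unicode_3_1_total = 94_140
additions_in_3_2 = 1_016
print(unicode_3_1_total + additions_in_3_2)  # 95156, the Unicode 3.2 total
```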
Re: Fixed position combining classes (Was: Combining class for Thaicharacters)
Peter,

On 06/02/2002 05:40:05 AM, Samphan Raruenrom wrote:

My opinion is that they should have been simplified, but that setting the bulk of them to 0 was a mistake and creates some significant problems (which go a step beyond the questions you raise here). Can you elaborate on this?

Given the characters:

0E35;THAI CHARACTER SARA II;Mn;0
0E39;THAI CHARACTER SARA UU;Mn;103

consider the sequences <0E35, 0E39> vs. <0E39, 0E35>. I'm guessing your first reaction will be to say that these cannot co-occur. That is true for the Thai language, but may not be true for other languages written with the Thai script.

The problem, of course, is that not all eventualities could be foreseen at the time the decisions had to be made -- when normalization and Unicode 3.0 were looming. It might have been possible to marginally improve on the assignments that eventually were made -- but both the original assignment to fixed position classes, and the later simplification of the fixed position classes, had to be made *prior* to the accumulation of experience based on normalization being locked down in the standard. So hindsight is 20/20. But at the time, the editors and participants in the UTC couldn't get experts to pay enough attention to the potential implications for Thai and other Southeast Asian scripts, so now we are stuck with a few anomalies that people will just have to program around, I am afraid.

Now, the problem with the sequences above is that they are visually indistinguishable, meaning that they could not possibly be used by users for a semantically relevant distinction. From the user's perspective, they are identical. Moreover, it would fit a user's expectations to have string comparisons equate them (e.g. a search for <0E35, 0E39> should find a match if the data contains <0E39, 0E35>). They are both canonically-ordered sequences, however, since U+0E35 has a combining class of 0.
The result is that string comparisons relying on normalisation into any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC) will fail to consider these as equal.

I think you are missing a point here. It is true that if you just take the two strings, normalize them, and then compare them binarily, they will compare unequal. But for most users' expectations of equivalent string comparison, simply comparing normalized strings byte for byte is insufficient anyway. There may be embedded (invisible) format control characters (ZWJ and its ilk) which should be ignored in comparison -- but a simple binary compare won't do that. The presence of a ZWSP might or might not be considered indicative of a string difference by a user, but it would definitely cause the strings to compare unequal without any corresponding visual difference. On the other hand, the presence of some types of visible punctuation might be considered insignificant by a user, and thus to be ignored, even though it causes a visual difference.

The ordinary way to deal with this is to enhance the comparisons, often in language-specific ways, to match user expectations of what should and should not compare equal under various circumstances. A commonly used technology for that is one form or another of collation tailoring for culturally expected string comparison. If such technology is being used to provide better results, there is no particular reason why the language-specific tailorings for it cannot also take into account the few anomalous cases resulting from canonical ordering of dependent vowels in Brahmi-derived scripts in Southeast Asia, so that, under those circumstances, <0E35, 0E39> vs. <0E39, 0E35> *would* compare equal.

IMO, it would be best if we could change that. But apart from that, it would still be useful to note what is right or wrong, rather than say nothing about it. After all, this happens to other (Indic) scripts too, right?

There are some similar problems in at least Lao, Khmer and Myanmar.
I don't recall for certain, but there may also be similar problems in Hebrew. Each of these cases is fairly limited and amenable to the same kinds of solutions, script by script and language by language. In any case, I think one is going to need some rather specific string comparison extensions to get Khmer and Myanmar string orderings and matchings to behave appropriately. And the people who need to make those extensions aren't going to be particularly misled by the few anomalous instances of above- or below-base vowel signs having zero combining classes, which make it technically possible to have non-canonically-equivalent spellings of visually similar combinations.

--Ken
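[The Thai anomaly discussed above can be verified with Python's standard unicodedata module -- a minimal sketch for illustration, not part of the original exchange:]

```python
import unicodedata

# U+0E35 SARA II has combining class 0, while U+0E39 SARA UU has class 103,
# so both orderings below already count as canonically ordered, and no
# normalization form will equate them -- even though they render identically.
a = "\u0E35\u0E39"
b = "\u0E39\u0E35"

assert unicodedata.combining("\u0E35") == 0
assert unicodedata.combining("\u0E39") == 103

for form in ("NFD", "NFC", "NFKD", "NFKC"):
    # The visually indistinguishable sequences still compare unequal.
    assert unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
print("not canonically equivalent")
```

Equating such sequences therefore has to happen in the comparison layer (e.g. a collation tailoring), exactly as Ken suggests.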
RE: How is UTF8, UTF16 and UTF32 encoded?
Rick Cameron asked: The Unicode Standard 2.0 had a table in Appendix A that is, I think, just what you're asking for. I can't find this table in the online version of TUS 3.0 (it's not very useful that the online index gives page numbers when there's no way to map a page number to the appropriate chapter!). Does anyone know whether this table (A-3 on page A-7) is available online somewhere?

Table A-3 from Unicode 2.0 moved into Chapter 3 in Unicode 3.0, since UTF-8 was itself formally incorporated into Unicode conformance at that point. See Table 3-1 on page 47 of Unicode 3.0. (Unfortunately, access to the table was not clearly indicated under the UTF-8 entry in the index to Unicode 3.0 -- an oversight that will definitely be fixed for Unicode 4.0.) You can find it online in Chapter 3 of the online text of Unicode 3.0 at: http://www.unicode.org/unicode/uni2book/u2.html

The surrounding text for Table 3-1 was modified for Unicode 3.1, so you can find the table online again in Unicode 3.1: http://www.unicode.org/unicode/reports/tr27/ (See Article III, Conformance, in that UAX.)

And finally, Unicode 3.1 added a subsidiary table of Legal UTF-8 Byte Sequences. That table was modified slightly for Unicode 3.2, so the most up-to-date version online can be found in Unicode 3.2: http://www.unicode.org/unicode/reports/tr28/ (See Article III, Conformance, in that UAX.)

--Ken
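[On the thread's subject line itself, the three encoding forms can be compared directly in Python -- an illustrative sketch, not from the original message:]

```python
# Encode one BMP character (U+00E9) and one supplementary character
# (U+10400) in each of the three Unicode encoding forms.
for ch in ("\u00E9", "\U00010400"):
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print("U+%04X" % ord(ch), enc, ch.encode(enc).hex(" "))
# U+10400 takes four bytes in UTF-8, a surrogate pair (D801 DC00) in
# UTF-16, and a single 32-bit code unit in UTF-32.
```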