Re: Corrigendum #1 (UTF-8 shortest form) wording: MIME, and software interfaces specifications

2003-11-07 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]>

> Philippe Verdy wrote (in rich text):
>
> > Due to that, an application needs to specify whether it will support
> > and comply with the full ISO/IEC 10646-1:2000 character set or only
> > with the Unicode subset.
>
> ISO/IEC 10646 has reduced its range to match Unicode's, so this
> distinction is obsolete.

It is not obsolete: Corrigendum #1 for UTF-8 (published in Unicode 4.0)
refers to ISO/IEC 10646-1:2000, not to ISO/IEC 10646:2003, which is the
edition whose character repertoire corresponds to Unicode 4.0...

So that's a reference error in the version of the now-normative Corrigendum
published in Unicode 4.0...

Does it need another Corrigendum to correct this reference in the
Corrigendum?

Well, I still doubt that ISO/IEC 10646 has reduced its character set. It has
merely agreed to limit its repertoire of _standardized_ and _interchangeable_
characters to the first 17 planes, so that _these_ characters remain in sync
with the Unicode repertoire and are encoded identically, with the same code
points. All the other planes are still present in ISO/IEC 10646, some of
them still allocated to PUAs that have no equivalents in Unicode. Sequences
encoding those planes remain valid within UTF-8 encoded data and still
conform to ISO/IEC 10646: even though they are illegal for use in Unicode
4.0, these sequences are not ill-formed in the way that non-shortest forms
are now forbidden in both standards.
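
To make the distinction concrete, here is a small Python sketch (my illustration, not part of the original mails) of how a modern UTF-8 decoder behaves: both the non-shortest forms forbidden by Corrigendum #1 and the old five-byte sequences of ISO/IEC 10646-1:2000 Annex D are rejected by Python's built-in codec, which implements the restricted (Unicode-range) definition of UTF-8.

```python
def try_decode(raw: bytes) -> str:
    """Return the decoded string, or the exception class name on failure."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return type(exc).__name__

# Non-shortest form of '/' (U+002F), forbidden by Corrigendum #1:
print(try_decode(b"\xc0\xaf"))              # UnicodeDecodeError

# Old-style 5-byte sequence (would encode a value beyond U+10FFFF):
print(try_decode(b"\xf8\x88\x80\x80\x80"))  # UnicodeDecodeError

# The shortest-form encoding of the same character is accepted:
print(try_decode(b"/"))                     # /
```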




Handy table of combining character classes

2003-11-07 Thread John Cowan
Here's a little table of the combining classes, showing the value, the
number of characters in the class, and a handy name (typically the one
used in the Unicode Standard, or a CODE POINT NAME if there is only one;
sometimes of my own invention).

Class   Count   Name
=====   =====   ====
0   589 Class Zero
1   16  Overlays
7   7   Nuktas
8   2   Japanese Sound Marks
9   16  Viramas
10  1   HEBREW POINT SHEVA
11  1   HEBREW POINT HATAF SEGOL
12  1   HEBREW POINT HATAF PATAH
13  1   HEBREW POINT HATAF QAMATS
14  1   HEBREW POINT HIRIQ
15  1   HEBREW POINT TSERE
16  1   HEBREW POINT SEGOL
17  1   HEBREW POINT PATAH
18  1   HEBREW POINT QAMATS
19  1   HEBREW POINT HOLAM
20  1   HEBREW POINT QUBUTS
21  1   HEBREW POINT DAGESH OR MAPIQ
22  1   HEBREW POINT METEG
23  1   HEBREW POINT RAFE
24  1   HEBREW POINT SHIN DOT
25  1   HEBREW POINT SIN DOT
26  1   HEBREW POINT JUDEO-SPANISH VARIKA
27  1   ARABIC FATHATAN
28  1   ARABIC DAMMATAN
29  1   ARABIC KASRATAN
30  1   ARABIC FATHA
31  1   ARABIC DAMMA
32  1   ARABIC KASRA
33  1   ARABIC SHADDA
34  1   ARABIC SUKUN
35  1   ARABIC LETTER SUPERSCRIPT ALEF
36  1   SYRIAC LETTER SUPERSCRIPT ALAPH
84  1   TELUGU LENGTH MARK
91  1   TELUGU AI LENGTH MARK
103 2   Thai Sara U/UU
107 4   Thai Tone Marks
118 2   Lao U/UU
122 4   Lao Tone Marks
129 1   TIBETAN VOWEL SIGN AA
130 6   Various Tibetan Vowels
132 1   TIBETAN VOWEL SIGN U
202 4   Below Attached
216 9   Above Right Attached
218 1   Below Left
220 81  Below
222 4   Below Right
224 2   Left
226 1   Right
228 3   Above Left
230 147 Above
232 3   Above Right
233 2   Double Below
234 4   Double Above
240 1   COMBINING GREEK YPOGEGRAMMENI
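
The class values in the table can be spot-checked with Python's `unicodedata` module (a sketch of mine, not from the original post; note that the per-class character counts above reflect Unicode 4.0, so a modern Python, which ships a much newer UCD, will report different counts):

```python
import unicodedata

# A few representative characters and their expected canonical
# combining classes, taken from the table above.
samples = {
    "\u3099": 8,    # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK (Japanese Sound Marks)
    "\u094D": 9,    # DEVANAGARI SIGN VIRAMA (Viramas)
    "\u05B0": 10,   # HEBREW POINT SHEVA
    "\u0345": 240,  # COMBINING GREEK YPOGEGRAMMENI
}
for ch, expected in samples.items():
    assert unicodedata.combining(ch) == expected, hex(ord(ch))
print("all sampled combining classes match the table")
```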

-- 
John Cowan   <[EMAIL PROTECTED]>   http://www.ccil.org/~cowan
"One time I called in to the central system and started working on a big
thick 'sed' and 'awk' heavy duty data bashing script.  One of the geologists
came by, looked over my shoulder and said 'Oh, that happens to me too.
Try hanging up and phoning in again.'"  --Beverly Erlebacher




Re: Corrigendum #1 (UTF-8 shortest form) wording: MIME, and software interfaces specifications

2003-11-07 Thread Doug Ewell
Philippe Verdy wrote (in rich text):

> Due to that, an application needs to specify whether it will support
> and comply with the full ISO/IEC 10646-1:2000 character set or only
> with the Unicode subset.

ISO/IEC 10646 has reduced its range to match Unicode's, so this
distinction is obsolete.

More later.  Maybe.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)

2003-11-07 Thread Doug Ewell
Andrew C. West  wrote:

> And given that most CJK fonts aim to cover both Chinese and Japanese
> characters, how would the square missing-ideograph glyph and the
> Japanese geta mark be differentiated? By means of variant selectors?

In the Windows world at least, most fonts that include any CJK
characters either:

(1) are clearly aimed at Chinese, like SimSun, or
(2) are clearly aimed at Japanese, like Mincho, or
(3) aim to cover as much of Unicode as possible, like Arial Unicode MS
and Code2000, and thus really can't be considered "CJK fonts" per se.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Corrigendum #1 (UTF-8 shortest form) wording: MIME, and software interfaces specifications

2003-11-07 Thread Philippe Verdy



I see this sentence in the last paragraph of Corrigendum #1:
  The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows
  for the use of five- and six-byte sequences to encode characters that
  are outside the range of the Unicode character set; those five- and
  six-byte sequences are illegal for the use of UTF-8 *AS A
  TRANSFORMATION OF _UNICODE_ CHARACTERS*. (...)
The global interpretation of this paragraph is thus to define Unicode as a
subset of ISO/IEC 10646-1:2000, for the first 17 planes, where Unicode and
ISO/IEC 10646-1:2000 will be fully interoperable. So it does NOT say that
the use of five- and six-byte sequences is illegal for the use of UTF-8
*AS A TRANSFORMATION OF _ISO/IEC 10646-1:2000_ CHARACTERS*.

Due to that, an application needs to specify whether it will support and
comply with the full ISO/IEC 10646-1:2000 character set or only with the
Unicode subset. As both standards specify "UTF-8" as the name of the
transformation, and the transformation is in fact defined in ISO/IEC
10646-1:2000, it seems that there's no restriction on UTF-8 sequence
lengths, just restrictions on their use to encode characters in the Unicode
subset.
 
This leaves open the opportunity to encode *non-Unicode* characters of
*ISO/IEC 10646-1:2000*, i.e. characters outside its first 17 planes, which
must not be interpreted as valid Unicode characters but can still be
interpreted as valid ISO/IEC 10646-1:2000 characters.
 
Then later, we have this final sentence:

  
  (...) ISO/IEC 10646 does not allow mapping of unpaired surrogates,
  nor U+FFFE and U+FFFF (but it does allow other noncharacters).
  
Here also there is a difference: noncharacters are explicitly said to be
*non-Unicode* characters (i.e. they must not be interpreted as valid
Unicode characters, not even as the replacement character), but they can
still be interpreted as valid ISO/IEC 10646-1:2000 characters if ISO/IEC
10646-1:2000 allows it (and it seems to allow it in UTF-8 transformed
strings). Here also an application will need to specify which character
set it supports. If the application chooses to support and conform to
ISO/IEC 10646-1:2000, there's no guarantee that it will conform to Unicode.
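
For illustration (my sketch, not text from either standard), the full set of Unicode noncharacters mentioned here can be tested with a simple predicate:

```python
def is_noncharacter(cp: int) -> bool:
    """Unicode noncharacters: U+FDD0..U+FDEF, plus the last two
    code points of every plane (U+nFFFE and U+nFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

assert is_noncharacter(0xFFFE) and is_noncharacter(0xFFFF)
assert is_noncharacter(0x10FFFF)        # last code point of plane 16
assert not is_noncharacter(0xFFFD)      # REPLACEMENT CHARACTER is a real character
print("noncharacter predicate checks pass")
```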
 
As there's a requirement not to interpret non-Unicode characters as Unicode
characters, an application that conforms to Unicode cannot simply remap
valid ISO/IEC 10646-1:2000 characters to REPLACEMENT CHARACTER to make the
encoded text interoperable with Unicode. If it chooses to do so, it uses an
algorithm which is invalid in the scope of Unicode (so it's not a Unicode
folding), but which is valid and conforming in the ISO/IEC 10646-1:2000
universe, where it will be considered a fully compliant ISO/IEC
10646-1:2000 folding transformation.
 
When I say "folding" in the last sentence, it really has the same meaning
as in Unicode, as it does not preserve the semantics of the string and
loses information: such folding operations must then be clearly specified
as being performed out of the scope of the Unicode standard, and a folding
is not by itself an identity UTF transformation. Such an application would
then have an ISO/IEC 10646-1 input interface, but not a compliant Unicode
input interface, even though its folded output may conform to Unicode.
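
A minimal sketch of the folding described here (my illustration, not Verdy's code): scalar values beyond the Unicode range are mapped to U+FFFD REPLACEMENT CHARACTER. Since Python's `str` cannot hold values above U+10FFFF, the sketch works on code point integers; note that the mapping is lossy, exactly as the paragraph says.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER
UNICODE_MAX = 0x10FFFF

def fold_to_unicode(code_points):
    """Map any scalar value outside the Unicode range to U+FFFD.
    This loses information: distinct out-of-range values collapse
    to the same replacement, so the fold is not reversible."""
    return [cp if cp <= UNICODE_MAX else REPLACEMENT for cp in code_points]

folded = fold_to_unicode([0x41, 0x7FFFFFFF, 0x10FFFF])
print([hex(cp) for cp in folded])  # ['0x41', '0xfffd', '0x10ffff']
```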
 
Shouldn't texts coded with strict Unicode conformance then be labelled
differently from ISO/IEC 10646-1 texts, even if they share the same
transformation, simply because they don't formally share the same character
set?
 
I mean here cases like the:
    charset="UTF-8"
pseudo-attribute in XML declarations, or the:
    ; charset=UTF-8
option in MIME "text/*" content-types (in RFC 822 messages, or in HTTP
headers), or the:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
element in HTML documents... Here the "charset" is not really specifying a
character set, but only the transformation format.
 
This is probably not a problem, as long as the MIME content-type standard
clearly states that the "UTF-8" label must only be used to mean the Unicode
character set and not the ISO/IEC 10646-1:2000 character set or its
successors (I think such a thing is specified for the interpretation of the
charset pseudo-attribute of XML declarations).
 
However, if such explicit wording is missing in the MIME definition of the
charset option, how can we specify on an interface the effective charset
used by a datafile? Note that I don't say this is a problem in the Unicode
standard itself or in the ISO/IEC 10646-1:2000 standard, but a problem
specific to the MIME standard, where there's possibly an ambiguity about
the implied character set... What do you think?
 
Shouldn't Unicode ask MIME to publish a revised RFC for this case? If they
don't want to, and in fact were referring to the ISO/IEC 10646-1 standard,
then we have no choice: the MIME charset="UTF-8" option indicates ONLY
conformance to ISO/IEC 10646-1, but NOT conformance to Unicode, and we need
to register another option to indicate strict Unicode conformance.

Why not then register this MIME option
"subset=Unicode/4

Re: Tamil conjunct consonants (was: Encoding Tamil SRI)

2003-11-07 Thread Michael Everson
At 10:34 + 2003-11-07, [EMAIL PROTECTED] wrote:

> I'm still concerned about the SHRII ligature encoding, though.
> Of course, it makes sense to treat the ligature as a conjunct
> of SHA + RA + II, but since SA + RA + II seems to have been
> the "official" way to encode the ligature, the proposed
> change will break existing implementations.

That's the price of disunification. But it's the right thing to do.

> It might be best to add the new SHA character without changing
> the existing SHRII encoding (SA + RA + II).

That would be incorrect, however.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Merging combining classes

2003-11-07 Thread Michael Everson
At 19:52 -0500 2003-11-06, Jim Allan wrote:

It really isn't necessary to remind anyone that the Netherlands 
objected to adding the Romanian characters. In any case COMBINING 
CEDILLA and COMBINING COMMA BELOW were characters in Unicode 1.0.

> But Romanians are still frustrated because most fonts distributed as
> part of computer operating systems or otherwise available do not
> support these characters.

Apple does a good job. They are in many of their shipping fonts.

> Since there is no linguistic tradition in any language for _t_ with
> a cedilla shape beneath, most modern fonts display an undercomma
> beneath U+0162, U+0163 instead of a cedilla shape.

"Most"? By the way, I believe the Times Atlas of the World uses
t-cedilla in transcriptions of Arabic or Ethiopic names. I forget
which.

> There are actually three conflicting uses, since Gagauz
> traditionally uses a cedilla shape under _c_, an undercomma beneath
> _t_, and a symbol halfway between the two under _s_. See
> http://www.unicode.org/mail-arch/unicode-ml/y2002-m09/0199.html

You overstate the case. "Traditionally" is not indicated in that
posting, but only "anecdotally" with regard to some references
consulted.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Encoding Tamil SRI

2003-11-07 Thread Michael Everson
At 14:58 + 2003-11-06, [EMAIL PROTECTED] wrote:

> > Tamil SHRI [sic] can't be represented correctly in Unicode yet. It
> > will not be able to be correctly until U+0BB6 is encoded. It was
> > accepted for ballot by WG2 and UTC but has to go through the process
> > now.
>
> Proposal for adding SHA at U+0BB6 can be seen at:
> http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/n2617
> In the document, it is noted that the current practice for encoding
> SHRI in Unicode is SA+VIRAMA+RA.  Does this mean that existing
> documents/data are incorrect or will become incorrect once SHA is
> formally approved?

I think that it should. SHA is being disunified from SA in this instance.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


RE: Tamil conjunct consonants (was: Encoding Tamil SRI)

2003-11-07 Thread Marco Cimarosti
Peter Jacobi wrote:
> IMHO this doesn't fit actual Tamil use well and raises a lot
> of practical problems.
> 
> Either there must be an accepted list of these ligatures (but lists of
> archaic usage tend to grow), or one is bound to put a preemptive ZWNJ
> after every SHA VIRAMA in modern use, to prevent conjunct consonant
> forming.
> 
> If this archaic ligature problem extends to other grantha
> consonants, even more preemptive ZWNJs are necessary for
> contemporary Tamil.

"Archaic" ligatures are supposed to be present only in a font designed for
reproducing an "archaic" look. Those fonts should not be used for
typesetting modern Tamil.

There is nothing special with Tamil here: this would be true for any other
script.

E.g., if you typeset this English e-mail with a Fraktur OpenType font many
"archaic" ligatures might appear, such as "ch" or "ss". Moreover, unexpected
contextual forms could appear: e.g., the "s" in "special" could look very
different from the "s" in "ligatures" ("long s" vs. "short s").

ZWNJs etc. should be inserted only in special cases, e.g. when the presence
or absence of a ligature would change the meaning of the word, or otherwise
affect the meaning of the text.
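
A minimal sketch of the mechanism (my illustration; U+0BB6 TAMIL LETTER SHA was still only proposed in N2617 at the time, so its use here is an assumption): inserting U+200C ZERO WIDTH NON-JOINER after the virama requests the disjoint, non-ligated rendering.

```python
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER: blocks conjunct formation
VIRAMA = "\u0BCD"  # TAMIL SIGN VIRAMA

SHA = "\u0BB6"     # proposed TAMIL LETTER SHA (per N2617)
RA = "\u0BB0"      # TAMIL LETTER RA

ligated = SHA + VIRAMA + RA          # default: renderer may form the conjunct
disjoint = SHA + VIRAMA + ZWNJ + RA  # ZWNJ requests the disjoint form

print(len(ligated), len(disjoint))   # 3 4
```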

_ Marco




Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)

2003-11-07 Thread Andrew C. West
On Thu, 6 Nov 2003 12:51:53 -0500, John Cowan wrote:
> 
> IIRC we talked about this a year or so ago, and kicked around the idea that
> the Chinese square could be treated as a glyph variant of U+3013 GETA MARK,
> which looks quite different but symbolizes the same thing.

I suspect that few Chinese would be happy to see a well-known, easily-recognised
and frequently-used symbol relegated to a glyph variant of a Japanese symbol
that is unknown and unrecognised in China. There would be puzzled faces if the
geta mark appeared within Chinese text because the "wrong" font was selected. And
given that most CJK fonts aim to cover both Chinese and Japanese characters, how
would the square missing-ideograph glyph and the Japanese geta mark be
differentiated? By means of variant selectors? If you were going to use
variant selectors to differentiate the two glyphs (and neither glyph is a
variant of the other, for that matter), then you might as well encode it
separately, and be done with it!

The CJK Symbols and Punctuation block is largely Japanocentric, and I do not
think that it would hurt to add a few Chinese-specific symbols and marks. After
all, if there's room in Unicode for wheelchairs, hot beverages, umbrellas with
raindrops, hot springs, etc., you would think that room could be made for
the Chinese missing-ideograph symbol, which is used with such great frequency in
modern reprints of old texts. It is probably worthwhile making a proposal and
letting UTC/WG2 decide.

Andrew



Re: Tamil conjunct consonants (was: Encoding Tamil SRI)

2003-11-07 Thread jameskass
.
Peter Jacobi wrote,

> So, which codepoint sequence will imply the disjoint form and 
> which will imply the ligated form? If 'Indic unification' still 
> holds, the conjunct form always is the default and the disjoint 
> form needs ZWNJ. 
> 
> IMHO this doesn't fit actual Tamil use well and raises a lot of
> practical problems.
> 
> Either there must be an accepted list of these ligatures (but 
> lists of archaic usage tend to grow), or one is bound to put a 
> preemptive ZWNJ after every SHA VIRAMA in modern use, to prevent 
> conjunct consonant forming.
> 
> If this archaic ligature problem extends to other grantha
> consonants, even more preemptive ZWNJs are necessary for
> contemporary Tamil.

The Unicode string U+0BB2, U+0BC8 will display differently, depending
on which font is used.  (லை)

Code2000 will display an old-fashioned ligature glyph, Latha will
show a more modern alternative, and TabAvarangal2
( http://www.geocities.com/avarangal )
will render the string in a proposed Tamil script-reform style.

Yet, the underlying encoded character string is constant.

It may be possible and desirable to treat these archaic ligature
forms similarly.  Fonts designed for modern Tamil simply won't
include these archaic ligature glyphs, so it shouldn't be necessary
to insert ZWNJs all over the place in existing files.

Anyone seeking to reproduce a Tamil classic would need to specify
an appropriate font which includes the archaic ligatures.  Users
whose systems lacked the appropriate font would still be able
to read the document, however.

IMHO, it's important to preserve options for users to explicitly
control ligation in plain text.  With these archaic Tamil ligatures,
an author *may* elect to insert ZWNJs and other appropriate
formatting characters to preserve such distinctions where
desired.

I'm still concerned about the SHRII ligature encoding, though.
Of course, it makes sense to treat the ligature as a conjunct
of SHA + RA + II, but since SA + RA + II seems to have been
the "official" way to encode the ligature, the proposed
change will break existing implementations.

It might be best to add the new SHA character without changing
the existing SHRII encoding (SA + RA + II).
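
For concreteness, the two competing encodings can be written out (my sketch; U+0BB6 assumes the SHA code point is approved as proposed in N2617):

```python
# Existing practice: SA + VIRAMA + RA + vowel sign II
old_shrii = "\u0BB8\u0BCD\u0BB0\u0BC0"
# Proposed spelling: SHA + VIRAMA + RA + vowel sign II
new_shrii = "\u0BB6\u0BCD\u0BB0\u0BC0"

# The sequences differ in their first code point only, so a plain
# string comparison (or search) over existing data will not match
# text re-encoded with the new spelling.
print(old_shrii == new_shrii)  # False
```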

Best regards,

James Kass
.



Tamil conjunct consonants (was: Encoding Tamil SRI)

2003-11-07 Thread Peter Jacobi
Hi James, Michael, Marco, All,

Thank you for providing the references which seem to settle the
SRI /SHRI issue:

http://www.unicode.org/alloc/Pipeline.html
http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/n2617

Reading the references and James' other reply:

> Perhaps this could be stated as '... Tamil doesn't form many conjunct
> consonants'?

A more general issue asks for attention. See this snippet from N2617:

> Proposed character SHA may also form various other ligatures in
> combination with MA, YA, RA, and VA. However, these ligatures are
> archaic and are not widely recognized. Contemporary publications
> only use disjoint forms.

So, which codepoint sequence will imply the disjoint form and which will
imply the ligated form? If 'Indic unification' still holds, the conjunct
form is always the default and the disjoint form needs ZWNJ.

IMHO this doesn't fit actual Tamil use well and raises a lot of practical
problems.

Either there must be an accepted list of these ligatures (but lists of
archaic usage tend to grow), or one is bound to put a preemptive ZWNJ
after every SHA VIRAMA in modern use, to prevent conjunct consonant
forming.

If this archaic ligature problem extends to other grantha consonants,
even more preemptive ZWNJs are necessary for contemporary Tamil.

Regards,
Peter Jacobi

