Re: Corrigendum #1 (UTF-8 shortest form) wording: MIME, and software interfaces specifications
From: "Doug Ewell" <[EMAIL PROTECTED]> > Philippe Verdy wrote (in rich text): > > > Due to that, an application needs to specify whever it will support > > and comply with the full ISO/IEC 10646-1:2000 character set or to the > > Unicode subset. > > ISO/IEC 10646 has reduced its range to match Unicode's, so this > distinction is obsolete. It is not obsolete: the corrigendum #1 for UTF-8 (published in Unicode 4.0) refers to ISO/IEC 10646-1:2000, not to ISO/IEC 10646:2003 which is the character repertoire which corresponds to Unicode 4.0... So that's a reference error in the version of the now normative corrigendum published in Unicode 4.0... Does it need another Corrigendum to correct this reference in the Corrigendum? Well, I still doubt that ISO/IEC 10646 has reduced its character set. It has just agreed to limit its repertoire of _standardized_ and _interchangeable_ characters to the first 17 planes so that _these_ characters can remain in sync and encoded identically in the Unicode repertoire with the same code points, but all the other planes are still present in ISO/IEC 10646, some of them being still allocated to PUAs that don't have equivalents in Unicode, but they are still valid within UTF-8 encoded data and also still conforming to ISO/IEC 10646 (even if they are illegal for use in Unicode 4.0, these sequences are not ill-formed like non shortest forms now forbidden in both standards).
Handy table of combining character classes
Here's a little table of the combining classes, showing the value, the number of characters in the class, and a handy name (typically the one used in the Unicode Standard, or a CODE POINT NAME if there is only one; sometimes of my own invention).

    Class  Count  Name
    =====  =====  ====
        0    589  Class Zero
        1     16  Overlays
        7      7  Nuktas
        8      2  Japanese Sound Marks
        9     16  Viramas
       10      1  HEBREW POINT SHEVA
       11      1  HEBREW POINT HATAF SEGOL
       12      1  HEBREW POINT HATAF PATAH
       13      1  HEBREW POINT HATAF QAMATS
       14      1  HEBREW POINT HIRIQ
       15      1  HEBREW POINT TSERE
       16      1  HEBREW POINT SEGOL
       17      1  HEBREW POINT PATAH
       18      1  HEBREW POINT QAMATS
       19      1  HEBREW POINT HOLAM
       20      1  HEBREW POINT QUBUTS
       21      1  HEBREW POINT DAGESH OR MAPIQ
       22      1  HEBREW POINT METEG
       23      1  HEBREW POINT RAFE
       24      1  HEBREW POINT SHIN DOT
       25      1  HEBREW POINT SIN DOT
       26      1  HEBREW POINT JUDEO-SPANISH VARIKA
       27      1  ARABIC FATHATAN
       28      1  ARABIC DAMMATAN
       29      1  ARABIC KASRATAN
       30      1  ARABIC FATHA
       31      1  ARABIC DAMMA
       32      1  ARABIC KASRA
       33      1  ARABIC SHADDA
       34      1  ARABIC SUKUN
       35      1  ARABIC LETTER SUPERSCRIPT ALEF
       36      1  SYRIAC LETTER SUPERSCRIPT ALAPH
       84      1  TELUGU LENGTH MARK
       91      1  TELUGU AI LENGTH MARK
      103      2  Thai Sara U/UU
      107      4  Thai Tone Marks
      118      2  Lao U/UU
      122      4  Lao Tone Marks
      129      1  TIBETAN VOWEL SIGN AA
      130      6  Various Tibetan Vowels
      132      1  TIBETAN VOWEL SIGN U
      202      4  Below Attached
      216      9  Above Right Attached
      218      1  Below Left
      220     81  Below
      222      4  Below Right
      224      2  Left
      226      1  Right
      228      3  Above Left
      230    147  Above
      232      3  Above Right
      233      2  Double Below
      234      4  Double Above
      240      1  COMBINING GREEK YPOGEGRAMMENI

--
John Cowan  <[EMAIL PROTECTED]>  http://www.ccil.org/~cowan
"One time I called in to the central system and started working on a big
thick 'sed' and 'awk' heavy duty data bashing script. One of the geologists
came by, looked over my shoulder and said 'Oh, that happens to me too.
Try hanging up and phoning in again.'"  --Beverly Erlebacher
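As an aside on querying these classes programmatically: Python's standard unicodedata module exposes a character's canonical combining class; the small sketch below is illustrative only (and the per-class counts one could recompute with it depend on the Unicode version bundled with the interpreter, so they may not match the table above).

    import unicodedata

    # A few representative combining characters and their canonical combining classes.
    for ch in ("\u0301",    # COMBINING ACUTE ACCENT                        -> class 230 (Above)
               "\u05B0",    # HEBREW POINT SHEVA                            -> class 10
               "\u3099",    # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK -> class 8
               "\u094D",    # DEVANAGARI SIGN VIRAMA                        -> class 9 (Viramas)
               "\u0327"):   # COMBINING CEDILLA                             -> class 202 (Below Attached)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: ccc={unicodedata.combining(ch)}")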
Re: Corrigendum #1 (UTF-8 shortest form) wording: MIME, and software interfaces specifications
Philippe Verdy wrote (in rich text):

> Due to that, an application needs to specify whether it will support
> and comply with the full ISO/IEC 10646-1:2000 character set or with the
> Unicode subset.

ISO/IEC 10646 has reduced its range to match Unicode's, so this distinction is obsolete.

More later. Maybe.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)
Andrew C. West wrote:

> And given that most CJK fonts aim to cover both Chinese and Japanese
> characters, how would the square missing ideograph glyph and the
> Japanese geta mark be differentiated? By means of variant selectors?

In the Windows world at least, most fonts that include any CJK characters either:

(1) are clearly aimed at Chinese, like SimSun, or
(2) are clearly aimed at Japanese, like Mincho, or
(3) aim to cover as much of Unicode as possible, like Arial Unicode MS and Code2000, and thus really can't be considered "CJK fonts" per se.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
Corrigendum #1 (UTF-8 shortest form) wording: MIME, and software interfaces specifications
I see this sentence in the last paragraph:

    The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows
    for the use of five- and six-byte sequences to encode characters that
    are outside the range of the Unicode character set; those five- and
    six-byte sequences are illegal for the use of UTF-8 *AS A TRANSFORMATION
    OF _UNICODE_ CHARACTERS*. (...)

The overall interpretation of this paragraph thus defines Unicode as a subset of ISO/IEC 10646-1:2000, covering the first 17 planes, where Unicode and ISO/IEC 10646-1:2000 are fully interoperable. It does NOT say that five- and six-byte sequences are illegal for the use of UTF-8 *AS A TRANSFORMATION OF _ISO/IEC 10646-1:2000_ CHARACTERS*.

Due to that, an application needs to specify whether it will support and comply with the full ISO/IEC 10646-1:2000 character set or with the Unicode subset. As both standards use "UTF-8" as the name of the transformation, and the transformation is in fact defined in ISO/IEC 10646-1:2000, it seems that there is no restriction on the length of UTF-8 sequences, only restrictions on their use to encode characters in the Unicode subset. This leaves open the opportunity to encode *non-Unicode* characters of *ISO/IEC 10646-1:2000*, i.e. characters outside its first 17 planes, which must not be interpreted as valid Unicode characters but can still be interpreted as valid ISO/IEC 10646-1:2000 characters.

Then later, we have this final sentence:

    (...) ISO/IEC 10646 does not allow mapping of unpaired surrogates, nor
    U+FFFE and U+FFFF (but it does allow other noncharacters).

Here too there is a difference: noncharacters are explicitly said to be *non-Unicode* characters (i.e. they must not be interpreted as valid Unicode characters, not even as the replacement character), but they can still be interpreted as valid ISO/IEC 10646-1:2000 characters if that standard allows it (and it seems to allow them in UTF-8 transformed strings). Here too an application will need to specify which character set it supports. If the application chooses to support and conform to ISO/IEC 10646-1:2000, there is no guarantee that it will conform to Unicode.

As there is a requirement not to interpret non-Unicode characters as Unicode characters, an application that conforms to Unicode cannot simply remap valid ISO/IEC 10646-1:2000 characters to REPLACEMENT CHARACTER to make the encoded text interoperable with Unicode. If it chooses to do so, it uses an algorithm which is invalid within the scope of Unicode (so it is not a Unicode folding), but which is valid and conforming in the ISO/IEC 10646-1:2000 universe, where it would be considered a fully compliant ISO/IEC 10646-1:2000 folding transformation.

When I say "folding" in the last sentence, it has the same meaning as in Unicode: it does not preserve the semantics of the string and loses information. Such folding operations must therefore be clearly specified as being performed outside the scope of the Unicode standard, and they are not by themselves an identity UTF transformation. Such an application would then have an ISO/IEC 10646-1 input interface, but not a compliant Unicode input interface, even though its folded output may conform to Unicode.

Shouldn't texts coded with strict Unicode conformance then be labelled differently from ISO/IEC 10646-1 texts, even if they share the same transformation, simply because they don't formally share the same character set?
I mean cases like the encoding="UTF-8" pseudo-attribute in XML declarations, the "; charset=UTF-8" parameter in MIME "text/*" content-types (in RFC 822 messages, or in HTTP headers), or the charset declared in a <meta> element in HTML documents... Here the "charset" is not really specifying a character set, but only the transformation format.

This is probably not a problem, as long as the MIME content-type standard clearly states that the "UTF-8" label must only be used to mean the Unicode character set and not the ISO/IEC 10646-1:2000 character set or its successors (I think something like this is specified for the interpretation of the encoding pseudo-attribute of XML declarations). However, if such explicit wording is missing from the MIME definition of the charset parameter, how can we specify on an interface the effective charset used by a data file?

Note that I am not saying this is a problem in the Unicode standard itself or in the ISO/IEC 10646-1:2000 standard, but a problem specific to the MIME standard, where there is possibly an ambiguity about the implied character set... What do you think? Shouldn't Unicode ask MIME to publish a revised RFC for this case? If they don't want to, and were in fact referring to the ISO/IEC 10646-1 standard, then we have no choice: the MIME charset="UTF-8" parameter indicates ONLY conformance to ISO/IEC 10646-1, NOT conformance to Unicode, and we need to register another option to indicate strict Unicode conformance. Why not then register this MIME option "subset=Unicode/4
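Since the whole question is which repertoire a decoded value belongs to, here is a small, purely illustrative Python sketch that classifies a code point along the lines discussed above; the function and its category labels are my own, not standard terms from either standard.

    def classify(cp: int) -> str:
        """Classify a code point per the distinctions discussed in this thread."""
        if 0xD800 <= cp <= 0xDFFF:
            return "surrogate code point (excluded from well-formed UTF-8)"
        if cp > 0x10FFFF:
            return "beyond Unicode's 17 planes (old ISO/IEC 10646-1:2000 UCS-4 space only)"
        if (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
            return "noncharacter (a valid scalar value, but not for open interchange)"
        return "Unicode scalar value"

    for cp in (0x0041, 0xFFFE, 0xFDD0, 0xD800, 0x10FFFF, 0x110000, 0x7FFFFFFF):
        print(f"U+{cp:04X}: {classify(cp)}")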
Re: Tamil conjunct consonants (was: Encoding Tamil SRI)
At 10:34 + 2003-11-07, [EMAIL PROTECTED] wrote:

> I'm still concerned about the SHRII ligature encoding, though. Of course,
> it makes sense to treat the ligature as a conjunct of SHA + RA + II, but
> since SA + RA + II seems to have been the "official" way to encode the
> ligature -- the proposed change will break existing implementations.

That's the price of disunification. But it's the right thing to do.

> It might be best to add the new SHA character without changing the
> existing SHRII encoding (SA + RA + II).

That would be incorrect, however.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Merging combining classes
At 19:52 -0500 2003-11-06, Jim Allan wrote:

> It really isn't necessary to remind anyone that the Netherlands objected
> to adding the Romanian characters. In any case COMBINING CEDILLA and
> COMBINING COMMA BELOW were characters in Unicode 1.0. But Romanians are
> still frustrated because most fonts distributed as part of computer
> operating systems or otherwise available do not support these characters.

Apple does a good job. They are in many of their shipping fonts.

> Since there is no linguistic tradition in any language for _t_ with a
> cedilla shape beneath, most modern fonts display an undercomma beneath
> U+0162, U+0163 instead of a cedilla shape.

"Most"? By the way, I believe the Times Atlas of the World uses t-cedilla in transcriptions of Arabic or Ethiopic names. I forget which.

> There are actually three conflicting uses, since Gagauz traditionally uses
> a cedilla shape under _c_, an undercomma beneath _t_, and a symbol halfway
> between the two under _s_. See
> http://www.unicode.org/mail-arch/unicode-ml/y2002-m09/0199.html

You overstate the case. "Traditionally" is not indicated in that posting, but only "anecdotally" with regard to some references consulted.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
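For reference (an illustrative aside, not part of the exchange above): the cedilla and comma-below forms are separately encoded precomposed letters, and the distinction is visible in their canonical decompositions. A minimal Python sketch:

    import unicodedata

    # T with cedilla (U+0162/U+0163) vs. T with comma below (U+021A/U+021B):
    # their NFD forms use COMBINING CEDILLA (U+0327) and COMBINING COMMA BELOW
    # (U+0326) respectively.
    for ch in ("\u0162", "\u021A", "\u0163", "\u021B"):
        nfd = unicodedata.normalize("NFD", ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ->",
              " ".join(f"U+{ord(c):04X}" for c in nfd))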
RE: Encoding Tamil SRI
At 14:58 + 2003-11-06, [EMAIL PROTECTED] wrote:

> > Tamil SHRI [sic] can't be represented correctly in Unicode yet. It will
> > not be able to be represented correctly until U+0BB6 is encoded. It was
> > accepted for ballot by WG2 and UTC but has to go through the process now.
> > Proposal for adding SHA at U+0BB6 can be seen at:
> > http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/n2617
>
> In the document, it is noted that the current practice for encoding SHRI
> in Unicode is SA+VIRAMA+RA. Does this mean that existing documents/data
> are incorrect or will become incorrect once SHA is formally approved?

I think that it should. SHA is being disunified from SA in this instance.

-- Michael Everson * * Everson Typography * * http://www.evertype.com
RE: Tamil conjunct consonants (was: Encoding Tamil SRI)
Peter Jacobi wrote:

> IMHO this doesn't fit actual Tamil use well and raises a lot of practical
> problems.
>
> Either there must be an accepted list of these ligatures (but lists of
> archaic usage tend to grow), or one is bound to put a preemptive ZWNJ
> after every SHA VIRAMA in modern use, to prevent conjunct consonants from
> forming.
>
> If this archaic ligature problem extends to other grantha consonants,
> even more preemptive ZWNJs are necessary for contemporary Tamil.

"Archaic" ligatures are supposed to be present only in a font designed for reproducing an "archaic" look. Those fonts should not be used for typesetting modern Tamil.

There is nothing special about Tamil here: this would be true for any other script. E.g., if you typeset this English e-mail with a Fraktur OpenType font, many "archaic" ligatures might appear, such as "ch" or "ss". Moreover, unexpected contextual forms could appear: e.g., the "s" in "special" could look very different from the "s" in "ligatures" ("long s" vs. "short s").

ZWNJs etc. should be inserted only in special cases, e.g. when the presence or absence of a ligature would change the meaning of the word, or otherwise affect the meaning of the text.

_ Marco
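To make the ZWNJ usage concrete, here is a small illustrative sketch (my own, not from the post; it uses the code points discussed in this thread, and U+0BB6 was still only proposed at the time) of how a ZERO WIDTH NON-JOINER placed after the virama requests the disjoint form rather than a conjunct:

    # TAMIL LETTER SHA (U+0BB6, proposed at the time of this thread),
    # TAMIL SIGN VIRAMA (U+0BCD), TAMIL LETTER RA (U+0BB0); ZWNJ is U+200C.
    SHA, VIRAMA, RA, ZWNJ = "\u0BB6", "\u0BCD", "\u0BB0", "\u200C"

    conjunct_preferred = SHA + VIRAMA + RA          # renderer may ligate if the font has the form
    disjoint_requested = SHA + VIRAMA + ZWNJ + RA   # ZWNJ after the virama blocks the conjunct

    print([f"U+{ord(c):04X}" for c in disjoint_requested])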
Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)
On Thu, 6 Nov 2003 12:51:53 -0500, John Cowan wrote:

> IIRC we talked about this a year or so ago, and kicked around the idea that
> the Chinese square could be treated as a glyph variant of U+3013 GETA MARK,
> which looks quite different but symbolizes the same thing.

I suspect that few Chinese would be happy to see a well-known, easily-recognised and frequently-used symbol relegated to a glyph variant of a Japanese symbol that is unknown and unrecognised in China. There would be puzzled faces if the geta mark appeared within Chinese text because the "wrong" font was selected.

And given that most CJK fonts aim to cover both Chinese and Japanese characters, how would the square missing ideograph glyph and the Japanese geta mark be differentiated? By means of variant selectors? If you were going to use variant selectors to differentiate the two glyphs (and neither glyph is a variant of the other, for that matter), then you might as well encode it separately and be done with it!

The CJK Symbols and Punctuation block is largely Japanocentric, and I do not think that it would hurt to add a few Chinese-specific symbols and marks - after all, if there's room in Unicode for wheelchairs, hot beverages, umbrellas with raindrops, hot springs, etc. etc., you would think that room could be made for the Chinese missing ideograph symbol, which is used with such great frequency in modern reprints of old texts.

Probably worthwhile making a proposal and letting UTC/WG2 decide.

Andrew
Re: Tamil conjunct consonants (was: Encoding Tamil SRI)
Peter Jacobi wrote,

> So, which codepoint sequence will imply the disjoint form and which will
> imply the ligated form? If 'Indic unification' still holds, the conjunct
> form is always the default and the disjoint form needs ZWNJ.
>
> IMHO this doesn't fit actual Tamil use well and raises a lot of practical
> problems.
>
> Either there must be an accepted list of these ligatures (but lists of
> archaic usage tend to grow), or one is bound to put a preemptive ZWNJ
> after every SHA VIRAMA in modern use, to prevent conjunct consonants from
> forming.
>
> If this archaic ligature problem extends to other grantha consonants,
> even more preemptive ZWNJs are necessary for contemporary Tamil.

The Unicode string U+0BB2, U+0BC8 will display differently depending on which font is used. (லை) Code2000 will display an old-fashioned ligature glyph, Latha will show a more modern alternative, and TabAvarangal2 ( http://www.geocities.com/avarangal ) will render the string in a proposed Tamil script-reform style. Yet the underlying encoded character string is constant.

It may be possible and desirable to treat these archaic ligature forms similarly. Fonts designed for modern Tamil simply won't include these archaic ligature glyphs, so it shouldn't be necessary to insert ZWNJs all over the place in existing files. Anyone seeking to reproduce a Tamil classic would need to specify an appropriate font which includes the archaic ligatures. Users whose systems lacked the appropriate font would still be able to read the document, however.

IMHO, it's important to preserve options for users to explicitly control ligation in plain text. With these archaic Tamil ligatures, an author *may* elect to insert ZWNJs and other appropriate formatting characters to preserve such distinctions where desired.

I'm still concerned about the SHRII ligature encoding, though. Of course, it makes sense to treat the ligature as a conjunct of SHA + RA + II, but since SA + RA + II seems to have been the "official" way to encode the ligature, the proposed change will break existing implementations. It might be best to add the new SHA character without changing the existing SHRII encoding (SA + RA + II).

Best regards,

James Kass
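For reference, the two competing spellings of the SHRII ligature look like this in code point terms. This is a small illustrative sketch of my own; U+0BB6 TAMIL LETTER SHA was, at the time of this thread, still only a proposal.

    SA, SHA, VIRAMA, RA, II = "\u0BB8", "\u0BB6", "\u0BCD", "\u0BB0", "\u0BC0"

    shrii_current  = SA  + VIRAMA + RA + II   # the "official" practice discussed above
    shrii_proposed = SHA + VIRAMA + RA + II   # spelling enabled by the proposed U+0BB6

    for label, s in (("current", shrii_current), ("proposed", shrii_proposed)):
        print(label, [f"U+{ord(c):04X}" for c in s])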
Tamil conjunct consonants (was: Encoding Tamil SRI)
Hi James, Michael, Marco, All,

Thank you for providing the references, which seem to settle the SRI/SHRI issue:

http://www.unicode.org/alloc/Pipeline.html
http://wwwold.dkuug.dk/JTC1/SC2/WG2/docs/n2617

Reading the references and James' other reply:

> Perhaps this could be stated as '... Tamil doesn't form many conjunct
> consonants'?

A more general issue calls for attention. See this snippet from N2617:

> Proposed character SHA may also form various other ligatures in combination
> with MA, YA, RA, and VA. However, these ligatures are archaic and are not
> widely recognized. Contemporary publications only use disjoint forms.

So, which codepoint sequence will imply the disjoint form and which will imply the ligated form? If 'Indic unification' still holds, the conjunct form is always the default and the disjoint form needs ZWNJ.

IMHO this doesn't fit actual Tamil use well and raises a lot of practical problems.

Either there must be an accepted list of these ligatures (but lists of archaic usage tend to grow), or one is bound to put a preemptive ZWNJ after every SHA VIRAMA in modern use, to prevent conjunct consonants from forming.

If this archaic ligature problem extends to other grantha consonants, even more preemptive ZWNJs are necessary for contemporary Tamil.

Regards,
Peter Jacobi