Re: Small Latin Letter m with Macron
Christoph Päper asked:

> I recently learned in news:de.etc.sprache.deutsch that there has been a tradition (in handwritten text more than in print) of writing "mm" as only one m with a macron above. I can't find any such character in Unicode, just U+1E3F and U+1E41. You could of course build something similar with m + U+0305 to resemble the look, but that won't become "mm" (just "m" or "m¯") after a conversion to e.g. ISO 8859-1. Should such a character be added to Unicode (or did I miss it)?

Neither. Handwritten forms and arbitrary manuscript abbreviations should not be encoded as characters. The text should just be represented as m + m. Then, if you wish to *render* such text in a font which mimics this style of handwriting and uses such abbreviations, you would need the font to ligate "mm" sequences into a *glyph* showing an m with an overbar.

To do otherwise, representing the plain text content either as m + combining macron or with a newly encoded m-macron character, would just distort the *content* of the text, which is what the character encoding should be about.

If and only if an m-macron became a part of the accepted, general orthography of German would it make sense to start representing textual content in terms of such a character. And in that hypothetical future, you would use m + combining macron, because it already exists in Unicode, and there is no point in encoding another canonically equivalent precomposed character for that sequence.

--Ken
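Ken's last point, that a sequence like m + combining macron would stay decomposed, can be checked with a short Python sketch using the standard unicodedata module (the m-macron example here is illustrative only):

```python
import unicodedata

# m followed by U+0304 COMBINING MACRON, the closest encoded analogue
# of the handwritten overbar (U+0305 COMBINING OVERLINE behaves the same)
s = "m\u0304"

# There is no precomposed "m with macron" in Unicode, so canonical
# composition (NFC) leaves the sequence decomposed, two code points long.
composed = unicodedata.normalize("NFC", s)
assert composed == s and len(composed) == 2

# By contrast, U+1E3F LATIN SMALL LETTER M WITH ACUTE does exist,
# so m + U+0301 COMBINING ACUTE ACCENT composes to a single character.
assert unicodedata.normalize("NFC", "m\u0301") == "\u1e3f"
```

This is exactly why a newly encoded precomposed m-macron would buy nothing: normalization already defines the decomposed sequence as the canonical representation.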
Re: U+2047 double question mark collation
Vadim wrote:

> I have a problem with creating a collation key for U+2047 (double question mark). Explicit collation keys for this symbol are absent in allkeys.txt.

allkeys.txt in the current version of the Unicode Collation Algorithm is based on the Unicode *3.1* repertoire. This can be seen in the references section of UTS #10, where the version is explicitly listed as allkeys-3.1.1.txt. U+2047 is a character added in Unicode Version *3.2*.

> In UnicodeData.txt this symbol has a compatibility decomposition mapping: 2047;...;<compat> 003F 003F;...

True.

> Based on this, and as defined in UTS #10, Unicode Collation Algorithm, this symbol must have these collation keys: 003F [*024E.0020.0004] 003F [*024E.0020.0004]. But CollationTest_NON_IGNORABLE.txt assumes that the symbol has the implicit collation key [FBC0.0020.0002] [A047..].

CollationTest_NON_IGNORABLE.txt is also based on the Unicode 3.1 repertoire. For a Unicode 3.1 implementation of collation, U+2047 is a reserved code point.

This situation, where the allkeys.txt table is slightly out of synch with (behind) the ongoing repertoire additions to the Unicode Standard, is a known problem we are working on. The Unicode Technical Committee has mandated that the repertoire for the allkeys.txt table be updated directly to the Unicode 4.0 repertoire, as soon after the release of Unicode 4.0 as possible. We are trying to do this more or less simultaneously this time, but there may be a small delay, given the scope of the upcoming Unicode 4.0 release.

In the meantime, if you need to deal with the Unicode 3.2 character additions for collation, then you need to handle them in terms of tailorings from the current allkeys.txt table.

--Ken
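The compatibility decomposition that drives this collation behavior is easy to verify; a minimal Python check using the standard unicodedata module:

```python
import unicodedata

# U+2047 DOUBLE QUESTION MARK carries a compatibility (not canonical)
# decomposition to two QUESTION MARKs, as cited from UnicodeData.txt.
assert unicodedata.decomposition("\u2047") == "<compat> 003F 003F"

# NFKD therefore turns it into "??", which is why, per UTS #10,
# its derived collation elements are those of two question marks.
assert unicodedata.normalize("NFKD", "\u2047") == "??"

# The canonical decomposition (NFD) leaves it alone.
assert unicodedata.normalize("NFD", "\u2047") == "\u2047"
```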
RE: h in Greek epigraphy
A correspondent noted:

> BTW, the introductory sentence on page 360 of TUS 3 seems strange. It says that IPA includes "basic Latin letters and a number of Latin letters from other blocks" and then puts four Greek letters in the list! Should this be changed to something like "IPA includes basic Latin letters and a number of other Latin and Greek letters"?

Noted for fix by the editors.

--Ken
Re: h in Greek epigraphy
David wrote:

> My first answer to my correspondent was "just use Roman h".

That would be my suggestion, too. It is available now, it matches current practice, and it requires no further action.

> A program that was sorting text, or trying to determine what script a word was written in, would get confused by hε̄γεμο̄ν.

As for sorting: if you are sorting epigraphical Greek, you likely need customized tables anyway. Just add h and treat it appropriately.

As for determination of script, you need to ask yourself, for what purpose? If this is something like regular expression matching, then again, it doesn't matter so much. You would just attempt to match against strings containing letters of the Greek script plus h, and you'd get what you expect.

> Would this justify a proposal for Greek small letter epigraphical h?

I don't think so. Not unless you can demonstrate that this really is a distinct character, as opposed to a special usage of the already existing Latin h, which is what it seems to be.

--Ken
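Ken's matching point can be sketched in Python: treat the target set as "letters of the Greek script, plus Latin h". The helper below is a hypothetical illustration, not a standard API; it keys off character names from the standard unicodedata module rather than a real script property.

```python
import unicodedata

def is_epigraphic_greek_word(word: str) -> bool:
    """True if every character is Greek, a combining mark, or Latin 'h'."""
    for ch in word:
        if ch == "h":            # the borrowed Latin letter
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("GREEK") or name.startswith("COMBINING"):
            continue
        return False
    return True

# h + epsilon + macron + gamma + epsilon + mu + omicron + macron + nu
assert is_epigraphic_greek_word("h\u03b5\u0304\u03b3\u03b5\u03bc\u03bf\u0304\u03bd")
assert not is_epigraphic_greek_word("hegemon")   # plain Latin word
```

The point stands: once the matcher explicitly admits Latin h alongside Greek letters, epigraphical text poses no special problem.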
RE: Precomposed Tibetan
Peter Lofting asked:

> Presumably the present proposal of 900+ stacks is a maturation of the same system. And the claim for universality is based on it being able to typeset everything they have published to date.

It is based on the Founders system software, as Michael mentioned.

> The question is whether that list of texts is representative of the full literary and linguistic corpus

It is not.

> or is only a sub-set?

It is. The Chinese delegation admitted that the collection of stacks was aimed at modern Tibetan use and would not cover literary Tibetan. This means that in practice systems based on the current Founders system technology would be restricted in their coverage, and that Unicode-based systems would have to deal with *both* the precomposed stacks and the rest of Tibetan, leading to Hangul-like normalization nightmares.

> Could the Chinese be asked to provide detailed information on this system and the texts that it has published so we can get an idea of the domain that their stack set covers?

They were asked some questions during the meeting. The correct way to proceed now is to provide national body feedback on their proposal. Such feedback can, of course, contain such questions regarding the intended scope of coverage of the repertoire in the proposal.

--Ken
RE: Precomposed Tibetan
Marco commented:

> Another key point, IMHO, is verifying the following claim contained in the proposal document:
>
> > Tibetan BrdaRten characters are structure-stable characters widely used in education, publication, classics documentation including Tibetan medicine. The electronic data containing BrdaRten characters are estimated beyond billions. Once the Tibetan BrdaRten characters are encoded ^ in BMP, many current systems supporting ISO/IEC 10646 will enable Tibetan processing without major modification. Therefore, the international standard ^^ Tibetan BrdaRten characters will speed up the standardization and digitalization of Tibetan information, keep the consistency of implementation level of Tibetan and other scripts, develop the Tibetan culture and make the Tibetan culture resources shared by the world.
>
> [BTW, billions of what!?]

The Chinese delegation at the WG2 meeting agreed with a restatement of this as "gigabytes of data". Exactly what kind of data, they did not say, but in principle that could consist of a few medium-size databases. It almost certainly does not consist of billions of *documents*.

> I'd propose the following:
>
> 1. Find all the available technical details about this BrdaRten encoding.

One additional detail for people: the BrdaRten stacks are currently implemented, in the Founders System software in Tibet, as an extension to GB 2312.

> 2. Come up with a precise machine-readable mapping file between the BrdaRten encoding and *decomposed* Unicode Tibetan, possibly accompanied by a sample conversion application. Reasons: (a) to make it easy to migrate BrdaRten legacy data to Unicode; (b) to easily update existing BrdaRten applications to export Unicode text; (c) to easily retrofit new Unicode applications to import BrdaRten text.

See the key words "without major modification" above. If the BrdaRten stacks were encoded in Unicode, they would automatically become part of GB 18030 (because of the UTF-like nature of that strange standard).
However, the catch is that the actual code points for Unicode/10646 are not predictable or controllable by the Chinese NB. That means that the final code points in GB 18030 are also not predictable, and almost certainly are not the same as those used by the current GB 2312 extension in Tibet. And *that* means that the current "characters ... estimated beyond billions" will have to be migrated to a new encoding anyway, once the systems are updated to GB 18030. That is the reason for the quibble word "major" in the phrase above. All the data will be reencoded, but the transition from GB 2312 + Tibetan extension to GB 18030 containing the Tibetan extension is viewed as just a mapping and not a "major" system modification.

The alternative (and even scarier) prospect is that the existing GB 2312 Tibetan extension code points would be forced as is into a new version of GB 18030, invalidating the mapping for the existing code points, and creating a completely new version of GB 18030 that would have to be supported as a different code page from the existing GB 18030. This would start us down the road to an indefinite number of distinct GB 18030 mappings, probably not properly labeled in interchange, with large numbers of interoperability problems predictable (and likely to dwarf the JIS yen sign/backslash kinds of problems). The reason this prospect is even thinkable is that any existing implementation of the BrdaRten stacks in a GB 2312 extension would surely be using 2-byte character encodings, and a transition to 4-byte GB 18030 character encodings would likely disrupt the existing implementations significantly.
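The UTF-like nature of GB 18030 is easy to observe: every Unicode code point has a defined GB 18030 byte sequence, and BMP characters outside the legacy GBK repertoire (Tibetan included) map to four-byte sequences. A quick check using Python's bundled gb18030 codec:

```python
# U+0F40 TIBETAN LETTER KA is outside the legacy GBK/GB 2312 two-byte
# repertoire, so GB 18030 encodes it as a four-byte sequence.
ka = "\u0f40"
encoded = ka.encode("gb18030")
assert len(encoded) == 4

# ASCII and GBK-era characters keep their short legacy encodings.
assert len("A".encode("gb18030")) == 1

# The mapping round-trips, which is what makes GB 18030 UTF-like.
assert encoded.decode("gb18030") == ka
```

Note this shows the *standard* GB 18030 mapping for decomposed Tibetan; the point of the passage above is that code points for any newly encoded precomposed stacks would differ from the private GB 2312 extension now deployed.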
The question for Unicoders is whether the introduction of significant normalization problems into Tibetan (for everyone) is a worthwhile tradeoff for this claimed legacy ease of transition for one system, when it is clear that all existing legacy data using these precomposed stacks is going to have to be reencoded anyway (or surrounded by migration filters for new systems).

--Ken
Re: Localized names of character ranges
Doug, seconding a suggestion by Marco, wrote:

> I agree that a multilingual Unicode glossary should be assembled (possibly as a volunteer project) and officially endorsed by the Unicode Consortium, so users and vendors will be on common terminological ground.

In general, I favor such an activity, although at the moment it would have to be something done by outside volunteers, as the UTC editorial committee doesn't have the bandwidth now (in the crunch for Unicode 4.0) to undertake more open-ended responsibilities.

My caution, however, is that the terminology used by the Unicode Standard is still evolving, as witness the ongoing arguments about some of the terminology related to the character encoding model. The glossary in Unicode 4.0 will be substantially revised in some of the key points having a bearing on the Unicode encoding model. And as more content is added to the standard, additional terms keep accumulating in the glossary as well. And it will be some time before the online glossary can be completely synched back up with the Unicode 4.0 glossary.

Once people start maintaining a multilingual glossary based on the online glossary (or supplemented from other sources), the burden of maintenance will escalate rapidly for any change introduced to terminology. These things only work if there is an ongoing institutional commitment to maintenance and updates. Otherwise all the translated versions start to get out of synch quickly, both with the English original and with each other. This can lead to dangerous misunderstandings among people who assume that their own translated version is accurate.

So if anyone wants to undertake such an effort, don't forget to provide for ongoing maintenance, and for the fact that eager volunteers tend to drop like flies when repeatedly forced to update their work at irregular intervals.

--Ken
Re: Default properties for PUA characters???
Christian Wittern asked:

> Leaving aside the red light that flashed in my head on the notion of the W3C recommending PUA (for interchange?), I was wondering about the notion of PUA characters being by Unicode defaults treated as ideographs. Is there a canonical reference for this? Just wondering,

Many Unicode character properties are actually code point properties. They must partition the entire Unicode codespace, so that an API can return a meaningful value for any code point, including PUA and unassigned code points, not just for assigned characters. Because of this, the Unicode Standard now has a concept of a default property value, which applies to code points which are not otherwise given an explicit value for that property.

In the case of PUA characters, the Unicode Character Database gives them all the same properties. Some of the most important of those properties are:

  gc=Co   (general category = Private_Use)
  ccc=0   (combining class = 0, i.e. Not_Reordered)
  bc=L    (bidi class = strong Left_To_Right)
  sc=Zyyy (script = Common)
  lb=XX   (line break = Unknown)
  ea=A    (east asian width = Ambiguous)

For ideographs, which also all have the same properties, the relevant, corresponding properties are:

  gc=Lo   (general category = Other_Letter)
  ccc=0   (combining class = 0, i.e. Not_Reordered)
  bc=L    (bidi class = strong Left_To_Right)
  sc=Hani (script = Han)
  lb=ID   (line break = Ideographic)
  ea=W    (east asian width = Wide)

Thus, while in some respects the PUA characters are, by default, like ideographs (they are all base characters and are treated as left-to-right for bidi purposes), in other respects their properties differ. In particular, with respect to line-breaking, UAX #14 currently states for lb=XX:

> The default behavior for [XX] is identical to class AL. [i.e. alphabetic characters] ... In addition, implementations can override or tailor this default behavior, e.g. by assigning characters the property ID or another class, if that is likely to give the correct default behavior for their users, or use other means to determine the correct behavior. For example, one implementation might treat any private use character in ideographic context as ID, while another implementation might support a method for assigning specific properties to specific definitions of private use characters. The details of such use of private use characters are outside the scope of this standard.

So I'd say that the XML Core WG has got the situation only partially correct for Unicode PUA characters.

--Ken
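Several of the default property values listed above are queryable from Python's standard unicodedata module (script and line-break class are not exposed there, so this sketch covers only general category, combining class, bidi class, and East Asian width):

```python
import unicodedata

pua = "\ue000"   # a Private Use Area code point
han = "\u4e00"   # CJK UNIFIED IDEOGRAPH-4E00

# General category: Private_Use (Co) vs Other_Letter (Lo)
assert unicodedata.category(pua) == "Co"
assert unicodedata.category(han) == "Lo"

# Both have combining class 0 (Not_Reordered) and strong-L bidi class
assert unicodedata.combining(pua) == 0 == unicodedata.combining(han)
assert unicodedata.bidirectional(pua) == "L" == unicodedata.bidirectional(han)

# East Asian width differs: Ambiguous vs Wide
assert unicodedata.east_asian_width(pua) == "A"
assert unicodedata.east_asian_width(han) == "W"
```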
Re: mixed-script writing systems
Dean Snyder asked, regarding my earlier statement:

> > What it comes down to is the fact that for historic scripts in particular, there are no defined criteria that would enable us to simply *discover* the right answer regarding the identity of scripts. To a certain extent, the encoding committees need to make arbitrary partitions of historic alphabets through time and space, reflecting scholarly praxis as far as feasible, and then live with the results. At least this procedure makes it *possible* to represent the texts reliably, once the scripts and their variants have been standardized.
>
> What are the criteria used to make these arbitrary partitions?

I have to return to my statement above. There are no defined criteria, at least not in the sense of some formally defined set of criteria which could be objectively applied by graphologists to come up with the right answer. As for many issues, particularly regarding ancient systems, there are a lot of historical contingencies which intervene: what attestations managed to survive, and what kinds of material they consist of. And equally important may be the particular twists and turns that analysis of the materials took. Writing systems which require long, problematical, and in some cases uncertain decipherments may end up with different encoding needs than systems where the nature of the units may not be at issue. And answers may depend on the nature of the historic *successors* of the attestations as well, since boundaries between systems and the nature of the encoding decided upon may then be influenced by the encoding of the successor systems.

> What is determinative of scholarly praxis?

Consensus among the expert practitioners. The character encoding committees make an effort to ensure that there is some evidence of such consensus, when expert opinion is available. Otherwise there would be little point in attempting to standardize character encoding.
In the case of Sumero-Akkadian, it seems to me that there was, for example, some evident consensus among experts that it made sense to specify that as a script for encoding, leaving open the question of where to draw the boundary for early Sumerian on the one hand, and differentiating later adaptations of cuneiform which were clearly not Sumero-Akkadian per se, such as Ugaritic. But if that is *not* the consensus among Assyriologists, then any determination as to where to draw the boundaries would have to await the emergence of such consensus.

> And would not some or all of the examples I give above be governed by such criteria?

I think your examples were seeking formal logical criteria. But my point is that writing systems and scripts are both holistic systems and fuzzy around the edges. The best way to find them is not to seek formal logical criteria, but instead to find *experts* who know them and ask them to point them out. If I am a novice wandering through a new forest, and need to tell the trees in the forest apart (as opposed to the forest from the trees :-) ), it is much easier *and* more accurate to get an expert to tell me, "That's a madrone, that's a bay laurel, that's a coastal live oak, that's a big leaf maple, ..." than it is to ask the expert (or anyone else) to draw up a foolproof set of taxonomic criteria whereby I can deal with all the edge cases (including the hybrids).

--Ken
Re: ISO 10646, Unicode The FAQ (Bengali Khanda Ta)
Rick investigated, and came up with:

> In a specific case, Andy asked about Khanda Ta, and pointed to a WG2 resolution that contradicts the Unicode FAQ on the same topic. I looked up a paper listing an action item as follows, taken from document http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/M40ActionItems.pdf, which contains the action items from meeting #40 of WG2; the decision was from meeting #39 in October 2000:
>
> > Resolution M39.11 (Request from Bangladesh): In response to the request from Bangladesh Standards and Testing Institution in document N2261 for adding KHANDATA character to 10646, WG2 instructs its convener to communicate to the BSTI: a. that the requested character can be encoded in 10646 using the following combining sequence: Bengali TA (U+09A4) + Bengali Virama (U+09CD) + ZWNJ (U+200C) + Following Character(s), to be able to separate the KHANDATA from forming a conjunct with the Following Character(s). Therefore, their proposal is not accepted. b. our understanding that BDS 1520:2000 completely replaces BDS 1520:1997.
>
> That does indeed give a different answer than the Unicode FAQ. I wonder if anyone else knows whether the text of 10646 contains any mention of Khanda Ta, and if so, what it says.

It does not mention Khanda Ta. And I guess it's time to open that old CBS (character BS) mailbag to track this sucker down.

Resolution M39.11 dates from the WG2 discussion of September 20, 2000 (at the WG2 meeting in Vouliagmeni, Greece). It was agenda item 7.12 at that meeting, "Proposal to synchronize Bengali standard with 10646", during which the question came up about what this KHANDATA thing in the Bengali BDS 1520:2000 standard is anyway, and whether it should be encoded as a separate character, as it was (at code point 0xBA) in BDS 1520:2000. For details of the discussion, see the WG2 meeting minutes, online in WG2 N2253.
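The combining sequence recommended in Resolution M39.11 can be written out explicitly; a minimal Python sketch (U+09AE BENGALI LETTER MA is used here just as an arbitrary following character):

```python
import unicodedata

# Khanda Ta per WG2 Resolution M39.11:
#   TA + VIRAMA + ZWNJ, then the following character(s)
khanda_ta = "\u09a4\u09cd\u200c"

assert [unicodedata.name(c) for c in khanda_ta] == [
    "BENGALI LETTER TA",
    "BENGALI SIGN VIRAMA",
    "ZERO WIDTH NON-JOINER",
]

# With a following MA, the ZWNJ blocks the Ta/Ma conjunct, so a
# renderer shows khanda ta followed by ma; without it, the plain
# TA + VIRAMA + MA sequence forms the conjunct.
with_khanda = khanda_ta + "\u09ae"
conjunct = "\u09a4\u09cd\u09ae"
assert with_khanda != conjunct
```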
The upshot of the initial discussion was that Michael Everson was tasked with an action item, to wit: "Michael Everson to contact BSTI (email id, name etc. are in the cover letter) - a query was sent out to Unicode expert's list also."

The response received to the query to the Unicode list on September 20, from a Mr. Abdul Malik, seemed to answer the question of what the KHANDATA was. Anyone who wants to can dig it out of the Unicode email archives: X-UML-Sequence: 16066 (2000-09-20 16:22:21 GMT). But the relevant portions of the email were:

> ----- Original Message -----
> From: Michael Everson [EMAIL PROTECTED]
> To: Unicode List [EMAIL PROTECTED]
> Sent: Wednesday, September 20, 2000 10:30 AM
> Subject: Request about Bengali/Bangla
>
> > BDS 1520:2000 contains a BANGLA LETTER KHANDATA and it has been proposed for addition to the UCS. I am at the WG2 meetings in Athens where the character is being discussed, but we don't know how to evaluate it.
>
> A representative of the Bangladesh Standards and Testing Institution (the instigator of the proposal) should be better placed to answer these questions than me, anyway...
>
> > What is this character and how is it used?
>
> KhandaTa is a form of the letter Ta. It is the form Ta takes when it has no inherent vowel. It occurs when final and medial, but never as the initial letter of a word. It is equivalent to Ta + virama. Ta with a visible virama is only needed for illustrative purposes, KhandaTa being used in its place in all Bengali words, except when it forms a conjunct form. For example, in a standard without KhandaTa, there are two different forms the sequence Ta Virama Ma needs to take, i.e. KhandaTa_Ma or the Ta/Ma_conjunct_form. As BDS 1520:2000 does not include any ligation control characters other than Virama, it is necessary to include KhandaTa as a separate letter to make the two previously mentioned forms.
>
> > Another question is, does BDS 1520:2000 completely replace BDS 1520:1997, or is the old standard still valid (and being implemented)?
> BDS 1520:1997 is based on a font encoding. It is the standard currently used in the products of Proshika Computer Systems and AdarshaBangla Technologies Inc. It is also the encoding used in many web sites. BDS 1520:2000 is a complete replacement, being based on the ISO/IEC 10646 character encoding model. AFAIK it is yet to receive a real-world implementation.
>
> BDS 1520:2000 seems immature, as it does not include any encoding principles or rendering rules; for example, how is Bengali zophola to be formed? Is it formed from Ya or YYa?
>
> > What are the implications for interoperability between this standard and ISCII standards?
>
> As BDS 1520 does not currently have an encoding model to refer to, one can not say. E.g., to form Ka_halant Ka:
>
>   in Unicode: Ka Virama ZWNJ Ka
>   in ISCII:   Ka Virama Virama Ka
>   in BDS:     ??
>
> Regards
> Abdul

It was on the basis of *this* feedback from a Bengali expert on the Unicode list, reported back by Michael Everson to the WG2 meeting, that WG2 drafted a resolution responding to the request by BSTI expressed in
Re: Lowercase numerals
Doug Ewell answered. Thomas Lotze (thomas dot lotze at uni dash jena dot de) had written:

> Why is it that while there are both uppercase and lowercase roman numerals in the Unicode character set (in the Number Forms range), no lowercase arabic numerals (old-style or text figures) are encoded? If they are regarded as presentation forms of the uppercase numerals (in the Basic Latin range), why is this not the case for their roman counterparts?

Doug's answer:

> Because oldstyle numerals aren't really "lowercase" in the same sense as small letters (though some typographers think of them that way; see [1]). They're just glyph variants of the uniform-height "lining" numerals, so yeah, it's a character-glyph thing.

And to complete the answer for Thomas, the Roman numerals are based on Latin letters, which *do* have upper/lowercase distinctions, unlike digits. The compatibility Roman numerals in the Unicode Standard (U+2160..U+217F) are derived from East Asian standards which separately encoded upper- and lowercase forms, so they would have been required to be separately encoded just for compatibility anyway.

--Ken
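The compatibility status and letter-like casing of the encoded Roman numerals are both visible in the Unicode Character Database; a quick Python check via the standard unicodedata module:

```python
import unicodedata

# U+2160 ROMAN NUMERAL ONE and U+2170 SMALL ROMAN NUMERAL ONE are
# compatibility characters that fold to plain Latin letters under NFKC.
assert unicodedata.normalize("NFKC", "\u2160") == "I"
assert unicodedata.normalize("NFKC", "\u2170") == "i"

# Because they are letter-based, they carry a real case mapping...
assert "\u2160".lower() == "\u2170"

# ...whereas ASCII digits have no case at all.
assert "7".lower() == "7" == "7".upper()
```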
Re: mixed-script writing systems
Andrew West wrote:

> On Mon, 18 Nov 2002 02:34:18 -0800 (PST), Kenneth Whistler wrote:
>
> > In point of fact, people for centuries have been borrowing back and forth between Latin, Greek, and Cyrillic in particular, so that in some respects LGC is a kind of metascript and should be treated as such.
>
> Latin, Greek, Cyrillic and Runic even (cf. Latin letters Thorn and Wynn).

Point taken. And don't forget Old Italic, which is now encoded as well.

> Gothic is a good example of a mixed-script writing system,

Not really -- a good example, that is.

> composed of a mixture of Latin, Greek and Runic letters. There is a Gothicness about the graphic forms of the glyphs of the Gothic alphabet, but IMHO this variation from standard (but what is "standard" in 4th century terms?) Latin, Greek and Runic letters should be dealt with at the font level.

It isn't particularly helpful to go there, since Gothic doesn't fit all that well as merely a font variant of Latin or Greek or Runic. Certainly it *could* be done that way, but for this particular case the committees were convinced that simply laying out Gothic as a distinct script was more practical. As it stands now, the Gothic bible can be correctly and unambiguously represented in Unicode, using the Gothic script as defined. Not to have encoded the Gothic script would have left us still arguing about which letters from which script to use and how Gothic fonts should be encoded.

> Nevertheless, Gothic has been encoded in Unicode, and this may provide an unwelcome precedent for encoding other mixed-script writing systems.

What you are getting at is the complicated problem of sorting out all the historical connections between various related alphabets and trying to sift them into categories which make sense as scripts, and categories which are simply font variants within a script. For modern scripts this is less of a problem, since we have modern practice and typography to rely on to help make the distinctions.
For *historic* scripts, on the other hand, it is murkier. Old Italic is a good case in point. It *could* have been treated as another archaic outlier of Greek. The problem with that is that it would have added a few more archaic letters which never show up in modern Greek fonts, and it would have forced distinct archaic fonts to be able to represent Old Italic text reliably. Old Italic texts don't get rendered with a modern Greek font -- it would look ridiculous. Because of this usage pattern, it made sense to the committees to coalesce the various southern Old Italic alphabets (Oscan, Umbrian, Messapian, etc.) into a script which would incorporate all the required letters for those alphabets, as *opposed* to Latin or to Greek per se. It is likely that a similar decision will be taken in the future to account for the Alpine alphabets of northern Italy, which are intermediate between Italic and Runic alphabets.

What it comes down to is the fact that for historic scripts in particular, there are no defined criteria that would enable us to simply *discover* the right answer regarding the identity of scripts. To a certain extent, the encoding committees need to make arbitrary partitions of historic alphabets through time and space, reflecting scholarly praxis as far as feasible, and then live with the results. At least this procedure makes it *possible* to represent the texts reliably, once the scripts and their variants have been standardized.

> What about the now-defunct Zhuang alphabet (used between 1955 and 1981 in PRC) that was composed of a cumbersome mixture of Latin, Cyrillic and IPA letters? Should the letters of this alphabet be encoded separately in a Zhuang block,

Check the standard:

  U+0185 LATIN SMALL LETTER TONE SIX
  U+019C LATIN CAPITAL LETTER TURNED M
  U+01A8 LATIN SMALL LETTER TONE TWO
  etc.

This issue was decided already in 1989.
> or is it simply the fact that the borrowed letters do not exhibit any distinctive Zhuangness in their graphic form that precludes their being encoded separately in the same way that Gothic is? (Or is it perhaps a Eurocentric bias in Unicode?)

It is getting rather tiresome to have "Eurocentric bias" brandished as a disparagement of an encoding standard, 87% of whose content consists of Han or Hangul characters, and whose maintaining committees are busy finalizing the addition of Limbu, Tai Le, Osmanya, Ugaritic Cuneiform, and Linear B. The UTC met just last week, and voted to start the process of adding the Kharoshthi script. Yeah, definitely a Eurocentric bias detectable there in that collection of additions.

--Ken
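The already-encoded status of the Zhuang tone letters cited above can be confirmed directly from the character names in the UCD; a small Python check:

```python
import unicodedata

# The 1955-1981 Zhuang tone letters were encoded in the Latin
# Extended-B block as Latin letters, not in a separate Zhuang block.
assert unicodedata.name("\u0185") == "LATIN SMALL LETTER TONE SIX"
assert unicodedata.name("\u019c") == "LATIN CAPITAL LETTER TURNED M"
assert unicodedata.name("\u01a8") == "LATIN SMALL LETTER TONE TWO"
```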
Re: The result of the Plane 14 tag characters review
James Kass said:

> How do these differences apply to Unicode plain text and the Plane 14 tags? For example, it was noted that the ideographic full stop is centered in Chinese text but sits on the baseline (and isn't centered) in Japanese text.

This claim about ideographic periods is untrue. Chinese typography uses both conventions. Older, traditional typography (but still already Western-adapted in using horizontal layout) uses the centered ideographic full stops (e.g., a 1971 dictionary published in Taipei). Modern typography uses the baseline, left-set ideographic full stops (e.g., a 1997 simplified Chinese dictionary published in Beijing, or a 2002 simplified Chinese newspaper published in Burlingame, California!). It is a matter of typographic style and historic period, *not* of language.

*Really* traditional classical Chinese text doesn't use an ideographic full stop at all. Typical material might be set vertically, with left sidelining serving the highlighting function that bolding or italics would do in Latin text, and with furigana-style punctuation dropped in annotationally on the right side of the vertical lines of text. [Just to make things difficult, *that* Chinese, while still Chinese, is clearly a distinct language from modern (Mandarin) Chinese, as distinct from it as Chaucer's English is from modern (American) English.]

> Without a plain text method of distinguishing the writing system for a run of text, a plain text file wouldn't be able to be correctly displayed if it had both Japanese and Chinese text.

Of course it would. Go to any Japanese newspaper. There is no required change of typographic style when Chinese names and placenames are mentioned in the context of Japanese articles about China. Go to any Chinese newspaper. There is no required change of typographic style when Japanese names and placenames are mentioned in the context of Chinese articles about Japan.
This is completely comparable to the fact that my local English-language newspaper doesn't need a German language tag to write "Gerhard Schroeder".

--Ken
Re: The result of the Plane 14 tag characters review
Michael Everson asked:

> At 13:37 -0800 2002-11-18, Kenneth Whistler wrote:
>
> > Go to any Japanese newspaper. There is no required change of typographic style when Chinese names and placenames are mentioned in the context of Japanese articles about China. Go to any Chinese newspaper. There is no required change of typographic style when Japanese names and placenames are mentioned in the context of Chinese articles about Japan.
>
> Just to be sure: this means that a Japanese newspaper uses the glyphs its readers prefer for Chinese names, not the glyphs which Chinese readers may prefer?

Yes. For obvious reasons.

> Does this extend to the Simplified/Traditional instance, so that if a Chinese name has the word for "horse" in it, it uses the Japanese glyph for "horse", not either the S or T version of the glyph (assuming for the sake of argument that both occur and that both are different from the preferred Japanese glyph)?

Yes. Example: the one-time president of the ROC, known in English as Chiang Kai-shek, has a surname which shows several variants. Traditional Chinese: U+8523. Simplified Chinese: U+848B. Japanese prefers a different, traditional simplification of the glyph for U+848B. You can see the difference in the Unicode 3.0 book charts if you look up U+848B in the charts (p. 693), and then look up the corresponding 0x8FD3 in the Shift-JIS index (p. 931).

In a Japanese newspaper, the Japanese style of U+848B will be present in the font. If the source is a simplified Chinese rendition of Chiang Kai-shek, then the Japanese presentation will simply be the same character, Japanese style. If the source were a traditional Chinese rendition, then the Japanese presentation would also represent a respelling of the name from U+8523 to U+848B (comparable to Schröder -> Schroeder) to get it to use a character for which the appropriate Japanese presentational form is available.
In any case, once the correct spelling is settled on, there is no *stylistic* variation from the rest of the text for the Chinese name embedded in Japanese text. It is clearly recognized in text as an alien, i.e., non-Japanese name, and no attempt would be made to give it a Japanese name reading, but that is merely by virtue of the reader's recognition that U+848B, U+4ECB, U+77F3 is a famous Chinese person -- and it would be sounded out as Shoo Kaiseki (not *Makomo Sukeishi or some other putative Japanese name). --Ken
Re: The result of the Plane 14 tag characters review
This is completely comparable to the fact that my local English-language newspaper doesn't need a German language tag to write Gerhard Schroeder. How about a multilingual newspaper? What of a multilingual newspaper? Take a hypothetical instance of a German/English newspaper, which presented all the news twice -- once in German, and again in English. So the German side says, for example: Nach einem 19 Monate dauernden Stillstand im Nahost-Friedensprozeß und einem zähen achttägigen Verhandlungsmarathon bei Washington haben sich Israels Ministerpräsident Netanjahu und der Vorsitzende der palästinensischen Autonomiebehörde, Arafat, in einer langen Sitzung in der Nacht zum Freitag auf ein Interimsabkommen über „Land für Sicherheit“ geeinigt... Then the English side would say: After a 19 month pause in the Middle East peace process... etc. In such a case, it would make sense to tag the *entire* German text as German, and the *entire* English text as English (and it would probably be done in terms of markup in any case). But it would make no particular sense to start digging into the material and tagging Washington as English (although it is), Israel and Netanjahu as Hebrew (although they are), and Arafat as Arabic (although it is). Embedded quotations of untranslated material, if they occur, perhaps. Well, Chinese and Japanese work the same way. You do whatever adaptation of the names is required for your local language, and then you present them as expected to the reader of *that* language. So, in the above example, Netanjahu for the German reader, Netanyahu for the English reader -- but in neither case presented in the original Hebrew. (In fact, for German, you will also commonly find it spelled Netanyahu -- but you won't find it in Hebrew.) --Ken
Re: mixed-script writing systems
So, the question is this: Should we say that this writing system is completely Latin (keeping the norm that orthographic writing systems use a single script) and apply the principle of unification -- across languages but not across scripts -- to imply that we need to encode new characters, Latin delta, Latin theta and Latin yeru? Or, do we say that this writing system is only *mostly* Latin-based, and that it mixes in a few characters from other scripts? If everyone can hold off on the Kurdish rhetoric for the moment, it should be clear that such mixed orthographies as Peter has shown in Wakhi are best handled by simply using the characters that are already encoded, rather than cloning more and more characters into Latin, Greek, and Cyrillic to deal with the artificial constraint that would claim that any LGC-based alphabet *must* consist only of a single script. In point of fact, people for centuries have been borrowing back and forth between Latin, Greek, and Cyrillic in particular, so that in some respects LGC is a kind of metascript and should be treated as such. Note that we will run across many other examples of such cross-script LGC letter borrowings in various oddball orthographies. One I happen to know about is the publication by Morris Swadesh of extensive texts of Wakashan languages using Cyrillic che (U+0447) in the midst of otherwise Latin letters for what most Americanists would currently use Latin c-hacek (U+010D) instead. It isn't doing anyone any favors to keep cloning such cross-script borrowings into the character encoding standard, *unless* there is strong evidence of script-specific adaptation of the letters after their borrowing. The handling of Latin Q in the otherwise Cyrillic Kurdish alphabet is what makes it the marginal case it is and argues for encoding of a separate Cyrillic Q. 
I do not, however, believe that such arguments apply to cases such as this Wakhi instance, unless Peter or someone else can demonstrate specific Latin-scriptification of the borrowed letters in the orthography. --Ken
Re: The result of the Plane 14 tag characters review.
William Overington asked: As the Unicode Consortium invited public comments on the possible deprecation of plane 14 tag characters, will the Unicode Consortium be making a prompt public statement of the result of the review as soon as the present meeting of the Unicode Technical Committee is completed, or even earlier if the decision of the Unicode Technical Committee has already been finalized? *the Unicode Consortium spokesman steps up to the press conference podium* *the press surges forward eagerly* *flashbulbs start to pop* Ahem... The Unicode Technical Committee would like to announce that no formal decision has been taken regarding the deprecation of Plane 14 language tag characters. The period for public review of this issue will be extended until February 14, 2003. *hands are waved vigorously* *microphones are shoved forward with loud questions* I'm sorry... No..., No..., I have no further response at this time. *the Unicode Consortium spokesman retires hurriedly, followed closely by two burly bodyguards*
Re: In defense of Plane 14 language tags (long)
David Hopwood said: Note that if deprecation implies no longer treating these characters as ignorables, It would not. The only character *property* implication that deprecation of Plane 14 language tags (or any other characters) would have is the requirement that they gain the Deprecated property. (See PropList.txt in the Unicode Character Database.) then that causes new software that sees existing data using plane 14 tags to break (to some extent; probably not fatally). OTOH, if deprecation does not imply treating plane 14 tags as ignorables, then nothing is gained: the complexity of filtering is still there, but the characters can't actually be used. Deprecation in the Unicode Standard does not mean that characters cannot actually be used. In fact, many generic implementations, such as low-level libraries which report character properties, will continue to implement them, precisely because higher-level processes will need to know that the code points in question *are* deprecated (along with whatever other properties they may have). What deprecation in the Unicode Standard means, basically, is that a particular character or set of characters is noted as a horrible encoding mistake, and that any implementer in their right mind would look to use the suggested alternatives as a better way to approach whatever misguided goal the deprecated characters were originally intended to achieve. As Asmus put it: Since we can't remove them, we would deprecate them, so that countless legions of implementers can forget worrying about a feature deemed desirable but never put into practice. --Ken P.S. I have to agree with John Hudson, Asmus, and others that the issue is not about the usefulness of language tagging per se, but whether Plane 14 language tag characters themselves, as currently defined, are an appropriate mechanism for indicating language tags in Unicode (supposedly) plain text. 
Doug's contribution would be more convincing if it dropped the irrelevancies about whether the *function* of language tagging is useful and focused completely on the appropriateness of this *particular* set of characters on Plane 14, as opposed to any other means of conveying the same distinctions.
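For the record, the Deprecated property lives in a plain semicolon-delimited file in the Unicode Character Database. Here is a small sketch of how a low-level property library might load it; the SAMPLE lines follow the PropList.txt format but are chosen purely for illustration, and nothing below asserts which characters were deprecated when.

```python
# Sketch: load code-point ranges for a given property from lines in
# the PropList.txt format ("XXXX..YYYY ; PropName # comment").
def parse_proplist(lines, prop):
    """Yield (first, last) code point ranges carrying the property."""
    for line in lines:
        line = line.split("#", 1)[0].strip()   # strip trailing comment
        if not line:
            continue
        rng, name = (field.strip() for field in line.split(";"))
        if name != prop:
            continue
        lo, _, hi = rng.partition("..")
        yield int(lo, 16), int(hi or lo, 16)

SAMPLE = [
    "0340..0341    ; Deprecated # Mn   [2] COMBINING GRAVE TONE MARK..",
    "17A3          ; Deprecated # Lo       KHMER INDEPENDENT VOWEL QAQ",
    "0009..000D    ; White_Space # Cc  [5] <control>..<control>",
]
ranges = list(parse_proplist(SAMPLE, "Deprecated"))
# ranges == [(0x0340, 0x0341), (0x17A3, 0x17A3)]
```

A generic library built this way keeps reporting the property for deprecated code points, which is exactly the behavior described above: higher-level processes need to *know* the characters are deprecated, not pretend they do not exist.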
Re: Names for UTF-8 with and without BOM
Perhaps it is time to think of three other words starting with B, O, M that make a better explanation.) Bollixed Operational Muddle ;-) --Ken
RE: New Charakter Proposal
Dominikus Scherkl replied to Markus: My other suggestion (and the main reason to call the proposed character "source failure indicator symbol" (SFIS)) was intended especially for malformed UTF-8 input that contains overlong encodings. This is a special, custom form of error handling - why assign a character for it? Converting from and to UTF-8 is an everyday topic, very important for all applications handling Unicode. So it is a special case, but a very common one. Therefore it would be nice to have a standardized - application-independent - error handling for it. Also, it is a mechanism useful for many other charsets being converted to Unicode. I've got to agree with Markus here. Among other things, encoding a character which means "conversion failure occurred here" and then embedding it in converted text is just a generic and not very informative way of *representing* a conversion failure. The actual error handling would still end up being up to the application, every bit as much as what an application does today with a U+FFFD in Unicode text is application-specific. Adding this kind of character would then also complicate the task of people trying to figure out how to write converters, since they would then be scratching their heads to distinguish between cases which warrant use of U+FFFD and those which warrant this new SFIS instead. Maybe the distinction seems clear to you, but I suspect that in practice people would become confused about the distinctions, and there would be troubling edge cases. In the particular case of UTF-8, I would consider such a mechanism nothing more than an attempted end run around the tightened definition of UTF-8. It provides another path whereby ill-formed UTF-8 could get converted and then end up being interpreted by some process that doesn't know the difference. In other words, it carries the risk of reintroducing the security issue that we've been trying to get legislated away, by finding a way to make it o.k. to interpret non-shortest UTF-8.
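To make the tightened definition concrete, here is a minimal Python sketch (an illustration of the point, not anything posted in the thread) of how a conforming converter treats the classic overlong sequence C0 AF for U+002F SOLIDUS: strict decoding must reject it, and a lenient conversion substitutes U+FFFD rather than quietly interpreting the non-shortest form.

```python
# Sketch: C0 AF is a non-shortest-form (overlong) encoding of "/".
# A conforming UTF-8 decoder must reject it; lenient conversion
# substitutes U+FFFD REPLACEMENT CHARACTER for each ill-formed byte
# instead of ever producing the solidus.
overlong_slash = b"\xc0\xaf"

def convert(data):
    try:
        return data.decode("utf-8")                    # strict: rejects C0 AF
    except UnicodeDecodeError:
        return data.decode("utf-8", errors="replace")  # U+FFFD substitution

result = convert(overlong_slash)
# result == "\ufffd\ufffd"; the "/" never reaches the consumer
```

This is exactly why a conversion-failure character buys nothing over U+FFFD: the substitution already marks the damage, and the security property depends on the "/" never being reconstructed, not on which replacement symbol appears.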
You could just use an existing character or noncharacter for this, e.g., U+303E or U+ or U+FDEF or similar. This is what I do in the meantime. But it's uncomfortable, because most editors display all noncharacters, unassigned characters, or characters not in the font the same way - which hides the INDICATION. The SFIS should be displayed, to remind the reader that only THIS is an SFIS, unlike all the other empty squares in the text. Your suggested encoding U+FFF8 wouldn't work this way, by the way. U+FFF8 is reserved for format control characters -- and those characters display *invisibly* by default -- not as an empty square (or other fallback glyph) like miscellaneous symbols which happen not to be in your fonts. I think Markus's suggestion is correct. If you want to do something like this internally to a process, use a noncharacter code point for it. If you want to have visible display of this kind of error handling for conversion, then simply declare a convention for the use of an already existing character. My suggestion would be: U+2620. ;-) Then get people to share your convention. I'm not intending to be facetious here, by the way. One problem that character encoding runs into is that there are plenty of people with good ideas for encoding meanings or functions, and those ideas can end up turning into requests to encode some invented character just for that meaning or function. For example, I might decide that it was a good idea to have a symbol by which I could mark a following date string as indicating a death date -- that would be handy for bibliographies and other reference works. Now I could come to the Unicode Consortium and ask for encoding of U+ DEATH DATE SYMBOL, or I could instead discover that U+2020 DAGGER is already used in that meaning by some conventions. There are *plenty* of symbol characters available in Unicode -- way more than in any other character encoding standard.
And it is a much lighter-weight process to establish a convention for use of an existing symbol character than it is to encode a new character specifically for that meaning/function and then force everyone to implement it as a new character. Additionally, I think we should have a standardized way to display old UTF-8 text without losing information (overlong UTF-8 was allowed for years). Not really. And in any case, there is nothing to be gained here by displaying old UTF-8 text without losing information. The way to deal with that is to *filter* it into legal UTF-8 text, by means of an explicit process designed to recover what would otherwise be rejected as illegal data. - Glyphing is not a fine way, and simply decoding the overlong forms is not allowed. This is a self-made problem, so Unicode should provide an inherent way to solve it. There are plenty of ways to solve these things -- by API design or by specialized conversions designed to deal with otherwise unrepresentable data. But trying to bake conversion
RE: Character identities
Michael asked: My eyes have glazed over reading this discussion. What am I being asked to agree with? Here's the executive summary for those without the time to plow through the longer exchange: Marco: It is o.k. (in a German-specific context) to display an umlaut as a macron (or a tilde, or a little e above), since that is what Germans do. Kent: It is *not* o.k. -- that constitutes changing a character. [Sorry, guys, if I have ridden roughshod over the nuances... ;-)] Michael, you might have to recuse yourself, however, since when it was suggested that displaying Devanagari characters with snowpeaked glyphs for a Nepali hiking company would be o.k., you misunderstood and suggested private use characters! --Ken
Re: Character identities
Hm, what if I want to make, say, snow-capped Devanagari glyphs for my hiking company in Nepal? Shouldn't I assign them to Unicode code points? That's what Private Use code positions are for. -- Michael Everson * * Everson Typography * * http://www.evertype.com Um, Michael, I think António was talking about glyphs in a decorative font, which should -- clearly -- just be mapped to ordinary Unicode characters, via an ordinary Unicode cmap. Or do you think that the yellow, cursive, drop-shadowed, 3-D letters Getaway! at: http://www.trekking-in-nepal.com/ should also be represented by Private Use code positions? ;-) --Ken
Re: Origin of the term i18n
Raymond Mercier asked: Isn't i18n rather off-list ? Neither Sarasvati nor the self-styled list police have objected. While historical origin discussions are OT, they do seem to have an interested following on the Unicode list. Perhaps more to the point, Unicode implementations are all about i18n (or internationalization -- however you want to spell it). And the UTC and L2 committees consider internationalization to be a part of their overall area of concern. And the Unicode conferences definitely cover internationalization issues -- and even some of the details of localization. Is this the same list where people objected to the endless arguments with William Overington ? Yep. But at least nobody on this thread -- to date -- has claimed a new invention, proposed to encode i18n in user space, or proposed lyrics about it to be posted in their family webspace. --Ken ;-)
Re: Origin of the term i18n
Sorry to appear the curmudgeon, but ^^ recte: c8n --K1n
Re: Origin of the term i18n
Mark, Mark, I am curious why you find this term so distasteful? Is it the algorithm itself or just a general objection to acronyms and the like? Or something else entirely? I find this particular way of forming abbreviations particularly ugly and obscure. It is also usually unnecessary; looking at any of the messages brought up by Google, the percentage of 'saved' keystrokes is a very small proportion of the total count. And when it leaks out into the general programmer community, it just looks odd. For me, it is on the same order as using nite for night, or cpy for copy. u shuld just be glad u wont live to see the day when netspeak roolz and ur goofy language is rOxXoRed! --K1n
Re: Historians- what is origin of i18n, l10n, etc.?
W0e n3r u2d t1e g1d-a3l, g3y a1d o5e a10n i18n, h5r! What I don't understand, since these a10n's are in such widespread use among programmers and character encoders, is why they don't use h9l, as in i12n, lan, and gbn? --K1n BTW, these aan's are not only o5e, they are also o4e, but unfortunately, not o6e in use.
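For anyone who missed the trick, the a10n scheme being lampooned above is purely mechanical: first letter, count of interior letters, last letter. A throwaway sketch (mine, not anyone's on the thread):

```python
# Sketch: the numeronym rule behind i18n, l10n, and even c8n.
def numeronym(word):
    if len(word) <= 3:          # too short to compress
        return word
    return word[0] + str(len(word) - 2) + word[-1]

print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
print(numeronym("abbreviation"))          # a10n
print(numeronym("curmudgeon"))            # c8n
```

Which also demonstrates the objection: the mapping is lossy and not invertible, so u still need a chart to read it. ;-)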
Re: ISO 8859-11 (Thai) cross-mapping table
Elliotte Harold asked: The Unicode data files at http://www.unicode.org/Public/MAPPINGS/ISO8859/ do not include a mapping for ISO-8859-11, Thai. Is there any particular reason for this? Just that nobody got around to submitting and posting one. Since there was a lot of discussion about this over the weekend, I took it upon myself to create and post one in the same format as the other ISO8859 tables. Let me know if anybody spots any problems in the table -- but it really is pretty straightforward, as others noted: TIS 620-2533 (1990) with one addition: 0xA0 NO-BREAK SPACE. Doug dug out: These 9 code positions (0xA0, 0xDB..0xDE, 0xFC..0xFF) appear to be undefined in TIS 620.2533. Reference [3] below does show a word separator character at 0xDC, which I interpret as U+200B ZERO WIDTH SPACE, but the other positions are still undefined. Reference [3] is online Tru64 Unix documentation about its Thai support, which claims that: - No-break space. The character code is A0. ... - Word separator. The word separator defined in TIS 620-2533. This despite the fact that the table shown has no no-break space at 0xA0 (and TIS 620-2533 (1990) does not have one), and that 0xDC is undefined in TIS 620-2533, even though the table in the Tru64 Unix documentation shows a word separator there. The table is labelled the TACTIS Codeset for Thai API Consortium/Thai Industrial Standard. I surmise that this is some vendor extension to the actual TIS 620-2533 (1990). The actual standard states clearly (in Thai) that 0x80..0xA0, 0xDB..0xDE, and 0xFC..0xFF are reserved (unassigned), and the tables in the standard match that. So there may be some implementation practice that uses 0xDC for U+200B ZERO WIDTH SPACE in Thai code pages, but that is not part of either TIS 620-2533 (1990) or ISO 8859-11:2001. --Ken
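The mapping itself is simple enough to state as a rule. A sketch (my own restatement of the table just described, not the posted file): bytes below 0xA1 pass through unchanged (which covers ASCII, the C1 range, and the one addition, 0xA0 NO-BREAK SPACE), each assigned Thai byte B maps to U+0E01 + (B - 0xA1), and 0xDB..0xDE and 0xFC..0xFF remain unassigned.

```python
# Sketch: ISO 8859-11 is TIS 620-2533 (1990) plus NO-BREAK SPACE at 0xA0.
UNASSIGNED = set(range(0xDB, 0xDF)) | set(range(0xFC, 0x100))

def iso8859_11_to_unicode(byte):
    """Map one ISO 8859-11 byte to a character, or None if unassigned."""
    if byte in UNASSIGNED:
        return None
    if byte < 0xA1:                     # ASCII, C1 controls, and 0xA0 NBSP
        return chr(byte)
    return chr(0x0E01 + (byte - 0xA1))  # Thai block starts at U+0E01 KO KAI
```

The offset rule holds across the gap: 0xDF lands on U+0E3F THAI CURRENCY SYMBOL BAHT and 0xFB on U+0E5B KHOMUT, the last assigned position.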
Re: Sporadic Unicode revisited
Keld responded: On Wed, Oct 02, 2002 at 02:47:42PM -0400, John Cowan wrote: Mark Davis scripsit: Those mnemonics in (http://www.faqs.org/rfcs/rfc1345.html) are pretty useless in practice, as well as being misnamed. From Webster's: assisting or intended to assist memory. So what about the combination ;S is supposed to aid or assist memory in coming up with U+02BF MODIFIER LETTER LEFT HALF RING? Beats me. ; in many (though not all) mnemonics means ogonek, so its presence here is reasonable, considering that this character (which appears only in ISO-IR-158) is the original High Ogonek. Since ISO-IR-158 is for Saami, perhaps S stands for Saami. Writing S; would erroneously suggest S with ogonek. Well, the S stands for superscript; s here would mean subscript. Or shade, as in:
   .S  2591  LIGHT SHADE
   :S  2592  MEDIUM SHADE
   ?S  2593  DARK SHADE
Or space, as in:
   BS  0008  BACKSPACE (BS)
   SP  0020  SPACE
   IS  3000  IDEOGRAPHIC SPACE
   NS  00a0  NO-BREAK SPACE
(not to be confused with:
   nS  207f  SUPERSCRIPT LATIN SMALL LETTER N)
Or spade, as in:
   cS  2660  BLACK SPADE SUIT
   cS- 2664  WHITE SPADE SUIT
Or Z, as in:
   DS  0405  CYRILLIC CAPITAL LETTER DZE (Macedonian)
Or selected, as in:
   ES  0087  END OF SELECTED AREA (ESA)
   SA  0086  START OF SELECTED AREA (SSA)
Or separator, as in:
   FS  001c  FILE SEPARATOR (IS4)
   GS  001d  GROUP SEPARATOR (IS3)
   RS  001e  RECORD SEPARATOR (IS2)
   US  001f  UNIT SEPARATOR (IS1)
Or square, as in:
   fS  25a0  BLACK SQUARE
   OS  25a1  WHITE SQUARE
   SR  25ac  BLACK RECTANGLE
Or set, as in:
   HS  0088  CHARACTER TABULATION SET (HTS)
   VS  008a  LINE TABULATION SET (VTS)
Or standard, as in:
   KSC 327f  KOREAN STANDARD SYMBOL
Or start, or string, as in:
   SS  0098  START OF STRING (SOS)
   ST  009c  STRING TERMINATOR (ST)
   SX  0002  START OF TEXT (STX)
   SG  0096  START OF GUARDED AREA (SPA)
   SH  0001  START OF HEADING (SOH)
Or substitute, as in:
   SB  001a  SUBSTITUTE (SUB)
Or synchronous, as in:
   SY  0016  SYNCHRONOUS IDLE (SYN)
Or state, as in:
   TS  0093  SET TRANSMIT STATE (STS)
Or shift, as in:
   SI  000f  SHIFT IN (SI)
   SO  000e  SHIFT OUT (SO)
Or single, as in:
   SC  009a  SINGLE CHARACTER INTRODUCER (SCI)
Or sun, as in:
   SU  263c  WHITE SUN WITH RAYS
Or section, as in:
   SE  00a7  SECTION SIGN
Or service, as in:
   SM  2120  SERVICE MARK
Or something-or-other (or spot?), as in:
   Sb  2219  BULLET OPERATOR
   Sn  25d8  INVERSE BULLET
{Excuse me if I tend to confuse those two with antimony and tin, respectively, creating a mnemonic antinomy.} Or, of course, S:
   S   0053  LATIN CAPITAL LETTER S
The wondrous thing about this set of mnemonic symbols is that you need a mnemonic system to remember all the mnemonics. --Ken
Re: Sporadic Unicode revisited
John Cowan responded to Rick: (BTW, I agree with Mark about those ISO 14755 [recte: RFC 1345] abbreviations... They aren't very mnemonic. Many people have the charts available, so there is no great advantage to using mnemonics over simply using numbers or palettes.) They are easy to type, and what is more, easy to proofread. (This is the same argument I just made defending the ISO/SGML named character entities.) I agree that *some* of the ideas behind the mnemonics in RFC 1345 make sense. The idea of typing a' for a-acute, for example, is quite widespread, and useful in some circumstances. But RFC 1345 is so full of flaws as a system that it just falls in on itself. By insisting on only using the portable character set instead of ASCII, it can't do the obvious for grave, circumflex, and tilde accents, for example, so you get:
   a!  00e0  LATIN SMALL LETTER A WITH GRAVE
   a>  00e2  LATIN SMALL LETTER A WITH CIRCUMFLEX
   a?  00e3  LATIN SMALL LETTER A WITH TILDE
instead of the obvious and widely used: a`, a^, a~. Attempting to extend the system to Greek, Cyrillic, Hebrew, and Arabic just (in my opinion) results in mnemonics that are harder to remember than the character names, even. What is the real advantage of s*, s=, S+ and s+ over sigma, es, samekh and seen for occasional usage? You end up having to look up all those mnemonics in a table anyway, if you actually want to use them. And the system gets even sillier when it is expanded to some arbitrarily defined subset of 10646 symbols and other characters, resulting in ample evidence of the inextensibility of a basically two-letter scheme when attempting to represent a large arbitrary set of things. Combinations like '? are not particularly easier to type than ~ or even tilde, and there are many similar examples. But most of all, in my opinion, the RFC 1345 mnemonics fail a fundamental criterion: a very substantial portion of them are just not *memorable*. --Ken
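To see both the appeal and the flaw side by side, here is a toy expander for a handful of the RFC 1345 mnemonics (a tiny excerpt of my own choosing; the full table in the RFC runs to thousands of entries):

```python
# Sketch: two-character mnemonic lookup in the RFC 1345 style.
# '!' stands in for the grave accent and '>' for the circumflex,
# since those ASCII characters fall outside the RFC's working set.
MNEMONICS = {
    "a'": "\u00e1",   # LATIN SMALL LETTER A WITH ACUTE -- memorable
    "a!": "\u00e0",   # ... WITH GRAVE                  -- less so
    "a>": "\u00e2",   # ... WITH CIRCUMFLEX
    "a?": "\u00e3",   # ... WITH TILDE
    "s*": "\u03c3",   # GREEK SMALL LETTER SIGMA        -- chart required
}

def expand(text):
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in MNEMONICS:
            out.append(MNEMONICS[pair]); i += 2
        else:
            out.append(text[i]); i += 1
    return "".join(out)

print(expand("a'a?s*"))   # áãσ
```

The a' entry sells the idea; the a! and s* entries are the counter-argument, since nothing about them assists memory.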
Re Permission to reproduce?
Martin Kochanski asked: I want to post a Cardbox database on our Web site (Cardbox is the database that we sell) that contains a list of all Unicode characters: hexadecimal code, decimal code, character, and character name (e.g., GREEK CAPITAL LETTER OMEGA WITH TONOS). The first three of these elements are in the public domain, but it strikes me that the character names might be considered to be a literary work and therefore copyright. Does anyone know whether I do in fact need to ask permission before listing those names, and if so, whom I need to ask? In case it wasn't clear from the short discussion that followed, let me state for the record: The character names are a normative part of the Unicode Standard, and are also identically defined as a normative part of the International Standard, ISO/IEC 10646 (English version). They are, indeed, a part of those publicly available standard(s), intended for free, unrestricted use by all users of those standard(s). So you don't need to ask anyone's permission to list or otherwise use those character names. You *would* have to ask permission (from the Unicode Consortium) before reproducing the exact *form* of the Unicode code charts, as printed in the Unicode Standard itself, since the form of the charts and associated name lists printed there *are* under copyright. --Ken
Pound and Lira (was: Re: The Currency Symbol of China)
Marco Cimarosti scripsit: The same should be true for the £ sign. But unluckily, for some obscure reason, Unicode thinks that currencies called pound should have one bar and be encoded with U+00A3, while currencies called lira should have two bars and be encoded with U+20A4. Every character has its own story. Can the old farts^W^Wtribal elders shed any light on this one? Not much. The proximate cause of the inclusion of U+20A4 LIRA SIGN in 10646 was: WG2 N708, 1991-06-14, Table of Replies (to the ballot on 10646 DIS, DIS-1). That document contains the U.S. comments asking for all the additions which would synchronize the DIS repertoire with the Unicode 1.0 repertoire, and that included U+20A4 LIRA SIGN. It is a deeper subject to figure out how the LIRA SIGN got into Unicode 1.0 in the first place, and I don't have all the relevant documents to hand to track it down. It was certainly already in the April 1990 pre-publication draft of Unicode 1.0 which was widely circulated. I do recall the issue of one-bar versus two-bar yen/yuan sign being researched in detail and being explicitly decided. I also recall explicit (and tedious) discussions about the various dollar sign glyphs. I do not, however, recall any time spent in discussing the analogous problem of glyph alternates for the pound/lira sign, although it was probably mentioned in passing. So it is possible that the lira sign simply derives from a draft list that was standardized without anyone ever spending time to debate the pound/lira symbol unification first. It was probably in the same lists that distinguished yen/yuan sign before it was determined that distinguishing those two as a *character* was untenable. Those were heady days. It is generally much easier to track down why something was added post-Unicode 1.0 than it is to figure out how something got into Unicode 1.0 in the first place. 
To quote from a particularly memorable email I sent around on April 4, 1991 about an unrelated mistake that was almost made: The High Ogonek is symptomatic of one of the things wrong about the character standardization business, which encourages the blithe perpetuation of mistaken 'characters' from standard to standard, like code viruses. At least, in the past, the epidemic was constrained by the fact that the encoding bodies only had 256 cells which could get infected by such abominations as half-integral signs. Now, however,... the number of cells available for infection is vast, and the temptation to encode everybody else's junk just seems to have become irresistible... ...I don't think I would be telling any tales out of school if I revealed that Unicode almost got a 'High ogonek', too, since Unicode was busy incorporating all the 10646 mistakes in Unicode while 10646 was busy incorporating all the Unicode mistakes in 10646. ... --Ken
RE: The Currency Symbol of China
Barry Caplan wrote [further morphing this thread]: I also think (but I could be wrong) that ye is not one of the characters in the famous Buddhist poem that uses each of the kana once and only once, and establishes a de facto sorting order by virtue of being the only such poem. OTOH, I am pretty sure that poem is either from or post-dates the Heian era, so it wouldn't rule out your point. In a totally different context, I was looking into this recently and found some stuff the list might find amusing. The kana that is usually missing from the poem is -n, i.e. U+3093. <quote myself> P.S. In case you don't have it already, the i-ro-ha order is: i ro ha ni ho he to chi ri nu ru wo wa ka yo ta re so tsu ne na ra mu u wi no o ku ya ma ke fu ko e te [^ that is one e] a sa ki yu me mi shi ye hi mo se su [^ that is the other -- probably should be (w)e] See, e.g., http://ccwww.kek.jp/iad/fink/western/wIJ2.html [Attributed to middle Heian, around A.D. 1000.] It was actually printed in the Unicode 1.0 book, when the circled Katakana characters at U+32D0..U+32FE were in i-ro-ha order. That was changed for Unicode 1.1, to synch up with the preferred a-i-u-e-o order for these characters in 10646. BTW, the translation of Kukai's iroha poem at that link leaves much to be desired, though the various versions shown in hiragana, katakana, and with kanji are interesting. A much, much better translation can be found at: http://www.raincheck.de/html/i-ro-ha___english.html or, in German(!), at: http://www.raincheck.de/html/i-ro-ha.html The English translation is quite literal. The German -- how shall I put it -- takes some poetic license. ;-) </quote myself> Or, for a really challenging version, you can try puzzling out: http://www.miho.or.jp/booth/html/imgbig/3247e.htm which shows a manyoogana version (all kanji, used syllabically), tacking on the epenthetic U+65E0 mu for the -n, which some versions of the poem do, just to be tidy. --Ken
Re: glyph selection for Unicode in browsers
Tex, 3) The language information used to be derived dubiously from code page and is missing with Unicode, and the architecture needs to accommodate a better model for bringing language to font selection. The archetypal situation is for CJK, and in particular J, where language choice correlates closely with typographical preferences, and where character encoding could, in turn, be correlated reliably with language choice. But in general, the connection does not hold, as for data in any of hundreds of different languages written in Code Page 1252, for example. What you are really looking for, I believe, is a way to specify typographical preference, which then can be used to drive auto-selection of fonts. I don't think we should head down the garden path of trying to tie typographical preference too closely to language identity, however we unknot that particular problem. This could get you into contrarian problems, where browsers (or other tools) start paying *too* much attention to language tags, and automatically (and mysteriously) override user preferences about the typographical presentation they expect for characters. What is needed, I believe, is: a. a way to establish typographic preferences b. a way to link typographical preference choices to fonts that would express them correctly c. a way to (optionally) associate a language with a typographical preference And this all should be done, of course, in such a way that default behavior is reasonable and undue burdens of understanding, font acquisition, installation, and such are not placed on end-users who simply want to read and print documents from the web. A tall order, I am sure. But as long as we are blue-skying about architecture for better solutions, I think it is important not to replace one broken model (code page = language) with another broken model (language = font preference). --Ken
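A back-of-the-envelope sketch of the three-part model above (all names, tags, and font choices here are invented placeholders, not a proposal for any actual API):

```python
# Sketch of (a) typographic preferences, (b) preference -> font,
# (c) an optional language -> preference association, with the user's
# own preference always winning over the document's language tag.
PREF_FONTS = {                 # (b): only the preference selects a font
    "ja": "MS Mincho",         # placeholder font names
    "zh-Hans": "SimSun",
    "zh-Hant": "MingLiU",
}
LANG_DEFAULT_PREF = {"ja": "ja", "zh": "zh-Hans"}   # (c): a default only

def pick_font(lang_tag=None, user_pref=None, fallback="ja"):
    # (a): an explicit user preference overrides any language tag
    pref = user_pref or LANG_DEFAULT_PREF.get(lang_tag, fallback)
    return PREF_FONTS[pref]

print(pick_font(lang_tag="zh", user_pref="ja"))   # MS Mincho
```

The design point is the ordering: the language tag only supplies a default (c), the user's typographic preference (a) can override it, and in every case it is the preference, never the tag directly, that selects the font (b). That keeps tags from mysteriously overriding what the reader expects.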
Re: Sequences of combining characters (from Romanization of Cyrillic andByzantine legal codes)
William Overington asked: While on the topic, how would the following sequence be displayed, please? U+0074 U+0361 U+0073 ZWJ U+0307 Just like: U+0074 U+0361 U+0073 U+0307 The sequence U+0073, ZWJ, U+0307 could request a ligature of the s and the dot above, but since it is unlikely that any type designer is going to actually ligate the dot into the s and produce a ligature glyph for it, the sequence is likely to be rendered as if it were just U+0073, U+0307, that is, an s with a dot above. I am not suggesting this for bibliographic work, just wondering: for the bibliographic work I feel that a new character COMBINING DOUBLE INVERTED BREVE WITH DOT ABOVE might be a good solution. Possibly. It is certainly a simple solution. --Ken William Overington 25 September 2002
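One point worth making explicit: the two sequences remain distinct plain text, and canonical decomposition does not remove or reorder the ZWJ, so the ligation request survives interchange even when renderers ignore it. A quick check (my illustration, not part of the exchange):

```python
# Check: the ZWJ-bearing sequence and the plain sequence are distinct
# code point sequences, and NFD leaves both unchanged, so the
# distinction is preserved in plain text regardless of rendering.
import unicodedata

with_zwj    = "\u0074\u0361\u0073\u200d\u0307"  # t, tie, s, ZWJ, dot above
without_zwj = "\u0074\u0361\u0073\u0307"        # t, tie, s, dot above

assert with_zwj != without_zwj
assert unicodedata.normalize("NFD", with_zwj) == with_zwj
assert unicodedata.normalize("NFD", without_zwj) == without_zwj
```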
Re: Keys. (derives from Re: Sequences of combining characters.)
Peter responded: A document would contain a sequence such as follows. U+2604 U+0302 U+20E3 12001 U+2460 London U+2604 U+0302 U+20E2 You could just as easily have used <S C="12001">London</S> or <S C="12001" P1="London"/> or even: <cometcircumflex messageId="12001">London</cometcircumflex> if one likes the ring of comet circumflex for one's tags. These are only slightly more verbose, but they follow a widely-implemented standard, namely XML, which I think effectively gainsays William's earlier comment: XML does not suit my specific need as far as I can tell. And as far as the idea of having parameterized messages, with translation catalogs, I would join the chorus inviting William to investigate the state of the art before attempting to invent something that already exists in many forms. Or, to further mangle Marco's musical metaphor, as you go round and around on this topic, make sure that you don't mix up the apples *for* the horses with the horseapples *from* the horses. --Ken ;-)
Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)
Charles Cox suggested: Might there be a case for defining an invisible combining enclosing mark (ICEM), which is otherwise identical to the enclosing circle? Then, if I've understood the conventions correctly the sequence: U+0074 U+034F U+0073 ICEM U+0311 U+0307 would give ts with a centrally placed inverted breve and a centrally placed dot above the inverted breve. We have talked about that option. It has a certain elegance to it as well. But implementers are getting very leery of continuing to add invisible format control characters of various types into the mix. They often seem to introduce unanticipated problems for rendering systems. My current feeling is that while we have demonstrable cases of visibly ligated digraphs with dots above in print, it isn't clear that we have a significant data representation problem that *requires* the introduction of some new mechanism -- yet. This stuff *can* all be handled with appropriately designed ligations in fonts, so there are options for display: U+0074, U+0361, U+0073, U+0307 == maps via ligation table to: {t-s-tie-ligature-with-dot-above} glyph even though the default rendering would be: {t-s-dot-tie-ligature} glyph --Ken
Re: Sequences of combining characters (from Romanization of Cyrillic andByzantine legal codes)
Peter said: This stuff *can* all be handled with appropriately designed ligations in fonts, so there are options for display: U+0074, U+0361, U+0073, U+0307 == maps via ligation table to: {t-s-tie-ligature-with-dot-above} glyph I would consider this an anomalous rendering. It is counter-exemplified by figure 7-6 in TUS3.0. I'd be concerned about longer-term problems if we decided to say that this was a valid alternate rendering from {t-s-dot-tie-ligature} glyph Well, yes, it would be anomalous, which is why it would require somebody to go to the trouble to make a special ligation table entry for it. But what longer-term problems are you talking about? I didn't say we should put in a formal rendering *rule* in the Unicode Standard that says something different from Figure 7-6, along the lines of converting one form to the other as above. Look, let's consider again what problem we are trying to solve here. We have two funky forms from the ALA-LC transliteration tables, for which we haven't heard back yet from bibliographic sources whether there actually is any *actual* data representation problem in USMARC records. We can try to invent and promulgate a generic rendering solution for these cases (and anything like them) in the Unicode Standard, despite the fact that they are an edge case of an edge case for Latin script rendering... Or, if it turns out that it isn't a general-enough problem to force everyone to deal with it in terms of generic rendering, we could suggest alternatives: a. markup solutions b. specific font ligation solutions for specialized data Now consider again the function of these things in the ALA-LC transliteration. The Cyrillic transliteration recommendations make rather extensive use of ligature ties. Why? Because the ALA-LC transliteration schemes make some effort to be round-trippable. 
In other words, the Cyrillic transliteration they recommend is not merely a useful romanization that might be in more general use, as for a newspaper, but is a romanization from which, in principle, you ought to be able to recover the Cyrillic it was transliterated from. Thus these schemes distinguish t-s from t-s-tie-ligature, since the ligated form might be a transliteration of a tse or similar letter, whereas the t-s would be a transliteration of a te+es, and so on. In other words, the tie-ligatures are being sprinkled in to make ad hoc digraphs for the transliteration, to aid in recovery of the Cyrillic from the romanization. Now the dots above typically represent an articulatory diacritic, as for palatalization, or the like. So the combination of the two is to indicate: we are transliterating a letter with a palatal (say) diacritic, using a digraph. Do we have alternatives in Unicode for that? Well, yes, depending on whether the problem is: a. enabling exact transcoding of the USMARC data records using ALA-LC romanization recommendations and the ANSEL character set, for interoperability with Unicode systems. or b. typesetting the ALA-LC romanization document guide in Unicode, treating all the data therein as plain text and using generic Unicode rendering rules. I contend that the primary problem is a), and that we ought to examine the general usefulness of this dot-above-double-diacritic and related rendering, before we insist it has to be representable in plain text and go looking for an encoding solution and specify a bunch of rendering rules for it. If the essential requirement here is to capture the data functionality of the transliteration: a roundtrippable form, with a palatal diacritic, using a digraph, we could suggest, for instance: U+0074, U+034F, U+0073, U+0307 or U+0074, U+0307, U+034F, U+0073 where we end up with an explicitly indicated digraph, with a dot-above diacritic (pick which letter you want it on), as a grapheme cluster. 
This is distinct from: U+0074, U+0073, U+0307 or U+0074, U+0307, U+0073 so you have your transliteration round-trippability intact. And for your special-purpose application, which is a Unicode system to display USMARC bibliographic records using the ALA-LC romanization presentation conventions, you add ligation entries to your font so that U+0074, U+034F, U+0073, U+0307 and similar forms using a U+034F GRAPHEME JOINER display with a visible tie-ligature, rather than nothing, despite the fact that no U+0361 double diacritic is being used in the data. Problem solved. Of course, that doesn't mean that your converted USMARC data records involving digraphs for Cyrillic transliteration will display with the tie-ligature in a generic web application using off-the-shelf fonts -- but is that the problem we are trying to solve here? I doubt it. The forms would be legible -- perhaps more legible without the obtrusive ties cluttering them up -- and the data distinctions would still be preserved in such contexts. --Ken
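The distinctness claim above is easy to verify with Python's stdlib unicodedata module: U+034F COMBINING GRAPHEME JOINER has no decomposition and is not discarded by normalization, so the CGJ-marked digraph can never collapse into the plain letter sequence under any Unicode normalization form.

```python
import unicodedata

# t, COMBINING GRAPHEME JOINER, s, COMBINING DOT ABOVE (explicit digraph)
digraph = "t\u034fs\u0307"
# t, s, COMBINING DOT ABOVE (plain letter sequence)
plain = "ts\u0307"

# U+034F survives all four normalization forms, so round-trippability
# of the transliteration is preserved even through normalization.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    d = unicodedata.normalize(form, digraph)
    p = unicodedata.normalize(form, plain)
    print(form, "distinct:", d != p)   # distinct in every form
```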
Re: Sequences of combining characters (from Romanization of Cyrillic and Byzantine legal codes)
William Overington asked: In the discussion about romanization of Cyrillic ligatures I asked how one expresses in Unicode the ts ligature with a dot above. Regarding Ken's response to the Byzantine legal codes matter, it would appear possible that the way that the ts ligature with a dot above for romanization of Cyrillic could be represented in Unicode would be by the following sequence. t U+FE20 s U+FE21 U+0307 The ordinary ts ligature for romanization of Cyrillic being expressed as follows. t U+FE20 s U+FE21 As Peter indicated, the preferred way to represent this graphic ligature tie in Unicode is with the double diacritics, i.e.: t U+0361 s U+FE20 and U+FE21 are compatibility characters, for interoperation, in particular, with the USMARC catalog records using the Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL). See: http://lcweb.loc.gov/catdir/cpso/romanization/charsets.pdf It appears to me that the ts ligature with a dot above, and a similar ng ligature with a dot above, are already needed for the Library of Congress romanization of Cyrillic system. The following directory contains a lot of pdf files. http://lcweb.loc.gov/catdir/cpso/romanization The ts ligature with a dot above can be found on page 2 of the nonslav.pdf file. The ng ligature with a dot above can be found on page 13 of the same file. And, in particular, the ts ligature with a dot above is for an Abkhaz romanization, and the ng ligature with a dot above is for an obsolete Mansi (related to Khanty) romanization. I suspect their actual use is pretty limited. Capital letter versions of the two ligatures are needed as well. Well, this is interesting, since these were *added*, systematically, to the 1997 version of the ALA-LC non-Slavic romanization systems. The 1990 version did not have them. That raises the question of whether these were simply editorial extensions, or were actually *needed* for some bibliographical data. 
I consider it unlikely that all of the capital forms were suddenly discovered between 1990 and 1997 and that a whole bunch of USMARC bibliographical records making use of the capital forms were created during that interval. In this regard, one should *read* the ALA-LC document. See charsets.pdf: The transliterations produced by applying ALA-LC Romanization Tables are encoded in machine-readable form into USMARC records. Encoding of the basic Latin alphabet, special characters, and character modifiers listed in this publication is done in USMARC records following two American National Standards; the Code for Information Interchange (ASCII) (ANSI X3.4), and the Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL) (ANSI Z39.47). Each character is assigned a unique hexadecimal (base-16) code which identifies it unambiguously for computer processing. The current version of how that is done is the MARC 21 Specifications for Record Structure, Character Sets, and Exchange Media. Among other things, that specification spells out how the combining marks are used with base characters in USMARC records. I don't know, however, if any provision was actually made in MARC 21 for these instances of ligature ties with dots above. Perhaps someone familiar with the details of USMARC can answer that. The USMARC records (using ANSEL) *would*, however, be making use of the half ligature characters: 0xEB LIGATURE, FIRST HALF 0xEC LIGATURE, SECOND HALF as well as: 0xE7 SUPERIOD [sic] DOT (s.b. SUPERIOR DOT) It just isn't clear exactly what order these would occur in any hypothetical USMARC record actually using either the Abkhaz or Mansi romanizations in question. I wonder if consideration could please be given as to whether this matter should be left unregulated or whether some level of regulation should be used. 
I think this should depend first on a determination of whether there is a demonstrated need for an actual representation of these sequences -- which ought to be determined by the people responsible for the data stores which might contain them, namely the online bibliographic community. The ALA-LC conventions are not the only alternatives available for representation of Abkhaz and/or Khanty/Mansi data in romanization. In fact, you can find such data on the web using alternative romanizations. So it isn't as if the current gap in figuring out precisely how, in Unicode, to represent a double diacritic with another diacritic applied outside the visible double diacritic on a digraph is preventing anyone from using romanized Abkhaz or Khanty/Mansi data in interchange. --Ken William Overington 18 September 2002
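As a sketch of what a USMARC-to-Unicode conversion for the half-ligature and dot characters mentioned above might look like: ANSEL places combining marks *before* the base letter, while Unicode places them *after*, so a converter has to buffer and reorder. The byte values here (0xEB → U+FE20, 0xEC → U+FE21, 0xE7 → U+0307) follow the published MARC-8 mappings, but, as noted above, the ordering conventions in actual USMARC records for these Abkhaz/Mansi forms are unconfirmed, so treat this as an illustrative assumption rather than a specification.

```python
# Assumed ANSEL/MARC-8 combining-mark byte values (to be confirmed
# against the MARC 21 character set specification):
ANSEL_COMBINING = {
    0xEB: "\ufe20",  # LIGATURE, FIRST HALF  -> COMBINING LIGATURE LEFT HALF
    0xEC: "\ufe21",  # LIGATURE, SECOND HALF -> COMBINING LIGATURE RIGHT HALF
    0xE7: "\u0307",  # SUPERIOR DOT          -> COMBINING DOT ABOVE
}

def ansel_to_unicode(data: bytes) -> str:
    """Convert an ANSEL fragment, reordering prefixed combining marks
    so that they follow their base letter, as Unicode requires."""
    out, pending = [], []
    for b in data:
        if b in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[b])  # mark precedes its base
        else:
            out.append(chr(b))                  # ASCII base letter
            out.extend(pending)                 # marks now follow the base
            pending.clear()
    return "".join(out)

# "ts" with a ligature tie: EB t EC s  ->  t U+FE20 s U+FE21
print(ansel_to_unicode(bytes([0xEB]) + b"t" + bytes([0xEC]) + b"s"))
```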
Re: Sequences of combining characters (from Romanization of Cyrillicand Byzantine legal codes)
The ALA-LC conventions are not the only alternatives available for representation of Abkhaz and/or Khanty/Mansi data in romanization. In fact, you can find such data on the web using alternative romanizations. So it isn't as if the current gap in figuring out precisely how, in Unicode, to represent a double diacritic with another diacritic applied outside the visible double diacritic on a digraph is preventing anyone from using romanized Abkhaz or Khanty/Mansi data in interchange. By the same argument, Unicode might as well stop taking new characters; surely, between the 500 Latin characters and dozens of punctuation marks and combining characters and the other 70,000 characters, you can find a way to communicate whatever language or data you need communicated. Of course. Let them use ASCII, for that matter. But that wasn't my point. There is no particular evidence that the ALA-LC conventions with the dot above the graphic ligature ties is in widespread use for romanizations of these particular languages, that I can see. So the *urgency* of solving this problem isn't there, unless the LC/library/bibliographic community comes to the UTC and indicates that they have a data interchange problem with USMARC records using ANSEL that requires a clear representation solution in Unicode. And before we go there, I'd like to have a clear specification of how it works in USMARC records, so we would know how to do the following conversion: USMARC -- Unicode for the two forms in question. The 1990 version of the LC romanizations for this non-Slavic stuff used all kinds of hand-drawn forms. And even the 1997 version of the ALA-LC document is photo-offset from pages that include various kinds of pasteup from who-knows-what sources, including some hand-drawn, with at least one of these dots above being added by hand. 
So it isn't clear that there is any connection between the ALA-LC document text and the ANSEL character encoding actually used in the USMARC records; this could be arbitrary markup with some system like TeX for publication. BTW, if we are blueskying about this topic, the *elegant* way to resolve this would be to recategorize all the double diacritics as *enclosing* combining marks (Me), rather than Mn, and then rewriting the conventions for their use to match those of the enclosing circle and such. Then they would subtend (or supertend) any grapheme cluster, including arbitrary digraphs indicated with a COMBINING GRAPHEME JOINER character. And a dot above would neatly apply to the entire subtended cluster, as for circled characters, and so on. Of course, that would invalidate anybody's current usage of the characters. Oh well, you can't win 'em all. --Ken
Re: French or German Unicode Names??
Ms. Hughes, ISO/IEC 10646-1:2000, which is exactly correlated with the Unicode Standard, Version 3.0, is available in French. You can purchase a copy from ISO: http://www.iso.ch/ (Go to the ISO Store section of the site and search for the ISO number 10646.) I don't know of any German translation of all the character names. As far as I know, German users of the standard simply make use of the English names of the characters. But you could confirm by contacting the German standards organization, DIN: http://www.din.de [EMAIL PROTECTED] --Ken Whistler - Begin Included Message - -Original Message- Date/Time: Tue Sep 17 01:20:08 EDT 2002 Contact: [EMAIL PROTECTED] Report Type: Other Question, Problem, or Feedback Hi, We are trying to find a set of the Unicode tables with the character labels in French, and one where they are in German. Do you have these available, or can you point us in the direction of where we might find them please? Kind regards, Maryanne Hughes Technical Writer, Pulse Data International, New Zealand (End of Report) - End Included Message -
UTF-8 (was Re: Mercury News: Hawaiian on a Mac)
Markus Scherer responded: Stefan Persson wrote: This links to a different page on the same server: http://www.cl.cam.ac.uk/~mgk25/unicode.html That page contains a strange UTF-8 table: ... The last two byte sequences are invalid. Markus Kuhn's page shows the original ISO 10646 definition. And still the current ISO/IEC 10646 definition: Table D.1 in Annex D, UCS Transformation Format 8 (UTF-8). Note that the definition of the 5- and 6-byte UTF-8 sequences for code positions past U-0010FFFF is essentially harmless, as ISO/IEC 10646 now contains explicit language indicating the non-intention to encode any characters at code positions past U-0010FFFF. So the definition of the 5- and 6-byte sequences is vacuous -- no such sequence will ever be a valid representation of an *encoded character* in 10646. This necessarily includes all codes up to 7FFFFFFF. It also includes D800..DFFF, which is not allowed in Unicode 3.2 and the RFC on UTF-8, and I think implicitly not allowed in ISO 10646. They are *explicitly* not allowed in UTF-8 in ISO/IEC 10646 as well. From Clause D.4 Mapping from UCS-4 form to UTF-8 form: Values of x in the range D800 .. DFFF are reserved for the UTF-16 form and do not occur in UCS-4. The mappings of these code positions in UTF-8 are undefined. --Ken
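Modern UTF-8 decoders reflect exactly this restriction. For example, Python's built-in decoder rejects the old 5- and 6-byte forms as well as encoded surrogates:

```python
# Byte sequences that were well-formed under the original Annex D tables
# but are invalid under the modern definition of UTF-8:
samples = [
    ("5-byte sequence (old U-00200000)", b"\xf8\x88\x80\x80\x80"),
    ("6-byte sequence (old U-7FFFFFFF)", b"\xfd\xbf\xbf\xbf\xbf\xbf"),
    ("encoded surrogate U+D800",         b"\xed\xa0\x80"),
]
for label, data in samples:
    try:
        data.decode("utf-8")
        print(label, "-> accepted")
    except UnicodeDecodeError:
        print(label, "-> rejected")   # all three are rejected
```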
Re: various stroked characters
Peter, Here's my take on your questions. The less clear cases involve b, d and g. 1) Lower case b with a horizontal stroke through the bowl (hereafter b-stroke-bowl) is used in some phonetic traditions for voiced bilabial fricative (beta, in IPA). The annotation for U+0180 (b with a horizontal stroke across the ascender) indicates that one of its intended purposes is for phonetic transcription of the same phone. Of course, U+03B2 (beta) also has this function and is not unified with 0180, but these are clearly distinct characters (e.g. 0180 and 03B2 have other unrelated functions). I can't imagine anyone using b-stroke-bowl contrastively with 0180. Thus, probably the best option is to treat b-stroke-bowl as a typographic variant of 0180. Any opinions confirming this view or to the contrary? I agree. This is what Pullum and Ladusaw called the Barred B, as opposed to the Indo-European Crossed B (i.e. U+0180): By a general convention, barred stop symbols (with a superimposed hyphen or short dash through the body of the letter) are often used to represent those fricatives for which the IPA symbols are not used. The resultant symbols have the advantage of being easy to type on an unmodified typewriter. By the way, there is also the Slashed B, which is another alternative form for the Barred B, used for the same purpose, but instantiated by typing b backspace / instead of b backspace -. For what it is worth, the founders of Unicode considered these three forms to be allographs of an abstract barred-b character, so that is what the current situation is. Trying to separately encode a Barred B distinct from the Crossed B would, at this point, constitute an explicit disunification, rather than simply a discovery of an overlooked character to encode. 2) Next, consider the g. The representative glyph in TUS3.0 for U+01E5 shows a double-bowl g with a horizontal stroke through both sides of the bottom bowl. The annotation indicates that it is used for Skolt Saami. 
Looking at a few fonts, I see some variations: Andale Mono and Code 2000 have a double-bowl g with a horizontal stroke through *the right side only* of the lower bowl; Lucida Sans Unicode and Arial Unicode MS have a single-bowl g with a horizontal stroke through the right side only of the bowl. Pullum and Ladusaw show two other glyphic alternatives: Barred G with an IPA style g and a horizontal stroke through the bowl. Crossed G with an IPA style g and a horizontal stroke through the descender. Now, what I'm concerned with is a g (single-bowl in all instances I'm familiar with) that has a horizontal stroke through both sides of the (upper -- only) bowl, used in some phonetic traditions to represent a voiced velar fricative (IPA gamma). Any opinions on whether to treat this as a new character or as a typographic variant of U+01E5? All allographs of the same underlying character. The same concepts and analogies apply here. The Crossed G was probably explicitly formed by analogy from the more-attested Crossed B and Crossed D. The ones with horizontal strokes through the bowl are all just variants on what happens when you backspace and put a hyphen across your g. 3) Finally, the d. Unicode has three upper-case stroked-d characters for which the representative glyphs are identical, but which have distinct lower-case counterparts (the basis for having three distinct upper-case characters). Of the three pairs, two really aren't relevant to this discussion. The one relevant pair is U+0110 LATIN CAPITAL LETTER D WITH STROKE, and U+0111 LATIN SMALL LETTER D WITH STROKE. Now, in some phonetic traditions, a d with a horizontal stroke through the bowl (both sides) is used for a voiced interdental fricative (IPA U+00F0). Some phonetic traditions represent this using U+0111. 
I've also learned of some African languages that are written with upper and lower stroked d; I've seen samples that show some glyph variation: some samples show a horizontal stroke that crosses both sides (both upper and lower case); other samples show the horizontal stroke on only one side -- through the stem of the upper case (just like U+00D0, U+0110 and U+0189), and through the right side of the bowl of the lower case (not through the ascender, as shown in the charts for U+0111). So, again: any opinions on whether d-stroke-bowl should be unified with U+0111 or considered a new character? Again, all allographs of the same underlying character. And once again, as for b, there are, in addition to the Crossed D and Barred D allographs, also a Slashed D allograph. There is no need to proliferate distinct encodings for these, whether the bars of the Barred D forms go all the way across or just partway across either the lowercase and/or the uppercase forms. Those are just various typographic attempts to do decent design for the letter forms based on the concept of having to apply a horizontal stroke to the d/D.
RE: Double Macrons on gh...
Robert Wheelock asked: Recently, I read some messages saying that there are 3 new double-wide overstruck accents proposed for Unicode: Umm. Well, they aren't double-wide and they aren't overstruck, and their names are not: 035D: double-wide breve 035E: double-wide macron 035F: double-wide underbar (d-w combining low line) but rather: 035D COMBINING DOUBLE BREVE 035E COMBINING DOUBLE MACRON 035F COMBINING DOUBLE LOW LINE Please send me more info (and some documentation) on those accents. These would occur in sequences such as: o, combining double breve, o to give the effect of a breve stretched over a pair of o's, as often seen in Webster-style dictionary pronunciation guides. Technically, the combining double accents combine with the base letter they follow, but their glyphs would be designed so that they would overhang a following base letter as well. In practice, fonts might simply choose to have ligatures for the entire sequence, to avoid complications of calculating the accent positions dynamically. For more examples, just look in dictionary pronunciation guides. --Ken
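In terms of character sequences, here is a minimal illustration with Python's stdlib unicodedata: the double diacritic logically follows the *first* base letter, even though its glyph spans both.

```python
import unicodedata

# "oo" with a breve stretched over the pair, as in dictionary
# pronunciation guides: o, COMBINING DOUBLE BREVE, o
s = "o\u035do"
for ch in s:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+035D  COMBINING DOUBLE BREVE
# U+006F  LATIN SMALL LETTER O
```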
Re: Revised proposal for Missing character glyph
[Resend of a response which got eaten by the Unicode email during the system maintenance last week. Carl already responded to me on this, but others may not have seen what he was responding to. --Ken] Proposed unknown and missing character representation. This would be an alternate to the method currently described in 5.3. The missing or unknown character would be represented as a series of vertical hex digit pairs for each byte of the character. The problem I have with this is that it seems to be an overengineered approach that conflates two issues: a. What does a font do when requested to display a character (or sequence) for which it has no glyph. b. What does a user do to diagnose text content that may be causing a rendering failure. For the first problem, we already have a widespread approach that seems adequate. And other correspondents on this topic have pointed out that the particular approach of displaying hex numbers for characters may pose technical difficulties for at least some font technologies. [snip] This representation would be recognized by untrained people as unrenderable data or garbage. So it would serve the same function as a missing glyph character except that it would be different from normal glyphs so that they would know that something was wrong and the text did not just happen to have funny characters. I don't see any particular problem in training people to recognize when they are seeing their fonts' notdef glyphs. The whole concept of seeing little boxes where the characters should be is not hard to explain to people -- even to people who otherwise have difficulty with a lot of computer abstractions. Things will be better-behaved when applications finally get past the related but worse problem of screwing up the character encodings -- which results in the more typical misdisplay: lots of recognizable glyphs, but randomly arranged into nonsensical junk. (Ah, yes, that must be another piece of Korean spam mail in my mail tray.) 
It would aid people in finding the problem and for people with Unicode books the text would be decipherable. If the information was truly critical they could have the text deciphered. Rather than trying to engineer a questionable solution into the fonts, I'd like to step back and ask what would better serve the user in such circumstances. And an approach which strikes me as a much more useful and extensible way to deal with this would be the concept of a What's This? text accessory. Essentially a small tool that a user could select a piece of text with (think of it like a little magnifying glass, if you will), which will then pop up the contents selected, deconstructed into its character sequence explicitly. Limited versions of such things exist already -- such as the tooltip-like popup windows for Asmus' Unibook program, which give attribute information for characters in the code chart. But I'm thinking of something a little more generic, associated with textedit/richedit type text editing areas (or associated with general word processing programs). The reason why such an approach is more extensible is that it is not merely focussed on the nondisplayable character glyph issue, but rather represents a general ability to query text, whether normally displayable or not. I could query a black box notdef glyph to find out what in the text caused its display; but I could just as well query a properly displayed Telugu glyph, for example, to find out what it was, as well. This is comparable (although more point-oriented) to the concept of giving people a source display for HTML, so they can figure out what in the markup is causing rendering problems for their rich text content. [snip] This proposal would provide a standardized approach that vendors could adopt to clarify missing character rendering and reduce support costs. By including this in the standard we could provide a cross vendor approach. This would provide a consistent solution. 
In my opinion, the standard already provides a description of a cross-vendor approach to the notdef glyph problem, with the advantage that it is the de facto, widely adopted approach as well. As long as font vendors stay away from making {p}'s and {q}'s their notdef glyphs, as I think we can safely presume they will, and instead use variants on the themes of hollowed or filled boxes, then the problem of *recognition* of the notdef glyphs for what they are is a pretty marginal problem. And as for how to provide users better diagnostics for figuring out the content of undisplayable text, I suppose the standard could suggest some implementation guidelines there, but this might be a better area to just leave up to competing implementation practice until certain user interface models catch on and get widespread acceptance. --Ken
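A minimal sketch of the "What's This?" text accessory described above, using Python's stdlib unicodedata (the function name and output format are invented for illustration): given any selected run of text, displayable or not, it deconstructs the run into its explicit character sequence.

```python
import unicodedata

def whats_this(text: str) -> str:
    """Deconstruct a text selection into its explicit character sequence,
    one code point per line, with the formal Unicode character name."""
    lines = []
    for ch in text:
        # unicodedata.name() has no names for controls/unassigned code
        # points, so fall back to a generic label for those.
        name = unicodedata.name(ch, "<no character name>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return "\n".join(lines)

# Query the t-s-tie-ligature-with-dot-above sequence discussed earlier:
print(whats_this("t\u0361s\u0307"))
```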
Re: The Unicode Technical Committee meeting in Redmond, Washington State, USA.
William Overington inquired: As many readers may know, the Unicode Technical Committee was due to start a four day meeting yesterday at the Redmond, Washington State, USA campus of Microsoft, that is, on 20 August 2002. Here in England I am interested to know of what is happening and to learn of news from the meeting. As Sarasvati has indicated, minutes will be publicly posted in a few weeks. See: http://www.unicode.org/unicode/consortium/utc-minutes.html [BTW, the minutes from the February and April/May meetings have actually been approved, although their status has not been updated to Approved yet on the website page.] It is the early hours of the morning in Washington State at present. It is hoped that when delegates get up for breakfast that they might look in their emails and make early morning responses, or perhaps arrange for an official briefing to be posted later in the day. If I were conducting a live interview with the committee chairman or with an official spokesperson I would ask the following questions. Unfortunately, the UTC has not yet arranged its television contract with ESPN, since character encoding has not generally been considered a mass-appeal spectator sport. However, since I did attend the UTC meeting last week, I may be able to provide up-to-date commentary regarding some of the questions which are not better answered by waiting for the official minutes. * What was discussed yesterday (Tuesday) please, and what formal decisions, if any, were taken please? Wait for the minutes. * How many people attended please? 16 on Tuesday. 18 on Wednesday. Back down to 15(?) on Thursday and Friday. * Is it only companies which are full members of the Unicode Consortium who send delegates to the meeting, or are there also representatives of organizations who do not vote in decisions present as well? The latter. * Will there be a press statement at the close of the meeting please, and if so, will it also be posted in the Unicode mailing list please? 
No, there will not be a press statement. Encoding of a VERTICAL LINE EXTENSION character was not considered of such earth-shattering consequence that it would lead to headlines in the technology press. * Has there been, or is there on the agenda, any discussion of the wording in the Unicode specification about the use of the Private Use Area and, if so, are any changes to that wording being implemented? Not discussed by the UTC last week. This is in the purview of the editorial committee. * Has there been, or is there on the agenda, any discussion concerning the status of the code points U+FFF9 through to U+FFFC please? There has been some discussion recently in the Unicode mailing list about these code points, as regards issues of U+FFF9 through to U+FFFB as an issue, the issue of using U+FFFC as a single issue, and the issue of using U+FFF9 through to U+FFFC all together. Is the committee discussing these issues at all and, if so, are they discussing the matter of whether U+FFFC can be used in sending documents from a sender to a receiver please? Is there any discussion of a possible rewording, or changing of meaning, of the wording about the U+FFF9 through to U+FFFC code points in the Unicode specification please? Not discussed by the UTC last week. This is in the purview of the editorial committee. * Are any matters concerning how the Unicode specification interacts with the way that fonts are implemented being discussed please? Yes. In a general way, this ends up being discussed at every meeting. If so, is due care being taken that as font format is not, at present, an international standards matter that therefore the committee must take great care to ensure that Unicode does not become dependent upon a usage, express or implied, of the intellectual property rights or format of any particular font format specification? The UTC always attempts to exercise due care in what it considers, but it is unclear just what clarification you are asking for here. 
The UTC does not standardize font formats. * Is there any discussion of the possibility of adding further noncharacters please, considering either or both adding some more noncharacters in plane 0 and a large block of noncharacters in one of the planes 1 through to 14? No. * Is the committee discussing the issue of interpretation, namely as to how, if various people read the published specification so as to have different meanings, how people may receive a ruling as to the formally correct meaning of the wording of the specification. This recently arose in relation to the U+FFFC character and has previously arisen in relation to what is correct usage of the Private Use Area, so there are at least two areas where the issue of interpretation has arisen. No. The UTC is a standardization committee, not a court of law. If a problem of interpretation of the standard arises, and if the UTC thinks that is a
Re: The existing rules for U+FFF9 through to U+FFFC. (spins from Re: Furigana)
An interesting point for consideration is whether the following sequence is permitted in interchanged documents:

U+FFF9 U+FFFC U+FFFA Temperature variation with time. U+FFFB

That is, the annotated text is an object replacement character and the annotation is a caption for a graphic.

Yes, permitted. As would also be:

U+FFF9 U+FFFC U+FFFC U+FFFA
    U+FFF9 Temperature U+FFFA a measure of hotness, related to the
        U+FFF9 kinetic energy U+FFFA energy of motion U+FFFB
    of molecules of a substance U+FFFB
    U+FFF9 variation U+FFFA rate of change U+FFFB
    with time U+FFFC .
U+FFFB

Where the first U+FFFC is associated with a URL with a realtime data feed, the second U+FFFC is a jar file for a 3-dimensional dynamic display algorithm, and the third U+FFFC is a banner ad for Swatch watches.

It seems to me that if that is indeed permissible that it could potentially be a useful facility.

Permissible does not imply useful, however, in this case. It is unlikely that you are going to have access to software that would unscramble such layering in purported plain text, even if you had agreements with your receivers. That is what markup and rich text formats are for.

Note that it is also *permissible* in Unicode to spell permissible as purrmisuhbal. That doesn't mean that it would be a good idea to do so, but the standard does not preclude you from doing so. You could even write a rendering algorithm which would display the sequence of Unicode characters p,u,r,r,m,i,s,u,h,b,a,l with the glyphs {permissible} if you so choose.

--Ken
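The nesting rules implied by the example above (every U+FFF9 anchor eventually closed by a matching U+FFFB, with U+FFFA separators legal only inside an open annotation) can be checked mechanically. This is an illustrative sketch, not anything specified by the standard; the function name and the exact well-formedness rules are my own reading of the annotation-character description:

```python
ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

def well_formed(text: str) -> bool:
    """Check that interlinear annotation characters nest properly:
    U+FFF9 opens an annotation, U+FFFB closes the innermost open one,
    and U+FFFA may appear only inside an open annotation."""
    depth = 0
    for ch in text:
        if ch == ANCHOR:
            depth += 1
        elif ch == SEPARATOR:
            if depth == 0:          # separator outside any annotation
                return False
        elif ch == TERMINATOR:
            depth -= 1
            if depth < 0:           # terminator with no open anchor
                return False
    return depth == 0               # every anchor was closed

# Ken's caption example nests one level deep and checks out:
print(well_formed("\ufff9\ufffc\ufffaTemperature variation with time.\ufffb"))
```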
Re: Furigana
Doug (and Michael also):

What if I *want* to design an annotation-aware rendering mechanism? Suppose I read Section 13.6 and decide that, instead of just throwing the annotation characters away, I should attempt to display them directly above (and smaller than) the normal text, the way furigana are displayed above kanji. This would work not only for typical Japanese ruby, but also for Michael's English-or-Swedish-over-Bliss scenario. It might even be useful in assisting beleaguered Azerbaijanis, for example, by annotating Latin-script text with its Cyrillic equivalent. (Just a thought.) Would this be conformant?

Well, technically conformant, but not wise. If commonly available display and rendering mechanisms are not rendering them as interlinear annotations, then you aren't really providing much assistance here by using a mechanism designed for internal anchors and trying to turn it into something it isn't really up to snuff for. Frankly, you would be much better off making use of the Ruby annotation schemes available in markup languages, which will give you better scoping and attribute mechanisms.

Stop worrying a moment about "Why are these characters standardized, and why the hedoublehockeysticks can't I use them?!" and think about the problem that furigana or any other interlinear annotation rendering system has to address:

a. How are the annotations adjusted? Left-adjusted, centered, something else? And what point(s) are they synched on?

b. If the annotated text or the annotation itself consists of multiple units, are there subalignments? E.g.

    note note      note note
    text text textextextext text

or

    note note note note
    text text textextextext text

c. Can an annotation itself be stacked into a multiline form?

    note note
    note nononononote
    text

d. Can the text of the annotation itself in turn be annotated?

e. Can the text have two or more coequal annotations? And if so, how are they aligned?

f. If the annotation is in a distinct style from the text it annotates, how is that indicated and controlled?

g. How is line-break controlled on a line which also has an annotation?

And so on. This is all the kind of stuff that clearly smacks to me of document formatting concerns and rich text. Why anyone would consider such things to be plain text rather escapes me.

--Ken
Re: Scripts in Unicode 4.0
John Hudson mused: Love the HOT BEVERAGE character, but where's the TALL LOWFAT SOYMILK MOCHA FRAPPUCCINO? Come on guys, there's enough blank spaces in that block for the entire Starbucks beverage menu, especially if you treat things like EXTRA FOAM as a combining character. Well, Starbucks is #550 on the Fortune 1000 list, which puts them ahead of many other members of the Unicode Consortium. Perhaps we should just hold out for them to join the consortium before we start worrying about encoding their beverage menu. --Ken
Re: An idea for keeping U+FFFC usable. (spins off from Re: Furigana)
William Overington teased us all unmercifully with:

It occurs to me that it is possible to introduce a convention, either as a matter included in the Unicode specification, or as just a known about thing, that if one has a plain text Unicode file with a file name that has some particular extension (any ideas for something like .uof for Unicode object file)

...or to pick an extension, more or less at random, say .html

that accompanies another plain text Unicode file which has a file name extension such as .txt, or indeed other choices except .uof (or whatever is chosen after discussion), then the convention could be that the .uof file has on lines of text, in order, the name of the text file and then the names of the files which contain each object to which a U+FFFC character provides the anchor. For example, a file with a name such as story7.uof might have the following lines of text as its contents:

story7.txt
horse.gif
dog.gif
painting.jpg

This is a shaggy dog story, right?

The file story7.uof could thus be used with a file named story7.txt so as to indicate which objects were intended to be used for three uses of U+FFFC in the file story7.txt, in the order in which they are to be used.

Or we could go even further, and specify that in the story7.html file, the three uses of those objects could be introduced with a very specific syntax that would not only indicate the order that they occur in, but could indicate the *exact* location one could obtain the objects -- either on one's own machine or even anywhere around the world via the Internet! And we could even include a mechanism for specifying the exact size that the object should be displayed. 
For example, we could use something like:

<img src="http://www.coteindustries.com/dogs/images/dogs4.jpg" width=380 height=260 border=1>

or

<img src="http://www.artofeurope.com/velasquez/vel2.jpg">

I can imagine that such a widely used practice might be helpful in bridging the gap between being able to use a plain text file or maybe having to use some expensive wordprocessing package. And maybe someone will write cheaper software -- we could call it a browser -- that could even be distributed for free, so that people could make use of this convention for viewing objects correctly distributed with respect to the text they are embedded in.

Yes, yes, I think this is an idea which could fly.

--Ken
Re: The mystery of Edwin U+1E9A
John Cowan asked: Where does this strange beast come from? Who would need a lower-case letter with a unique diacritic, and no upper-case equivalent? The U+1Exx block is random junk inherited from 10646 DIS 1. Does anyone understand it?

Semitic transliteration practice, if I recall correctly. Its name is LATIN SMALL LETTER A WITH RIGHT HALF RING, and the right half ring is indeed above the a. We don't have a RIGHT HALF RING ABOVE combining mark, so it only gets a compatibility decomposition. It's not really an *above* diacritic, but a little 02BE hamza half ring sitting at the upper right shoulder. The Unicode 3.0 glyph looks odd to me -- the Unicode 2.0 glyph made more sense. It's more akin to U+0149 as an oddball addition to the standard.

--Ken
Re: Discrepancy between Names List and Code Charts?
This is my first posting to this list so please be gentle with me!

*pounces and begins to play with the little furry creature (gently)*

Can someone help me with this confusion as I am unsure how I should structure these WITH CEDILLA characters in fonts I'm working on.

See TUS 3.0, pp. 162-163 for a discussion of these characters with cedillas (or ogoneks) below. The characters whose names are XXX WITH CEDILLA often (but not always) show variation between glyphs with cedillas and glyphs with commas below (or even other hooklike shapes). This variation is conditioned by at least: the shape of the letter itself, where a rounded bottom or a flat line in the center of the bottom of the character lends itself to a cedilla attachment, but a glyph such as that for a k does not; by the particular language being rendered; by different typographical traditions; and by font styles. The characters whose names are XXX WITH COMMA BELOW are intended to be rendered just with commas below -- ordinarily they should never show up with a cedilla in the glyph. For the Latvian letters you are probably best off following the conventions as currently shown in the code charts and as used in Arial Unicode MS, rather than in earlier fonts.

Am I just displaying my ignorance of European writing systems or does the Unicode Names List and/or the Code Charts need updating???!!!

The names list is correct, and cannot be updated -- the character names are fixed and unchangeable. The Code Charts have been updated already, with the Unicode 3.0 (and later) charts showing the glyph conventions recommended in the discussion in the text of the standard, whereas the Unicode 2.0 (and earlier) charts showed cedillas universally for all of the Latvian characters.

--Ken
Re: Double Macrons on gh (was Re: Tildes on Vowels)
James Kass asked:

Please note that both the UTC and WG2 have approved a new set of combining double accents: U+035D COMBINING DOUBLE BREVE, U+035E COMBINING DOUBLE MACRON, U+035F COMBINING DOUBLE LOW LINE [snip] Now, the question is, how long will it take for the fonts and browsers to catch up on those forms, as well?? The other double combiner marks already work fairly well in default position in existing browsers. These ought to work right out-of-the-box, once fonts include glyphs. Is it safe to include glyphs for the above referenced characters now?

Well, none of the Unicode 4.0 extensions will be entirely safe to use until after the December WG2 meeting in Tokyo. But my personal opinion is that these 3 are pretty unlikely to be disturbed by comments in national balloting between now and then.

--Ken
Re: Furigana
I want to be able to send a Blissymbol string with a gloss in English or Swedish attached. Nothing to do with Japanese whatsoever. Basically, as for all things annotational or interlineating, this is an excellent application for markup. --Ken
Re: Furigana
Michael,

At 14:16 -0700 2002-08-13, Kenneth Whistler wrote: I want to be able to send a Blissymbol string with a gloss in English or Swedish attached. Nothing to do with Japanese whatsoever. Basically, as for all things annotational or interlineating, this is an excellent application for markup.

When this was discussed in WG2 in Japan before they went in, I asked specifically, could I use this method to put Anglo-Saxon glosses on Latin text. The answer was positive, so it received my support. Were these always pre-deprecated? Why are they in the standard if no one is going to be allowed to use them?

Read the discussion which has been published in the Unicode Standard ever since these things were available. TUS 3.0, pp. 325-326:

The annotation characters are used in *internal* processing when out-of-band information is associated with a character stream, very similarly to the usage of the U+FFFC OBJECT REPLACEMENT CHARACTER... Usage of the annotation characters in plain text interchange is strongly discouraged without prior agreement between the sender and the receiver, because the content may be misinterpreted otherwise... When an output for plain text usage is desired and when the receiver is *unknown* to the sender, these interlinear annotation characters should be *removed*...

The Japanese national body was very clear about this, and was opposed to these going into the standard unless such clarifications were made, to ensure that these were not intended for plain text interchange of furigana (or other similar annotations).

--Ken
Re: Furigana
Michael Everson (in training as a curmudgeon) harrumpfed ;-) The Japanese national body was very clear about this, and was opposed to these going into the standard unless such clarifications were made, to ensure that these were not intended for plain text interchange of furigana (or other similar annotations). Well then they oughtn't to have been encoded. Yes, we agree that hindsight is a wonderful skill. This function would better be served by noncharacter code points, but nobody had quite figured out how to articulate that yet. But even at the time, as the record of the deliberations would show, if we had a more perfect record, the proponents were clear that the interlinear annotation characters were to solve an internal anchor point representation problem. Nobody (well, maybe somebody) expected them to serve as a substitute for a general markup mechanism for indication of annotation, and in particular, interlinear annotations. I recall at the time I pointed out that as a linguist I had routinely made use of 4-line interlinear annotation formats, and that this simple anchoring scheme couldn't even begin to represent such complexities in a usable fashion. --Ken
Re: Furigana
Tex asked: But does the standard address their removal by receivers (or intermediaries), and does removing them include removing the contained annotation?

Yes and yes. p. 326:

On input, a plain text receiver should either preserve *all* characters or remove the interlinear annotation characters *as well as the annotating text*...

I can imagine an application that doesn't support I.A. deciding the annotation is out of band and can't be preserved in its plain text output, and so justifiably strips it as well. Does the standard say what to do with "for internal use only" characters?

Yes. Unicode 3.1:

D7b: Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged.

C10: A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points, if that process purports not to modify the interpretation of that coded character sequence.

The interlinear annotation characters fall in a gray zone, since they are not noncharacters, but by rights ought to have been. Since they are standard characters, though, the standard has to provide some guidelines -- and it is simply safer, if you encounter and delete them, to also delete the annotation. You would be changing the interpretation of the text, but in a knowing, intended manner.

I would have thought the rule was to ignore and pass along.

In general, yes, as for everything else, including unassigned code points. If your role in life is as a database, for example, or some other kind of data source or data pipe, then minimal meddling with the bytes is safest. But other kinds of processes will do graduated manipulations, depending on what they are aiming for.

--Ken
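The safer of the two options quoted above (remove the annotation characters *and* the annotating text, keeping only the base text) can be sketched in a few lines. This is an illustrative sketch of the p. 326 guidance, not code from the standard; it handles nested annotations with a small stack:

```python
ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

def strip_annotations(text: str) -> str:
    """Remove interlinear annotation characters as well as the
    annotating text, keeping only the annotated (base) text.
    Nested annotations are handled: a character survives only if no
    enclosing annotation has passed its U+FFFA separator."""
    out = []
    stack = []  # one bool per open annotation: past its separator yet?
    for ch in text:
        if ch == ANCHOR:
            stack.append(False)
        elif ch == SEPARATOR and stack:
            stack[-1] = True            # entering annotating text
        elif ch == TERMINATOR and stack:
            stack.pop()
        elif not any(stack):            # not inside any annotating text
            out.append(ch)
    return "".join(out)

print(strip_annotations("a\ufff9b\ufffanote\ufffbc"))  # abc
```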
Re: Is U+0140 (l with middle dot) ever used?
Keld responded:

On Fri, Aug 09, 2002 at 11:44:40PM +0100, Anto'nio Martins-Tuva'lkin wrote: Hm. But middle dot is not only a letter symbol. It's also used as a bullet, a tab filling, even a box-drawing char. Shouldn't Unicode provide a way to separate this duality?

· has traditionally been used, e.g. in word processors, to visually display a blank character. But it was originally intended in ISO 8859-1 and other places for the Catalan language, which uses it in words such as paral·lel.

However, one cannot ignore the rest of the manifest history of this character. It has also long occurred in Code Page 437 and myriad other IBM and Microsoft Code Pages (IBM GCGID SD63) with a long history of ambiguous usage as punctuation and many other things.

I think · is now listed in Unicode as a separator, and not as alphabetical.

It is actually listed with General Category Po (Punctuation, Other), and not as one of the separator classes. But it also has the diacritic property and the extender property, which most punctuation characters do not. Property-based implementations can take advantage of other properties of U+00B7 to distinguish it from most punctuation.

I think that is an error. How can we correct it?

Changing it out of the General Category Po would disturb what by now is already a long legacy practice for many implementations. It would cause way more problems than the putative problem it is supposed to fix for Catalan. (This despite the fact that unlike the Catalan usage, which actually is more reminiscent of the delimiter usage of a middle dot, as in dictionary syl·la·bi·fi·ca·tion, there are actually quite a number of technically-based orthographies, in the Americas, at least, which use a middle dot simply as a long vowel diacritic.) Word delimitation depends on more than merely the General Category value, anyway, so appropriate word boundary determination can be developed for Catalan and other languages regardless of the General Category Po value for U+00B7. 
(See DUTR #29 on this.) And for identifiers, it is up to particular implementations to determine whether inclusion or exclusion of U+00B7 makes sense for their identifier syntax. What is gained for Catalan by including U+00B7 in identifiers may be offset by confusion that can set in against the usage of U+00B7 as a delimiter punctuation, or as a representation of middle dot operators in mathematical expressions. --Ken Kind regards Keld
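The properties Ken describes are easy to inspect. Python's `unicodedata` module exposes the General Category (though not the Diacritic or Extender properties, so only the Po classification is shown in this sketch), and a naive category-based word-breaker demonstrates exactly why Catalan needs the tailoring he mentions:

```python
import re
import unicodedata

# U+00B7 MIDDLE DOT: General Category is Po (Punctuation, Other),
# even though Catalan uses it word-internally, as in "paral·lel".
middle_dot = "\u00b7"
print(unicodedata.name(middle_dot))      # MIDDLE DOT
print(unicodedata.category(middle_dot))  # Po

# A naive "split on non-word-characters" breaker therefore cuts the
# Catalan word in two -- which is why word boundary determination has
# to look beyond the General Category value alone.
print(re.split(r"[^\w]", "paral\u00b7lel"))  # ['paral', 'lel']
```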
Re: Furigana
Michael asked: At 12:11 -0700 2002-08-08, Kenneth Whistler wrote: Ah, but read the caveats carefully. The Unicode interlinear annotation characters are *not* intended for interchange, unlike the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially, internal-use anchor points. What does this mean? That if I have a text all nice and marked up with furigana in Quark I can't export it to Word and reimport it in InDesign and expect my nice marked up text to still be marked up? Yes, among other things. Surely all Unicode/10646 characters are expected to be preserved in interchange. What have I got wrong, Ken? Your expectation that this stuff will actually work that way. Yes, the characters will be preserved in interchange. But the most likely result you will get is: anchor1textanchor2annotationanchor3 where the anchors will just be blorts. You should not expect that the whole annotation *framework* will be implemented, and certainly not that these three characters will suffice for nice[ly] marked up... furigana. These animals are more like U+FFFC -- they are internal anchors that should not be exported, as there is no general expectation that once exported to plain text, a receiver will have sufficient context for making sense of them in the way the originator was dealing with them internally. By rights, this whole problem of synchronizing the internal anchor points for various ruby schemes should have been handled by noncharacters -- but that mechanism was not really understood and expanded sufficiently until after the interlinear annotation characters were standardized. --Ken
Double Macrons on gh (was Re: Tildes on Vowels)
A propos of this long thread about display of combining macrons in Middle English, morphing from tildes on vowels:

In Mozilla 2002072104, Windows XP, I get perfectly good overlines on yogh (now). I'd be interested in seeing how it looked with the combining macra.

Please note that both the UTC and WG2 have approved a new set of combining double accents: U+035D COMBINING DOUBLE BREVE, U+035E COMBINING DOUBLE MACRON, U+035F COMBINING DOUBLE LOW LINE, for various transcriptions, including common English dictionary pronunciation guide usages. Once these become available in Unicode 4.0, I believe the preferred representation to use for the gh-digraph-overlined would be:

g, combining-double-macron, h

Now, the question is, how long will it take for the fonts and browsers to catch up on those forms, as well?? It might make sense to start testing them now with:

n, combining-double-tilde, g

to see how well they do. (U+0360 COMBINING DOUBLE TILDE)

--Ken

P.S. I'm getting fine display of all the combining marks for the St. Erkenwald test page with MSIE 6.0 running on Windows NT 4.0 (!) with Arial Unicode MS -- only the yoghs are missing. So I'm not sure what the problem is that people are having on Windows XP.
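The recommended sequences can be built and checked directly with Python's `unicodedata` (an illustrative sketch; the character names below are those of the released standard):

```python
import unicodedata

# g + COMBINING DOUBLE MACRON + h: the double diacritic logically
# follows the first base character and visually spans to the second.
gh = "g\u035eh"
assert len(gh) == 3
assert unicodedata.name("\u035e") == "COMBINING DOUBLE MACRON"

# The sequence is already in canonical form: there is no precomposed
# equivalent, so NFC and NFD both leave it unchanged.
assert unicodedata.normalize("NFC", gh) == gh
assert unicodedata.normalize("NFD", gh) == gh

# Same pattern for the test case with the already-encoded double tilde:
ng = "n\u0360g"
assert unicodedata.name("\u0360") == "COMBINING DOUBLE TILDE"
```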
Re: Taboo Variants
Lest everyone go scrabbling off the deep end and drown on this particular thread, I would like to point out the following facts: U+2FDF IDEOGRAPHIC TABOO VARIATION INDICATOR was accepted by the UTC on April 30, 2002. However, when the proposal was taken into WG2 it met a wall of opposition led by China. WG2 did *NOT* accept the character, and it is not a part of the FPDAM 2 currently being ballotted for inclusion in 10646. The UTC will have to deal with this mismatch (along with a number of others) in its upcoming meeting this month. China's clear preference is to simply encode all the taboo variants as separate characters. At the WG2 meeting, they pointed out a number of instances already encoded in Extension B, as you have. And with China not wanting an IDEOGRAPHIC TABOO VARIATION INDICATOR encoded, many other members of WG2 will defer to their opinion on the topic. This issue clearly needs to be worked further in the IRG context before a consensus will emerge. At any rate, don't consider it a done deal. What matters is what eventually gets published in the final, approved Amendment 2 for ISO/IEC 10646, which *will* match what we publish in Unicode 4.0. --Ken
Re: Furigana
Stefan wrote:

Many Japanese word processors already have that capability. HTML4 has a ruby tag exactly for that purpose. And Unicode has characters for that purpose, too.

Unicode: U+FFF9 kanji U+FFFA furigana U+FFFB
HTML4: <ruby><rb>kanji</rb><rt>furigana</rt></ruby>
Examples: U+FFF9 漢字 U+FFFA ふりがな U+FFFB / 漢字ふりがな

Ah, but read the caveats carefully. The Unicode interlinear annotation characters are *not* intended for interchange, unlike the HTML4 ruby tag. See TUS 3.0, p. 326. They are, essentially, internal-use anchor points.

--Ken
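The parallel between the two representations can be made mechanical. This is a hypothetical sketch (not part of either standard) that converts a flat, non-nested U+FFF9...U+FFFB annotation into the corresponding ruby markup:

```python
def annotation_to_ruby(text: str) -> str:
    """Convert flat (non-nested) interlinear annotations to HTML ruby.
    U+FFF9 opens the annotated (base) text, U+FFFA starts the
    annotating text, and U+FFFB terminates the annotation."""
    return (text.replace("\ufff9", "<ruby><rb>")
                .replace("\ufffa", "</rb><rt>")
                .replace("\ufffb", "</rt></ruby>"))

print(annotation_to_ruby("\ufff9\u6f22\u5b57\ufffa\u3075\u308a\u304c\u306a\ufffb"))
# <ruby><rb>漢字</rb><rt>ふりがな</rt></ruby>
```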
Compatibility and Politics (was Re: Digraphs as Distinct Logical Units)
Roozbeh asked: Expecting the compatibility decompositions to serve this purpose effectively is overvaluing what they can actually do. I would love to hear your opinion about what compatibility decompositions *are* for, then. I feel a little confused here. They are helpful annotations to an earlier version of the standard that got swept up first by changing expectations and then were caught in a normative stasis trap by the normalization specification. Originally, they were a shorthand way of saying things like: This character is not really a 'good' Unicode character -- it should be thought of as a font variant of X. This character is not really a 'good' Unicode character -- it should be thought of as effectively representing the sequence of X, Y, and Z. And so on. The terminology of compatibility character confused everyone, including the people writing the standard, since it meant, on the one hand, characters that didn't really fit the Unicode text model, but which were encoded for compatibility with important standards, for ease of round-trip conversions, mostly. On the other hand, it came to mean characters that had compatibility decompositions, once those were officially specified in the Unicode 2.0 publication, since most compatibility characters had compatibility decompositions. This situation was further confused by the abortive early attempt to encode compatibility characters in a compatibility zone, which resulted in people assuming that if a character was in that zone it automatically *was* a compatibility character and (later) that it should also have a compatibility decomposition. However, compatibility decompositions were originally assigned pretty much by a seat-of-the-pants method, without a clear implementation model to guide all of the decisions. 
As the UTC approached the critical milestone of Unicode 3.0 (and normalization), many of the earlier decompositions were refined and further rationalized, but they still retained some of the helter-skelter context of their annotational origins. The intuition was that the compatibility decompositions sort of made sense for such things as fallback, loose comparison (e.g. for collation and searching), normalizing, and such. However, when detailed specifications started to be written for such things, guided by implementation experience, it turned out that the compatibility decompositions were typically in the ballpark, as it were, but not correct in detail for any one purpose, let alone all purposes.

And the publication of UAX #15 Normalization drastically turned things on their head. Instead of being annotational, and fixable, compatibility decompositions became part of the normative definition of NFKD and NFKC, and became unfixable, because of the requirements of normalization stability. So post-Unicode 3.0, the right way to think of the compatibility decomposition mappings is as the normative data used to define NFKD and NFKC. They bear some resemblance to relationships between characters and character sequences that may be useful in other processes, but in *all* cases should not be taken as a sufficiently precise set of classifications and equivalences for other processes -- there are always going to be exceptions, particularly since compatibility decompositions can no longer be fixed as a result of tuning based on implementation experience.

providing backup rendering when they lack the glyph,

This seems unlikely to be particularly helpful in this *particular* case.

Believe me, it really is. I'm implementing char-cell rendering for Arabic terminals, and when it comes to Arabic ligatures, since I don't want to get into a mess of double width things, I just decompose that ligature, and render the equivalent string. 
It's not as genuine as it might be, but it's automatic, simple, clean, and conformant.

For this kind of application, then, you simply add on decompositions for whatever else cannot be conveniently rendered in a char-cell. Arabic terminal applications have often already departed from what the Unicode Standard specifies in the way of compatibility decompositions by doing special handling of character tails in a separate cell, for example. Note that there isn't any compatibility mapping for U+FEB1 (isolated seen) -- U+FEB3 (initial seen) + U+FE73 (tail fragment), even though that might be what an Arabic terminal could do for display. It isn't non-conformant with the Unicode Standard to transform Unicode characters to alternate representations -- such as a glyph stream for terminal rendering -- it would only be nonconformant to *claim* that such a glyph stream is NFKD data when it departs from that specification.

One other point: We like to discourage the usage of Arabic Presentation Forms, don't we?

Of course. They are compatibility characters for working with the existing legacy code pages that encoded Arabic that way. That is mentioned in TUS 3.0 at the end of the chapter about Arabic. All the
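The distinction Ken draws here -- compatibility decompositions as the normative input to NFKD/NFKC, rather than a general-purpose equivalence -- is easy to observe with Python's `unicodedata` (an illustrative sketch using the Arabic presentation form discussed above):

```python
import unicodedata

# U+FEB1 ARABIC LETTER SEEN ISOLATED FORM is a compatibility character:
# its compatibility decomposition maps it to the nominal letter U+0633.
assert unicodedata.normalize("NFKD", "\ufeb1") == "\u0633"

# NFD, by contrast, leaves it alone -- the mapping is compatibility-only,
# not canonical.
assert unicodedata.normalize("NFD", "\ufeb1") == "\ufeb1"

# decomposition() shows the formatting tag recorded in UnicodeData.txt.
print(unicodedata.decomposition("\ufeb1"))  # <isolated> 0633
```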
Re: Digraphs as Distinct Logical Units
At 04:48 PM 02-08-02, Kenneth Whistler wrote: ... and some extreme case orthographies are known that employ up to *hepta*graphs! Ooo, I want one! Do you have any examples, Ken? If I recall correctly, that one was a technical orthography of Nama -- but I can't track down an online reference at the moment. In the meantime, for a sampler of some of the wild multigraphs used in various orthographies for Khoi and San languages, try http://www.african.gu.se/khsnms.html Examples: '//Ng -- there's a pentagraph for you. //Kx', //Kh' and so on. -- //Kh'en P.S. The San peoples are now apparently vigorously objecting to being lumped with the Khoi peoples as Khoisan. See: http://allafrica.com/stories/200104270244.html
Re: Missing character glyph- example
As a clarification, here is a sample web page: http://www.cardbox.com/missing.htm The requirement is to be able to display the first paragraph of the page in such a way that it makes sense in its reference to the text on the rest of the page. The character after the word this: in the first paragraph cannot be reliably represented by any existing Unicode character. Nevertheless, I believe it is legitimate to want to say what the first paragraph says.

Well, I would put it differently, if it were my web page. Rather than:

    If any of the following text contains characters such as this: {blort} then please change to a different font, or download a more recent version of your current font.

I would suggest something more along the line of:

    If you have trouble displaying any of the characters in the text on this page, please consult <a href="xxx.html">Troubleshooting Display Problems</a>.

Then the troubleshooting page could provide a nice explanation of the problem, show several neatly formatted *graphics* of the kind of nondisplayable glyph issues (with alternate forms picked from various fonts) that a user might run into, and then give helpful links to actual font resources that would help, or in the case of specialized data, actually provide a usable font directly. Such an approach:

A. Avoids font-specific circularity in your attempt to explain to a user what is going on when the display is broken.

B. Provides much more useful information that will actually have a better chance of helping the user get by the problem. Also, since the problem(s) may not only be some nondisplayable glyphs, the approach is extensible for whatever display help is needed.

C. Doesn't depend on dubious assignments of a code point in Unicode for a confusing (non-)use.

But if you insist on having a code point to stick directly in a sentence like that above, I'd take the cue from James Kass: The missing glyph is the first glyph in any font. 
This is mapped to U+0000 and the system correctly substitutes the glyph mapped to U+0000 any time a font being used lacks an outline for a called character. Thus, you have a reasonably good chance that if you try to purposefully display the character U+0000, you will get the missing glyph for the font in use. (Unless the application is filtering out NULL characters.)

--Ken
Re: Missing character glyph
Asmus wrote: At 08:40 PM 7/30/02 -0700, Doug Ewell wrote: a code-point that has no character assigned to it (and is not likely to get one), e. g. U+03A2 No code point is safe. True enough. But then I figure Plane 13 characters like U+DEAD1 are pretty unlikely to be assigned to a character in our lifetimes (or our children's lifetimes). That one is *reasonably* safe to use as an example. ;-) --Ken *remembers when he used to use 0xdeadbeef as a magic number in tests because it was easy to spot in hex displays* A./
Re: REALLY *not* Tamil - changing scripts (long)
It's *much* easier -- and, in the long term, safer -- for them to select from the extensive inventory of characters available in Unicode and to avoid using ASCII punctuation characters with redefined word-building semantics.

I don't get what you are saying here, why should people be limited to ASCII punctuation characters?

That isn't what Peter was saying. You are confused here by your misinterpretation of what he was saying. The recommendation that Peter was making is that people devising orthographies for languages should stick to Unicode letters for the letters of their orthography. (If the script in question is Latin, as most new orthographies are, then there are *hundreds* of Latin letters to choose from in the standard.) What orthography developers should avoid is using characters like 7 ! $ ' and so on as letters of their orthography, since those are certain to cause all kinds of havoc with word-break and other processes for standard software -- or even lead to the kind of absurdities as people wanting illegal constructs like: 'jo'Abr@cd@br.com, which locales can*not* fix.

Just as choices about rational orthographies used to have to take ease of use on typewriters as a major factor involved (to fail to do so would be to condemn legions of people to wretched inefficiency) -- so choices about new rational orthographies should now be taking ease of use on computers as a major factor involved. That is just a realistic approach that any *serious* deviser of an orthography should be taking into account.

With GNU libc you can declare your own set of punctuation characters in the locale, and they can be any 10646 character.

Peter was talking about the opposite case. But you should examine carefully what the implications are of your suggestion here. 
If I were to make the absurd choice of picking 18 Chinese characters to serve as my punctuation characters, and then went through the exercise of declaring my own locale with GNU libc, I would only be guaranteeing that my locale (and all my text data) would only function correctly in a microscopic environment that I defined (or could browbeat a few others to share). The reason for sticking to the Universal Character Set and for sticking to standardized properties for the characters in that set is to guarantee widespread interoperability and to guarantee that my text, in my language, works correctly in all off-the-shelf software -- not merely in my own hacked-up locale. Serious orthography designers should not allow themselves to get stuck in such dead-end traps. --Ken Or are you referring to the specific locale syntax from POSIX/TR 14652? Kind regards Keld
Re: REALLY *not* Tamil - changing scripts (long)
Keld wrote: In Linux, *Which* Linux? :-) Caldera OpenLinux, Corel Linux, Debian GNU/Linux, Elfstone Linux, Libranet Linux, Linux-Mandrake, Phat Linux, Red Hat Linux, Slackware Linux, Stampede GNU/Linux, Storm Linux, SuSE Linux, or TurboLinux? Or for that matter another dozen international distribution Linuxes, or a half-dozen on the Macintosh? for a specific locale, it is relatively easy to get the new locale to work on all off-the-shelf software: you need to write the locale, and submit it to the glibc people, but then - in about 6 months or so, it would be available on all mainstream new Linux distributions, off the shelf. While most of the Linuxes do make use of GNU/C, they don't all do so at the same levels or with the same versions of glibc, and certainly not all at the same times. And all applications would adhere to it, given Linux' advanced i18n technology. I think this is talking through your hat a bit. Do you think that Adobe Acrobat Reader 4.0 PDF viewer on Linux-Mandrake is going to just automatically pick up an Ethiopic locale setting because I happened to submit a locale proposal to the glibc people 6 months earlier? I don't think so. --Ken
Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
One that occurs to me might be the Khoisan languages of Africa, which I believe commonly use ! (U+0021) for a click sound. This is almost exactly the same problem you are describing for Tongva. U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was encoded precisely for this. It is to be *distinguished* from U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems which would attend having a punctuation mark as part of your letter orthography. A Khoisan orthography keyboard should distinguish the two characters (if, indeed, it makes any use at all of the exclamation mark per se), so that users can tell them apart and enter them correctly. --Ken
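The distinction is visible directly in the standard's character properties. The following is a small illustrative sketch (using Python's `unicodedata` and `re` modules, not anything specific to a Khoisan orthography): the click letter carries General Category Lo and so behaves as a word character, while the exclamation mark is Po and breaks words.

```python
import re
import unicodedata

# U+01C3 LATIN LETTER RETROFLEX CLICK is a letter (General Category Lo);
# U+0021 EXCLAMATION MARK is punctuation (Po).
print(unicodedata.category("\u01C3"))  # Lo
print(unicodedata.category("!"))       # Po

# Generic word matching keeps the click letter inside the word, but
# splits the word at the exclamation mark.
print(re.findall(r"\w+", "a\u01C3a a!a"))  # ['aǃa', 'a', 'a']
```

This is exactly the kind of off-the-shelf behavior an orthography inherits for free by using the letter, and forfeits by overloading the punctuation mark.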
God's and devil's details (was: Re: Unicode certification - quote correction and attribution)
[Tex Texin] Actually, (or so I have heard) it is God dwells in the details of our work; I have seen it attributed to Einstein, more generally to shakers, and others. So Ludwig might have been quoting others. [Ken Whistler] And the devil is in the details. Looking a bit at your suggestions, [James Agenbroad] No, God is in the details, Ludwig Mies van der Rohe (1886-1969) said. And the Word Court rules: http://www.theatlantic.com/issues/2000/01/001wordcourt.htm And since I'd rather be associated with the likes of Einstein, Flaubert, and van der Rohe than Nitze, Reagan, and Perot, maybe I'll shift back to God is in the details. --Ken Der liebe Gott lebt im Detail. Le bon Dieu est dans le detail. And that's the beauty of Unicode IMHO.
Re: God's and devil's details (was: Re: Unicode certification - quote correction and attribution)
The correct Einsteinian German appears to be: Der liebe Gott steckt im Detail (cf. http://www.benecke.com/einsteinprogramm.html) (and there are German alternatives such as Gott lebt im Detail) and the satanic alternate is: Der Teufel liegt im Detail (very common, actually, but maybe just calqued from English) Who knows, maybe the concepts were borrowed from Latin to begin with, anyway. And as we can see from this thread God and the Devil do seem to be in the details! --Ken
Re: Abstract character?
Following up on several responses on this thread. Mark Davis said: A small correction to Ken's message: The Unicode scalar value definitionally excludes D800..DFFF, which are only code unit values used in UTF-16, and which are not code points associated with any well-formed UTF code unit sequences. The UTC has decided to make scalar value mean unambiguously the code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate code points. Correct. While surrogate code points cannot be represented in UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate code points are illegal in all UTFs; notably, they are legal in UTF-16. Not to pick nits here... oh well, o.k., I'll pick nits. I stated that D800..DFFF ... are not code points associated with any well-formed UTF code unit sequence. I believe, as stated, that that is correct. An isolated surrogate in UTF-16 is *not* a well-formed UTF code unit sequence. Even by the disputed text of Unicode 3.0, an isolated surrogate code unit in UTF-16 would be an irregular code value sequence. And with the updated relevant text in Unicode 3.2, I think there is even less wiggle-room. The last vestige of irregular code unit sequence vanished in Unicode 3.2 when the loophole for UTF-8 was closed. The Unicode 3.2 standard now reads: Terminology to distinguish ill-formed, illegal, and irregular code unit sequences is no longer needed. There are no irregular code unit sequences, and thus all ill-formed code unit sequences are illegal. It is illegal to emit or interpret any ill-formed code unit sequence. Unicode 4.0 will revise the terminology and conformance clauses in light of this. Ken is pushing for this change; I believe it would be a very bad idea. I believe it is a worse idea to carry forward the claim that (isolated) surrogate code points cannot be represented in UTF-8 (as is definitely the case for Unicode 3.2) while they can be represented in UTF-16.
(I think the reasons have already appeared on this list, so I am not trying to reopen the discussion; just state the current situation.) Doug Ewell followed up: UTF-16 does not allow the representation of an unpaired surrogate 0xD800 followed by another, coincidental unpaired surrogate 0xDC00. (It maps the two to U+10000.) Among the standard UTFs, only UTF-32 allows the two to be treated as unpaired surrogates. Actually, not that, either. In fact, before UTF-8 was tightened up in 3.2, the only UTF that DID NOT permit these two coincidental unpaired surrogates was UTF-16. UTF-8: D800 DC00 == ED A0 80 ED B0 80 (no longer legal) UTF-32: D800 DC00 == D800 DC00 This is ill-formed in UTF-32, and thereby, illegal. - but - UTF-16: D800 DC00 == D800 DC00 == 10000 David Hopwood responded: I think it would be a mistake for the standard to refer to surrogate code points. I think this was already definitely decided by the UTC. The term code point is used for other CCS's where there may also be gaps in the code space; in that case, the gaps are not considered valid code points. I am sympathetic with this point of view, but it isn't easy to draw such a line in practice. Look at the various Asian DBCS sets -- they often had ranges of byte values that were considered invalid as parts of encoded characters, and if you mapped them out to an integral space, you would end up with ranges of integers that were invalid as code points. But when push came to shove, various of these encodings just appropriated some of these ranges to extend themselves, and filled them with more characters. What was an invalid code point became a valid (and assigned) code point. When 0xD800..0xDFFF are used in UTF-16, they are used as code units, not code points. As Unicode code points, 0xD800..0xDFFF are (or at least should be) invalid in the same sense that 0x110000 is. As Unicode code points they are invalid in a different sense than 0x110000 is, actually.
0x110000 could, by the integral transforms involved, be represented by UTF-8 or by UTF-32, but not by UTF-16. 0xD800 could, in principle, be represented by UTF-16, if you allowed the range, but is ruled to be ill-formed in all three UTF's, to avoid the kinds of irregular sequences that the UTC was just at pains to eliminate. I.e. IMHO Unicode scalar value and Unicode code point should be synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF. I think the distinction in ranges is a useful one, since it allows for a bijective definition of the UTF's, based on the Unicode scalar value, but it also gives a meaning to the complete integral range for the code points, as demanded by some of the implementers. code point should be defined as an integer corresponding to an encoded character in any CCS, not just Unicode. This doesn't really work, since it doesn't account for the unassigned (reserved) code points, nor the noncharacters. The Unicode architecture for its codespace is
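As a present-day illustration of the tightened rules discussed in this thread (a Python sketch; the language's standard codecs happen to implement the post-3.2 behavior): a lone surrogate is rejected by every UTF encoder, while the code unit pair D800 DC00 in UTF-16 is well-formed and denotes the single code point U+10000.

```python
# A lone surrogate code point is ill-formed in every encoding form.
lone = "\ud800"
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    try:
        lone.encode(codec)
    except UnicodeEncodeError:
        print(codec, "rejects a lone surrogate")

# In raw UTF-16 bytes, the code units D800 DC00 are a well-formed pair:
# they decode to the single code point U+10000, not to two surrogates.
pair = b"\xd8\x00\xdc\x00"
print(hex(ord(pair.decode("utf-16-be"))))  # 0x10000
```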
Re: Abstract character?
Lars Marius Garshol asked: I'm trying to find out what an abstract character is. I've been looking at chapter 3 of Unicode 3.0, without really achieving enlightenment. The term Unicode scalar value (apparently synonymous with code point) seems clear. It is the identifying number assigned to assigned Unicode characters. Here is one of my attempts at a more rigorous term rectification: Abstract character that which is encoded; an element of the repertoire (existing independent of the character encoding standard, and often identifiable in other character encoding standards, as well as the Unicode Standard); the implicit basis of transcodings. Note that while in some sense abstract characters exist a priori by virtue of the nature of the units of various writing systems, their exact nature is only pinned down at the point that an actual encoding is done. They are not always obvious, and many new abstract characters may arise as the result of particular textual processing needs that can be addressed by characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER, etc., etc.) Code point A number from 0..10FFFF; a point in the codespace 0..10FFFF. Encoded character An *association* of an abstract character with a code point. Unicode scalar value A number from 0..D7FF, E000..10FFFF; the domain of the functions which define UTF's. The Unicode scalar value definitionally excludes D800..DFFF, which are only code unit values used in UTF-16, and which are not code points associated with any well-formed UTF code unit sequences. Assignment (of code points) Refers to the process of associating abstract characters with code points. Mathematically a code point is assigned to an abstract character and an abstract character is mapped to a code point. This is distinguished from the vaguer sense of assigned in general parlance as meaning a code point given some designated function by the standard, which would include noncharacters and surrogates. So far, so good.
Some questions: - are all assigned Unicode characters also abstract characters? Yes. Or rather: all encoded characters are assigned to abstract characters. (See above for my distinction between assigned and designated, which would apply to noncharacters and surrogate code points -- neither of which classes of code points get assigned to abstract characters.) - it seems that not all abstract characters have code points (since abstract characters can be formed using combining characters). Is that correct? Yes. (Note above -- abstract characters are also a concept which applies to other character encodings besides the Unicode Standard, and not all encoded characters in other character encodings automatically make it into the Unicode Standard, for various architectural reasons.) - do U+00C5 (Å) and U+0041, U+030A (A followed by combining ring above) represent the same abstract character? Yes. That is the implicit claim behind a specification of canonical equivalence. --Ken Would be good if someone could clear this up. -- Lars Marius Garshol, Ontopian URL: http://www.ontopia.net ISO SC34/WG3, OASIS GeoLang TC URL: http://www.garshol.priv.no
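That implicit claim can be checked mechanically; a minimal Python sketch using the standard `unicodedata` module:

```python
import unicodedata

precomposed = "\u00C5"        # Å, LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "\u0041\u030A"   # A + COMBINING RING ABOVE

# Canonical equivalence: both spellings normalize to the same sequences.
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed
print("U+00C5 and U+0041 U+030A are canonically equivalent")
```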
Re: ISO/IEC 10646 versus Unicode
Marion Gunn wrote: The immediate attraction and great advantage of Unicode's vision was its simplicity/focus: after an unsteady and argumentative start, its founders committed Unicode to the IMPLEMENTATION of 10646, and became very specific (loud) about not calling it a STANDARD (note to newcomers - check out the archives of the relevant lists). Well, I'm one of the founders, and I don't recall this particular dichotomy, certainly not LOUDLY stated. I dug around for awhile in my own collection of 1989 - 1993 email, and didn't find any obvious such claims, although I could well have missed someone's assertion. Perhaps you can cite an example of what you are talking about. In any case, the existence of the Unicode Standard, published as a *standard* in 1991, with Volume 2 in 1992, clearly self-proclaiming its status as a standard, would seem to belie your claim. Read the text -- even in Volume 2 of Unicode 1.0, published while the merger was underway, and containing a number of pages devoted to the details of how the repertoire of the Unicode Standard was synched with the then to-be-published ISO/IEC 10646-1:1993, the Unicode Standard didn't proclaim that it was merely an implementation of 10646. Sample of that text: ... These additional elements do not create incompatibility between the Unicode standard and ISO DIS 10646. They are summarized here in order to clarify the relationship between the two standards... While ISO 10646 contains no means of explicitly identifying or 'declaring' Unicode values as such, the Unicode standard may be considered as encompassing the entire repertoire of 10646 and having the following profile values: ...-- p. 3 I expected the ad hoc Unicode consortium itself to voluntarily disband in 3-5 years (wrong again) having successfully fulfilled its brief of producing implementations of 10646 with flying colours (again wrong, as it has yet to do that). I think this is a misunderstanding of the self-understood brief of the Unicode Consortium.
It was ad hoc, certainly, but its purpose was not producing implementations of 10646. The original Purpose of the Unicode Consortium, as stated in the Bylaws filed in the Articles of Incorporation of the corporation on January 3, 1991 was: This Corporation's purpose shall be to standardize, maintain and promote a standard fixed-width, 16-bit character encoding that provides an allocation for more than 60,000 graphics characters. That was two years *before* ISO/IEC 10646-1:1993 was published. To reflect changing reality, following the publication of the Unicode Standard and the introduction of encoding forms (UTF-*), the Bylaws have subsequently been amended to: This Corporation's purpose shall be to extend, maintain and promote the Unicode Standard. This was and is quite clear. The Unicode Consortium is a standardization organization, and its activities revolve around the care and support of the Unicode Standard. It never has been a group just dedicated to figuring out how to implement 10646. but that does not mean any withdrawal of EGTs initial and longstanding support of Unicode, in principle (although it seems to have produced only one thing to date, viz., a book called The Unicode Standard (where I expected to read Implementation). See above. --Ken
Re: Basic question: types of diacritics marks
Adam asked: I have a very basic question. What would be the implementation differences of diacritic marks in a font? For example, we'd consider: U+00B4 acute accent U+02CA modifier letter acute accent U+0301 combining acute accent What are the common recommendations regarding the glyphs in a font (TrueType), especially with respect to the metrics? Should I support all three above codepoints? If so, can I do this with one glyph? Or should I provide separate glyphs? To elaborate on what Michael Everson said, I think the answer here is that you should probably provide separate glyphs. U+00B4 would typically have the spacing width of an en, or thereabouts, since it is the spacing clone of a combining mark acute, and on average, you would expect it to have an en character width. It also gets used for fallback displays, as for Latin-1 `curly´ quotes using grave and acute instead of real quotation marks (an extension of ASCII `curly' quotes using grave and apostrophe), for primes in character sets that don't really have one (also as an alternate to apostrophe) [cf. U+2032], for email-type indication of accents on l´et´t´ers´ that you don't have actual codes for, and the like. So you need to make it look appropriate for such uses. U+02CA should typically be a little narrower (I think). It really is a modifier letter intended to precede or follow a regular letter, usually indicating a tone or stress for a syllable (as an alternative to the acute actually placed over a letter in the same function). And U+0301 needs to be rendered over letters. Its exact placement will depend on the width and height of the letter it is placed over. Of course, your mileage may vary, depending on what you are trying to do with your font design. And John Hudson provided the technical details regarding what happens inside the font. And, briefly, what are the principal differences between the three types of marks? Michael Everson answered this one in terms of functionality. --Ken
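The functional split between the three is also recorded in their General Categories (Sk spacing symbol, Lm modifier letter, Mn nonspacing mark), which a layout engine can key off of; a quick Python check, illustrative only and nothing font-specific:

```python
import unicodedata

# The three acute-accent characters carry three different General
# Categories: Sk (symbol), Lm (modifier letter), Mn (nonspacing mark).
for cp in (0x00B4, 0x02CA, 0x0301):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")
```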
What Unicode Is (was RE: Inappropriate Proposals FAQ)
Suzanne responded: Maybe Unicode is more of a shared set of rules that apply to low level data structures surrounding text and its algorithms than a character set. Sounds like the start of a philosophical debate. If Unicode is described as a set of rules, we'll be in a world of hurt. (On a serious note, these exceptions are exactly what make writing some sort of is and isn't FAQ pretty darned hard.) Hmm. Since the discussion which started out trying to specify a few examples of what kinds of entities would be inappropriate to proffer for encoding as Unicode characters seems to be in danger of mutating into the recurrent What is Unicode? question, perhaps it's time to start a new thread for the latter. And now for some ontological ground rules. When trying to decide what a thing is, it helps not to use an attribute nominatively, since that encourages people to privately visualize the noun the attribute is applied to, but to do so in different ways -- and then to argue past each other because they are, in the end, talking about different things. Unicode is used attributively of a number of things, and if we are going to start arguing/discussing what it is, it would be better to lay out the alternatives a little more specifically first. 1. The Unicode *Consortium* is a standardization organization. It started out with a charter to produce a single standard, but along the way has expanded that charter, in response to the desire of its membership. In addition to The Unicode Standard, it now has adopted a terminology that refers to some of its other publications as Unicode Technical Standards [UTS], of which two formally exist now: UTS #6 SCSU, and UTS #10 Unicode Collation Algorithm [UCA]. It is important to keep this straight, because some people, when they say Unicode are talking about the *organization*, rather than the Unicode Standard per se.
And when people talk about the standard, they are generally referring to The Unicode Standard, but the Unicode Consortium is actually responsible for several standards. 2. The Unicode *Standard* itself is a very complex standard, consisting of many pieces now. To keep track of just what something like The Unicode Standard, Version 3.2 means, we now have to keep web pages enumerating all the parts exactly -- like components in an assemble-your-own-furniture kit. See: http://www.unicode.org/unicode/standard/versions/ In any one particular version, the Unicode Standard now consists of a book publication, some number of web publications (referred to as Unicode Standard Annexes [UAX]), and a large number of contributory data files -- some normative and some informative, some data and some documentation. These definitions, including the exact list of contributory data files and their versions, are themselves under tight control by the Unicode Technical Committee, as they constitute the very *definition* of the Unicode Standard. It is not by accident that the version definitions start off now with the following wording: The Unicode Standard, Version 3.2.0 is defined by the following list... and so on for earlier versions. 3. The Unicode *Book* is a periodic publication, constituting the central document for any given version of the Unicode *Standard*, but is by no means the entire standard. The book, in turn, is very complex, consisting of many chapters and parts, some of which constitute tightly controlled, normative specification, and some of which is informative, editorial content. The book now also exists in an online version (pdf files): http://www.unicode.org/unicode/uni2book/u2.html which is *almost* identical to the published hardcover book, but not quite. (The Introduction is slightly restructured, the online glossary is restructured and has been added to, the charts are constructed slightly differently and have introductory pages of their own, etc.) 4. 
The Unicode *CCS* [coded character set] is the mapping of the set of abstract characters contained in the Unicode repertoire (at any given version) to a bunch of code points in the Unicode codespace (0x0..0x10FFFF). Technically speaking, it is the Unicode *CCS* which is synchronized closely with ISO/IEC 10646, rather than the Unicode *Standard*. 10646 and the Unicode CCS have exactly the same coded characters (at various key synchronization points in their joint publication histories), but the *text* of the ISO/IEC 10646 standard doesn't look anything like the *text* of the Unicode Standard, and the Unicode Standard [sensum #2 above] contains all kinds of material, both textual and data, that goes far beyond the scope of 10646. There are other standards produced by some national bodies that are effectively just translations of 10646 (GB 13000 in China, JIS X 0221 in Japan), but the Unicode Standard is nothing like those. Finally, the attribute Unicode ... can be applied to all kinds of other things characteristic of the Unicode Standard, including algorithms for the
Re: Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ
Barry Caplan wrote: At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote: Unicode is a character set. Period. Each character has numerous properties in Unicode, whereas they generally don't in legacy character sets. Each character, or some characters? For all intents and purposes, each character. So, each character has at least one attribute. Yes. The implications of the Unicode Character Database include the determination that the UTC has normatively assigned properties (multiple) to all Unicode encoded characters. Actually, it is a little more subtle than that. There are some properties which accrue to code points. The General Category and the Bidirectional Category are good examples, since they constitute enumerated partitions of the entire codespace, and API's need to return meaningful values for any code point, including unassigned ones. Other properties accrue more directly to characters, per se. They attach to the abstract character, and get associated with a code point more indirectly by virtue of the encoding of that character. The numeric value of a character would be a good example of this. No one expects an unassigned code point or an assigned dingbat character or a left bracket to have a numeric value property (except perhaps a future generation of Unicabbalists). There are no corresponding features in other character sets usually. Correct. Before the development of the Unicode Standard, character encoding committees tended to leave such property assignments either up to implementations (considering them obvious) or up to standardization committees whose charter was character processing -- e.g. SC22/WG15 POSIX in the ISO context. The development of a Universal character encoding necessitated changing that, bringing character property development and standardization under the same roof as character encoding. Note that not everyone agrees about that, however.
We are still having some rather vigorous disagreements in SC22 about who owns the problem of standardization of character properties. A common definition of character set is a list of characters you are interested in, assigned to code points. That fits most legacy character sets pretty well, but Unicode is sooo much more than that. Roughly the distinction I was drawing between the Unicode CCS and the Unicode Standard. But what if we took a look at it from a different point of view, that the standard is an agreed-upon set of rules and building blocks for text oriented algorithms? Would people start to publish algorithms that extend on the base data provided so we don't have to reinvent wheels all the time? Well the Unicode Standard isn't that, although it contains both formal and informal algorithms for accomplishing various tasks with text, and even more general guidelines for how to do things. The members of the Unicode Technical Committee are always casting about for areas of Unicode implementation behavior where commonly defined, public algorithms would be mutually beneficial for everyone's implementations and would assist general interoperability with Unicode data. To date, it seems to me that the members, as well as other participants in the larger effort of implementing the Unicode Standard, have been rather generous in contributing time and brainpower to this development of public algorithms. The fact that ICU is an Open Source development effort is enormously helpful in this regard. If I were to stand in front of a college comp sci class, where the future is all ahead of the students, what proportion of time would I want to invest in how much they knew about legacy encodings versus how much I could inspire them to build from and extend what Unicode provides them? This problem, of Unicode in the computer science curriculum, intrigues me -- and I don't think it has received enough attention on this list.
One of my concerns is that even now it seems to be that CS curricula not only don't teach enough about Unicode -- they basically don't teach much about characters, or text handling, or anything in the field of internationalization. It just isn't an area that people get Ph.D.'s in or do research in, and it tends to get overlooked in people's education until they go out, get a job in industry and discover that in the *real* world of software development, they have to learn about that stuff to make software work in real products. (Just like they have to do a lot of seat-of-the-pants learning about a lot of other topics: building, maintaining, and bug-fixing for large, legacy systems; software life cycle; large team cooperative development process; backwards compatibility -- almost nothing is really built from scratch!) The major work ahead is no longer in the context of building a character standard. Time is fast approaching to decide to keep it small and apply a bit of polish, or focus on the use and usage of what is already there in Unicode by those who
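Returning to the earlier point about properties that accrue to code points versus properties that accrue to characters, the difference is easy to observe with Python's `unicodedata` module (U+0378 is chosen only because it is unassigned as of current Unicode versions):

```python
import unicodedata

# General Category partitions the whole codespace: an unassigned code
# point still gets a meaningful answer, 'Cn'.
print(unicodedata.category("\u0378"))  # Cn

# Numeric value, by contrast, attaches to particular characters; a left
# bracket simply has none, and asking for it is an error.
print(unicodedata.numeric("5"))        # 5.0
try:
    unicodedata.numeric("[")
except ValueError:
    print("'[' has no numeric value")
```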
RE: Saying characters out loud (derives from hash, pound,octothorpe?)
Joe sent around a classic version of Waka waka bang splat, but my favorite is a slightly pared-down version set to music for a four-part round, lyrics by Fred Bremmer and Steve Kroese, music by Melissa D. Binde: http://www.roundsing.org/music/waka-waka.html where you can listen to it in its multipart beauty. roundsing.org has other classics such as: I eat my peas with honey, I've done it all my life. It makes the peas taste funny But it keeps them on my knife. --Ken
Re: *Why* are precomposed characters required for backward compatibility?
Dan Oscarsson said: NFD should not be an extension of ASCII. There are several spacing accents in ASCII that should be decomposed just like the spacing accents in ISO 8859-1 are decomposed. All or none spacing accents should be decomposed. In addition to the usage clarifications made by John Cowan and David Hopwood, I should point out a little history here. As of Unicode 2.0, some compatibility decompositions were still provided for U+005E CIRCUMFLEX ACCENT, U+005F LOW LINE, and U+0060 GRAVE ACCENT, along the lines suggested by Dan. However, when normalization forms were being established and standardized in the Unicode 3.0 time frame, it became obvious that these particular compatibility decompositions would lead to trouble. Any Unicode normalization form that would not leave ASCII values unchanged would have been DOA (dead on arrival), because of its potential impact on widely used syntax characters in countless formal syntaxes. The equating of U+005F LOW LINE with a combining low line applied to a SPACE was particularly problematical, since LOW LINE is so widely accepted as an element of identifiers. Because of these complications, the 3 compatibility decompositions were withdrawn by the UTC (unanimously, if I recall correctly), *before* the normalization forms were finally standardized. Consistency in treatment would be nice, but consistency in treatment of the multiply ambiguous ASCII characters of this ilk is impossible at this point. And it would have been very, very, very bad for normalization to have allowed these three, in particular, to have decompositions. --Ken
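The resulting ASCII stability is easy to verify mechanically; a small Python sketch over all 128 ASCII code points:

```python
import unicodedata

# Every ASCII character must come through every normalization form
# unchanged -- the property the withdrawn decompositions for U+005E,
# U+005F, and U+0060 would have broken.
ascii_chars = [chr(c) for c in range(128)]
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert all(unicodedata.normalize(form, ch) == ch for ch in ascii_chars)
print("ASCII is invariant under all four normalization forms")
```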
Re: Variant selectors in Mongolian
Martin Heijdra asked: The statement For example, in languages employing the Mongolian script, sometimes a specific variant range of glyphs is needed for a specific textual purpose for which the range of generic glyphs is considered inappropriate could be taken to mean this solution. Correct. However, the Mongolian table is very glyph-based, and says The valid combinations are exhaustively listed and described in the following table. It seems to imply that medial dotted n is ALWAYS denoted by n-/ (as is undotted initial n). That is, regular ana (dotted) would be a-n-/-a, regular anda would be a-n-d-a (undotted), irregular aNa would be encoded a-n-a (undotted), and irregular aNda (dotted) would be a-n-/-d-a. That is, there would be regular formations marked with the variant selector, and irregular ones unmarked. No, I don't think that is the intent for Mongolian. Which of the two cases is meant by Unicode? Mongolian variants *are* very confusing, and I'm not sure what the best way to describe them is. Part of the problem is that there is still some tension in the UTC regarding just how to define the effect of the variation selectors. Position A: A variation selector selects a particular, defined glyph. That position would, for Mongolian, tend to support your second interpretation. However, ... Position B: A variation selector selects a variant form of a character, which has a distinct rendering from that specified for the character without a variant specification. When applied to Mongolian (or in principle any script like Mongolian), where a character is subject to positional shaping rules, you have to consider that character X is associated with, for example, a *set* of glyphs X - {G1, G2, G3, G4} depending on positional contexts. A variant of character X might be associated with a variant *set* of glyphs, some of which could overlap, e.g. X-/ -- {G1, G2', G3', G4}, so that the glyphs for the variant might not contrast in all positional (or other) contexts.
The reason the variation selectors were encoded in the first place for Mongolian, I believe, was to try to preserve an Arabic-like model, where the base character would get a character encoding, and it would then be mapped to positionally determined glyphs. But exceptional patterns of that positional determination required some method of marking. The alternative which people saw would have been to just encode all the glyphs: G1, G2, G2', G3, G3', G4, in the above example -- and that approach would have radically departed from the model of how Unicode should encode text. It also would have significantly further complicated Mongolian text processing, it seems to me, since distinct letters, in some positions, have glyphic neutralizations. (Not that it is easy, anyhow!) --Ken
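For reference, the machinery itself is ordinary character coding: the free variation selectors are encoded characters in their own right (nonspacing marks), and a variant request is just the base letter followed by one of them. A small Python sketch (the choice of U+1822 MONGOLIAN LETTER I is arbitrary, purely for illustration):

```python
import unicodedata

# The three Mongolian free variation selectors, U+180B..U+180D.
for cp in (0x180B, 0x180C, 0x180D):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")

# A variant form is requested in plain text simply by following the base
# letter with an FVS; the font's shaping rules do the rest.
variant = "\u1822\u180B"   # MONGOLIAN LETTER I + FVS1
print([f"U+{ord(c):04X}" for c in variant])  # ['U+1822', 'U+180B']
```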
Re: Definition of character: Exegesis of SC2 nomenclature
Martin Kochanski waxed exuberantly: I mention this because Unicode is the opposite of Procrustean. There is no finer antidote to gloom and cynicism than leafing through the Unicode Standard. In what other computing book could you find a phrase such as In good Latvian typography? Or: The king's primary purpose was to bring Buddhism from India to Tibet ? and Character Most Resembling a Frog (this is left as an exercise for the reader). Telugu U+0C0A. But then, perhaps I had an unfair start. ;-) --Ken
Re: Variant selectors in Mongolian
John Hudson wrote: Mongolian variants *are* very confusing, and I'm not sure what the best way to describe them is. Part of the problem is that there is still some tension in the UTC regarding just how to define the effect of the variation selectors. Position A: A variation selector selects a particular, defined glyph. That position would, for Mongolian, tend to support your second interpretation. However, ... Position B: A variation selector selects a variant form of a character, which has a distinct rendering from that specified for the character without a variant specification. The inclusion of variant selectors in Unicode uncomfortably blurs the line between character processing and glyph processing. True enough. But they are an attempt to keep a finger in the dike of outright glyph encoding. If you think of the problem with Han variants, you can see that allowing those dike leaks to crumble the dike could result in a veritable inundation of the character encoding with essentially useless alternate forms that would only serve to further blur the line. Or to extend the metaphor, the ground beneath our feet would be so softened, we'd always be trudging around hip-deep in the mud for CJK. The only excuse I can think of for including glyph substitution triggers in plain text is if there are normative stylistic substitutions to be identified by an author as a regular aspect of the writing of a given script, i.e. Ken's Position A. If you are not going to specify what the variant is, what point is there to including the glyph substitution trigger in plain text, since you have no idea what the outcome is going to be in any given font? Actually, I think Position B is a coherent one for Mongolian. The outcome *is* specified -- it is just specified for particular positional contexts, rather than for a single glyph per se. X - {G1, G2, G3, G4}, where Gn is determined by positional (or other) context. 
X-/ - {G1, G2', G3', G4}, where Gn is determined by positional (or other) context. is still determinate, and not contingent on fonts. (Although, of course, if you use fonts that don't have the glyphs G1, G2, G2', G3, G3', G4, or software that can't do the mapping correctly, you are hosed.) It is just more complicated, but fully as determinate as: X - G X-/ - G' The value of the variant selector to the user is in knowing what the result is going to be, and this means that the variant form *must* be specified. It is. See above. How else can the variant selector be used to *select* a particular form? Selection implies a deliberate choice, not a willingness to accept any substitution a font might provide. I agree. Although variation selectors also imply willingness to accept fallback to default glyphs as legible alternatives, if not the desired alternatives. --Ken
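Ken's Position B can be made concrete with a small sketch. The glyph names, position labels, and tables below are invented for illustration (they are not real Mongolian shaping data); the point is only that a variation selector can select a variant glyph *set*, determinate per positional context, without naming any single glyph:

```python
# Hypothetical sketch: a character X maps to a *set* of positionally
# determined glyphs; a variation selector swaps in a variant set that
# overlaps the default set in some positions (Position B above).
DEFAULT = {"isolated": "G1", "initial": "G2", "medial": "G3", "final": "G4"}
VARIANT = {"isolated": "G1", "initial": "G2'", "medial": "G3'", "final": "G4"}

def select_glyph(position, variant_selected=False):
    """Return the glyph for character X in a given positional context."""
    table = VARIANT if variant_selected else DEFAULT
    return table[position]

# The variant contrasts only in some positions...
assert select_glyph("medial") == "G3"
assert select_glyph("medial", variant_selected=True) == "G3'"
# ...and neutralizes with the default glyph in others.
assert select_glyph("final") == select_glyph("final", variant_selected=True)
```

The outcome is fully determinate, just conditioned on position rather than being a single glyph substitution.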
Strange resemblances and weird sisters
Then there is the oft-cited Character Most Resembling a Line Break: MALAYALAM LETTER UU (U+0D0A) Then in Extension B there are many, many weird and wonderful candidates for strangest CJK characters. Some of my personal favorites include: U+26B99 U+20137 U+20572 U+2069C U+2696E With such genetic defects, one would have expected such characters to die out long ago, but Unicode has brought them back to life. And of course, there is always the miraculous proliferation of turtles... ;-) --Ken
RE: Phaistos in ConScript
Michael, Ken. Thanks for your response. Hmm. I think I detect the invisible ironic smiley there. Thanks for broadcasting my private, poke-in-the-ribs response to you and Marco back to the public list. ;-) As I said, the original might (assuming a syllabic structure and assigning random syllable values) well be LABUGIDANO, but when reversed it might read NODAGIBULA which could be a valid linguistic sequence. OK, so reading the whole text you would come up with readings which wouldn't make sense, so you would have to start over with a different directionality. Given the practice of the other scripts in the region, I consider this unlikely given its impracticality. True enough. The people who used scripts with multiple directionalities did reverse the glyphs when reversing the directionality. The inherent directionality of Phoenician BETH or of PLUMED-HEAD or of Egyptian WN (the bunny rabbit) lends itself to the use of such glyph-indicated directionality for text in general. I would not assume, additionally, that the Phaistos script would always be written on disks in spiral formatting. That too would be unlikely and impractical, would it not? Indeed. But what seems to be missing here is even demonstration that we are dealing with a general use script that might be written in other contexts. With only one instance -- and that written on a disk in spiral formatting -- how do you know? I think you may be sticking your neck out rather far (to the left) on this one. I am inclined to agree with Marco about the issue for presentation. Why should you innovate over Godart here in this *particular* instance, based on so little evidence. Because I suspect that Godart might well agree with me -- I don't imagine that he ever considered this aspect of text presentation. And because it makes sense given the context of other scripts in the region. You could be right, but then you could be wrong, too. So could Godart! 
He was describing the disk, not thinking about encoding and presenting it! I'm not saying what you are doing is unreasonable -- but it is not demonstrably uncontroversial. Well that's my opinion anyway. I suppose we could try to contact Godart and ask his opinion. Sounds like a good idea to me. It's not as though the CSUR is normative. True enough. And if you want to get into the fray with all the various and sundry decipherers of the disk, and teach them all to use mirrored glyphs in LTR representations of Phaistos material, then who's to stop you? And after all, there must be several orders of magnitude more instances of Phaistos characters in the secondary literature by now than there are in the primary corpus! --Ken
Re: Phaistos Disk
Michael, At 10:58 -0400 2002-07-05, Patrick Rourke wrote: There is also the question of what kind of text it represents: is it a prose text, is it a catalogue of items (the other Aegean scripts tempt one to suspect this), each item represented by an ideograph, etc.? Well if you look at it you find patterns and repetition in the phrases divided up. It is most likely an actual text. The script is probably syllabic, as syllabic scripts were common back then, So were epic oral storytellers. the repertoire is large enough, and the repetitive markers could well represent grammatical prefixes or suffixes. One guesses, but that's not a bad guess. I'd consider it an equally good guess that the disk was a one-off story-telling memorization aid, sketching out an epic tale and its episodes mnemonically. The plumed head prefixes could equally well represent major protagonists in the tale. If the obviously recognizable pattern of PLUMED-HEAD SHIELD is a prefix (or suffix) in a set of words, then its distribution is fishy on the document -- it is ubiquitous on Side A, then starts the first word of Side B, but then vanishes. That defective distribution casts doubt on it as a common language affix, but does suggest a major actor in a long tale, who dies mid-story as the tale continues. One guesses, but that's not a bad guess. ;-) --Ken
Definition of character: Exegesis of SC2 nomenclature
One possibly interesting thing derived from the threads from hell is the notion that the definition of character offered in the various ISO JTC1/SC2 character encoding standards and TR's such as the Character-Glyph Model (TR 15285) may be leading people astray about what is appropriate to encode as a character. Here is an attempt at an exegesis. The standard SC2 definition of a character is: A member of a set of elements used for the organization, control, or representation of data. [Quoted from ISO/IEC 10646, Clause 4 Terms and definitions, but you can find the same definition in other SC2 standards, including each part of ISO/IEC 8859, and in ISO/IEC 2022.] The *reason* why SC2 chose such a strange and seemingly open-ended definition was *not* to invite arbitrarily strange collections of data control elements to be encoded as characters, but rather an attempt, in a procrustean way, to get the definition to fit the reality. In the ISO 2022 architectural framework for character encodings, specific character set definitions are declared as consisting of one or more sets of graphic characters (G0 and G1 sets) and one or more sets of control functions (C0 and C1 sets), where the graphic characters come from registered (graphic) character encodings and where the control functions come from registered control function sets. The graphic character encodings are the typical character encodings we are familiar with, of which ISO/IEC 8859-1 (Latin-1) is a prototypical example -- a bunch of visible letters, digits, punctuation, and symbols. The control function sets are small sets of functions designed for the manipulation and control of characters in various device contexts (mostly terminal hardware), and consist of things like line advance, moving the cursor back and forwards, indicating start and end of transmission context, marking string delimitations, and the like. 
The best known of these control function sets is defined in ISO 6429, and its C0 set is also grandfathered into ASCII as the familiar ASCII control codes -- the same codes that are listed in Unicode as aliases for U+0000..U+001F (U+0000 null, U+0001 start of heading, ... U+0008 backspace, U+0009 tab, ... etc.) Note that the control functions are not just any imaginable set of functions -- they are functions designed by people interested in controlling characters on existing classes of output display devices (terminals and teletypes, primarily). And not all terminal control functions were defined as control functions in these sets, either. Large classes of such functions were left up to vendor implementation, and made use of ESC(ape) sequences for their initiation. In the context of SC2 character encoding standards, a cover term for character was needed which was broad enough to deal with the existing, on the ground implementation fact that systems included graphic characters *and* control characters mixed in character data streams. The graphic characters were conceived of as representing the content of text, primarily. And the then-existing usage of control characters was primarily to organize and control the representation of such data, by establishing line breaks, page breaks, string or other text unit delimitations, backspacing, and the like. Hence the committee compromise definition of character quoted above. That definition should be understood in the context of this history, however. It is not legal license for intentional or unintentional misunderstandings of the appropriate scope of character encodings, which should be focussed on textual content, together with the minimal additional format control specification required for text organization. 
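The special status of the grandfathered C0 range is visible in the Unicode Character Database itself; as a small illustration using Python's unicodedata module:

```python
import unicodedata

# The C0 controls grandfathered from ASCII/ISO 6429: U+0000..U+001F.
c0 = [chr(cp) for cp in range(0x00, 0x20)]

# All carry General_Category Cc (control); they are not graphic characters.
assert all(unicodedata.category(ch) == "Cc" for ch in c0)

# None has a formal Unicode character name of its own (the old ASCII names
# such as "backspace" survive only as aliases), so name() raises ValueError.
try:
    unicodedata.name("\x08")
except ValueError:
    pass
else:
    raise AssertionError("control codes have no character name")
```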
Modern text representational practice, in a world that has mostly abandoned character terminal display to niche and legacy uses, and which instead uses graphic displays and image models, combined with rasterizing of outline fonts for textual display, has essentially made most of the ISO 6429 control functions obsolete. The Unicode Standard only specifies the few control functions that have survived into modern plain text handling conventions: CR, LF, FF, and tabs, among them. On the other hand, the Unicode plain text model has necessitated the addition of new format control characters that were not envisioned in the terminal control function sets, or which were organized differently for them. A good case in point are the various Unicode bidi control format characters, which are used for the bidirectional algorithm to override default implicit bidi ordering for various edge cases. Those differ from the bidirectional formatting control functions which were earlier designed for use on designated character terminals, with fixed-size cells and fixed line widths, for laying out visual order bidi text legibly via control of cursor position and direction when fed a serial byte stream to be laid out. Note that in any case, the old control functions (aimed at serial output devices) and the new
Re: *Why* are precomposed characters required for backward compatibility?
David Hopwood wrote: Marco Cimarosti wrote: BTW, they always sold me that precomposed accented letters exist in Unicode only because of backward compatibility with existing standards. I don't get that argument. It is not difficult to round-trip convert between NFD and a non-Unicode standard that uses precomposed characters. Round-trip convertibility of strings does not imply round-trip convertibility of individual characters, and I don't see why the latter would be necessary. Because while it is conceptually not difficult to roundtrip convert between legacy accented Latin characters and Unicode NFD combining character sequences, in practice many Unicode implementations would never have gotten off the ground if they had had to start with combining character sequences for all Latin letters, including, in particular, the 8859 repertoires. And the character mapping tables are considerably more complex, in practice, if they must map 1-n, n-1, rather than 1-1. Right now, a Latin-1 to Unicode mapping table is trivial, but if Latin-1 had not been covered with a set of precomposed characters, the mapping would *not* have been trivial, and that would have been a significant barrier to early Unicode adoption. And people would *still* be complaining -- vigorously -- about the performance hit and maintenance complexity of interoperating with 8859 and common PC code pages. The only difficulty would have been if a pre-existing standard had supported both precomposed and decomposed encodings of the same combining character. I don't think there are any such standards (other than Unicode as it is now), are there? Not to my knowledge. (Obviously, an NFD-only Unicode would not have been an extension of ISO-8859-1. That wouldn't have been much of a loss; it would still have been an extension of US-ASCII.) If this compatibility issue didn't exist, Unicode would be like NFD. And would have been much simpler and better for it, IMHO. 
It would have been better, in some respects, to treat Latin like the complex script it is, and to end up with the same kind of clean, by-the-principles encoding that Unicode has for Devanagari, essentially free of equivalences and normalization difficulties. But it took years for major platforms to get up to speed on complex script rendering, including the relatively simple but elusive prospect of dynamic application of diacritics to Latin letters (and/or mapping of combining character sequences to preformed complex glyphs). And despite the vigorous advocacy by some factions of early Unicoders to have a consistent, decomposed Latin representation in Unicode, there were some rather hard-headed decisions made early on (1989) that that approach would cripple what was then an experimental encoding. The inclusion of large numbers of precomposed Latin letters as encoded characters was the price for the participation of IBM, Microsoft, and the Unix vendors, and was also the price for the possibility of alignment of Unicode with an ISO international standard. Without paying those prices, Unicode would not exist today, in my opinion. --Ken
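The round-trip argument is easy to demonstrate with Python's unicodedata module (a sketch, using A-grave as a stand-in for the whole Latin-1 repertoire):

```python
import unicodedata

# A Latin-1 precomposed letter and its NFD combining character sequence.
a_grave = "\u00C0"                      # LATIN CAPITAL LETTER A WITH GRAVE
decomposed = unicodedata.normalize("NFD", a_grave)
assert decomposed == "\u0041\u0300"     # A + COMBINING GRAVE ACCENT

# NFC recomposes the sequence back to the single precomposed character,
# which is what keeps the Latin-1 <-> Unicode mapping a trivial 1-1 table.
assert unicodedata.normalize("NFC", decomposed) == a_grave
assert a_grave.encode("latin-1") == b"\xC0"
```

String-level round-tripping works either way; it is the 1-1 versus 1-n table complexity, not convertibility per se, that made precomposed characters the pragmatic choice.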
Re: Ending the Overington [debate]
David Hopwood responded to Michael Everson: people just keep saying that markup exists, as if the very existence of XML in some way precludes single code point colour codes and single code point formatting codes and so on. Yes, that is right. That is entirely right. No it isn't. Duplicating functionality between character encoding and markup is just a Bad Thing (usually). Agreed. And that is part of the reason for the existence of a Unicode Technical Report (and W3C Note) which tries to set guidelines on what is and is not appropriate to use in the context of markup. For those who haven't seen it, UTR #20: Unicode in XML and other Markup Languages: http://www.unicode.org/unicode/reports/tr20/ It is certainly not excluded a priori - as demonstrated by the interlinear annotation markers, stateful BiDi controls, and plane 14 language tags. Correct. But just because the line between plain text content and the kind of formatting or other presentational and/or annotational material is often difficult to firmly draw doesn't mean that we have open season to simply dump anything we want into character codes. On color, for example, there is clear consensus that encoding color by characters is way, way over the line into the kind of stuff which should be handled by markup (as for setting the text color on hyperlinks) or even by out-and-out graphics (as for display text elements). The existence of XML (or other markup languages) does not, ipso facto, preclude the character encoding committees from encoding single code point colour codes. Rather, the consensus among character encoding committees that text color is better handled by other layers of text (and non-text) presentation and is inappropriate for encoding as characters precludes them from making what would be utterly controversial and nonconsensual encoding decisions. 
I see that no-one in this thread has even attempted to explain why duplication of functionality across layers is a bad idea, or to discuss what alternative models would have been possible besides plain text + {HTML,SGML,XML,TeX}-style markup languages. I'll try to do that in another post. I'm looking forward to it. --Ken
Re: Multiple encodings for 1 character
Theodore wrote: What is going to be done about the confusion generated from having multiple ways to encode the same character? For example, for filenames, OSX will encode an accented Roman letter one way, while for filenames Windows will encode it the other way. These kind of confusions are totally expected, if Unicode will allow more than one way to encode the same character. Perhaps a stray newsfeed routed via Alpha Centauri? This is *very* old news, indeed. This means that matching algorithms won't work, because the characters are different! Will there be some kind of recommendation of which to avoid? Will the Unicode consortium make a standard to say that one of these encodings is strongly not recommended, and in fact depreciated? UAX #15: Unicode Normalization Forms http://www.unicode.org/unicode/reports/tr15/ And it is up to an implementation to specify which normalization form it uses. By the way, we don't depreciate Unicode encodings -- we appreciate them. ;-) And what about the OS that uses this encoding? How will the Unicode consortium make the newly-offending OS change its ways? It isn't offending, and the Unicode Consortium won't. --Ken
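The matching problem and its normalization cure can be shown in a few lines (a sketch; the filename is invented, and real macOS uses a variant of NFD for HFS+ filenames):

```python
import unicodedata

# The same filename as a decomposed (macOS-style) and a precomposed
# (Windows-style) string. Byte-for-byte comparison fails...
mac_style = "re\u0301sume\u0301.txt"    # e + COMBINING ACUTE ACCENT
win_style = "r\u00E9sum\u00E9.txt"      # precomposed e-acute
assert mac_style != win_style

# ...but comparing both under a single normalization form succeeds,
# which is exactly what UAX #15 is for.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc(mac_style) == nfc(win_style)
```

Which form an implementation picks matters less than picking one consistently before comparing.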
Re: What's the difference between a composite and a combining sequence?
Theodore, http://www.unicode.org/unicode/reports/tr15/ mentions both composites and combining sequences. But it doesn't tell us the difference. I know what a combining sequence is. If I didn't know what a composite was, I'd guess it was the same thing as a combining sequence. See TUS 3.0, Chapter 3, pp. 43-44 D17 Combining character sequence: a character sequence consisting of either a base character followed by a sequence of one or more combining characters, or a sequence of one or more combining characters. [e.g. A + combining-grave U+0041, U+0300] D18 Decomposable character: a character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the names list... It may also be known as a precomposed character or composite character. [e.g. A-grave, U+00C0] --Ken
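Definitions D17 and D18 correspond directly to properties queryable from the Unicode Character Database; a small illustration with Python's unicodedata:

```python
import unicodedata

# D18: a decomposable ("precomposed", "composite") character carries a
# decomposition mapping in UnicodeData.txt.
assert unicodedata.decomposition("\u00C0") == "0041 0300"   # A-grave -> A + grave

# D17: a combining character sequence is a base character followed by
# combining characters; combining marks have a non-zero combining class.
assert unicodedata.combining("\u0300") != 0   # COMBINING GRAVE ACCENT
assert unicodedata.combining("A") == 0        # base character
```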
Re: FW: Inappropriate Proposals FAQ
Suzanne, Can people from the review committee give me some hard and fast rules for when something is thrown out? As Michael Everson indicated, the answer to this is probably not. However, perhaps the most important thing for serious script proposers to do, to see if what they are concerned about might be acceptable, is to consult the Roadmap: http://www.unicode.org/roadmaps/ If a script is listed there in the Roadmap for the BMP or for Plane 1, then people can be assured that interested members of the encoding committees have *already* made a tentative determination that the script is suitable for encoding, although a proposal may not actually exist yet, and of course, there are no guarantees until the committees actually do the work on fully filled-out formal proposals. But if a script, like the MIIB BurgerKing cipher mentioned today, or chess diagram notation, is missing from the Roadmap, there is probably a *good* reason for it not to be there, and people should think twice (and then again) before they start proposing it for encoding in Unicode. --Ken Another missing example: The voice which shook the earth, from Chapter IV, verse 44 of LIBER LIBERI vel LAPIDIS LAZULI ADUMBRATIO KABBALÆ ÆGYPTIORUM, one of the Holy Books of Thelema: http://www.nuit.org/thelema/Library/HolyBooks/LibVII.html Disclaimer: The UTC New Scripts committee does not discriminate among script applicants on the basis of race, color, gender, religion, sexual orientation, national or ethnic origin, age, disability, or veteran status. However, if they are risible, we reserve the right to laugh. ;-)
Re: (long) Re: Chromatic font research
[*groans in the audience*] I know, I know -- another contribution in the endless thread... In re: The Respectfully Experiment I used it as evidence that ideas about what should not be included in Unicode can change over a period of time as new scientific evidence is discovered. Having been intimately involved in nearly all the decisions made about what was included in Unicode over the last 13 years, and also being formally trained as a scientist, I think I may be qualified to dispute this conclusion. Most of the change in ideas about what can be included in Unicode has been the result of two types of influence: A. The encountering of legacy practice in preexisting character encodings which had to be accommodated for interoperability reasons. This accounts for many, if not all of the hinky little edge cases where Unicode appears to depart from its general principles for how to encode characters. B. The development of new processing requirements that required special kinds of encoded characters. This accounted for strange animals such as the bidi format controls, the BOM, the object replacement character, and the like. There is a very narrow window of opportunity for *scientific* evidence contributing to this -- namely, the result of graphological analysis of previously poorly studied ancient or minority scripts, which conceivably could turn up some obscure new principle of writing systems that would require Unicode to consider adding a new type of character to accommodate it. But at this point, with Unicode having managed to encode everything from Arabic to Mongolian to Han to Khmer..., I consider it rather unlikely that scientific graphological study is going to turn up many new fundamental principles here. 
As a scientific *hypothesis* I think this surmise is proving to hold up rather well, as our premier encoder of historic and minority scripts, Michael Everson, has managed to successfully pull together encoding proposals, based on current principles in Unicode, for dozens more scripts, with little difficulty except for that inherent in extracting information about rather poorly documented writing systems. it just seems to me that some extra ligature characters in the U+FB.. block would be useful. Best practice, and near unanimous consensus in the Unicode Technical Committee and among the correspondents on this list, would be aligned with exactly the opposite opinion. In the light of this new evidence, I am wondering whether the decision not to encode any new ligatures in regular Unicode could possibly be looked at again. As others have pointed out, The Respectfully Experiment did not constitute new *evidence* of anything in this regard. In any case, the UTC is quite unlikely to look at that decision again. The exception that the UTC *has* considered recently was the Arabic bismillah ligature, and the reason for doing so again was the result of considering legacy practice. This thing exists in implemented character encodings as a single encoded character. And furthermore, it is used as a unitary symbol, in such a way that substituting out an actual (long) string of Arabic letters and expecting the software to ligate it correctly precisely in the contexts where it was being used as a symbol, would place an unnecessary burden on both users and on software implementations. That is *quite* different from the position that claims that one, two, or dozens more Latin ligatures of two letters need to be given standard Unicode encodings. if it cannot be done or would cause great anguish and arguments, well, that is that, forget it. Good idea. --Ken
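The existing ligature characters in the U+FB00 block illustrate why new ones are not wanted: they exist only for legacy round-tripping, and normalization folds them back to plain letter pairs. A short demonstration with Python's unicodedata:

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI has a compatibility decomposition to
# the plain letter pair; NFKC folds it back to f + i, which is why
# ligation is treated as a font/rendering matter, not a content one.
fi = "\uFB01"
assert unicodedata.decomposition(fi) == "<compat> 0066 0069"
assert unicodedata.normalize("NFKC", fi) == "fi"
```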
Re: ZWJ and Latin Ligatures (was Re: (long) Re: Chromatic font research)
James Kass said: One problem with TR28 is that it is worded so that it appears to be in addition to earlier guidelines. It is. The way this works is as follows: The original decision about the ZWJ as request for ligation was documented in the Unicode 3.0.1 update notice. That documentation was rolled forward into UAX #27 (Unicode 3.1), where it was explicitly cast as text to replace the Unicode 3.0 text on p. 318 re Controlling Ligatures, including an update of the example table. The additional text in UAX #28 is just that -- an *addition* to the Unicode 3.1 text, not a replacement for it. This will all become more apparent when we can finally publish Unicode 4.0, which will roll all of the textual additions, once again, into a single published document. This implies that the examples used in TR27, for one, are still valid. They are. In TR27, font developers are urged to add things like f+ZWJ+i to existing tables where f+i is already present. That recommendation still stands -- and, as John pointed out, is being implemented by vendors. Another problem with TR28 is that its date is earlier than the date on TR27. This suggests that TR27 is more current. I don't understand this claim. The date on UAX #27 is: 2001-05-16 The date on UAX #28 is: 2002-03-07 Please check that you are referring to the most recent (and only valid) versions of each. Otherwise, regarding the substance of this thread, I find myself in violent agreement with John, who it seems to me is quite ably stating the case for the current treatment as decided by the UTC. --Ken
Re: Chromatic font research
Philipp said: The most obvious and simple example for glyph colours with semantic meaning that I can think of appears to be encoding characters for national flags (something that might even be considered proposable). As *characters*? Why? What is this bug that people catch, which induces them to consider all things semiological to be, ipso facto, abstract characters suitable for encoding in Unicode? There are signs that are not characters. There are symbols that are not characters. There are icons that are not characters. There are significant gestures that are not characters. There are meaningful looks that are not characters. There are color significances that are not characters. There are pregnant pauses that are not characters... And I'm quite positive that Aztec can safely be considered writing... Aztec is clearly a language. Whether or not the Aztec codices are appropriate to represent in plain text remains to be seen. As yet, we have no proposal, let alone one which addresses the potential problems in detail. --Ken
Re: Hexadecimal characters.
At 03:03 AM 6/20/02 -0400, Tom Finch wrote: I wish to propose sixteen consecutive digits for the purpose of displaying hexadecimal values. [...] Has this been considered? [David Starner] I seem to recall that it has. The problem is, they're just new copies of old characters. An A used in hexadecimal notation is just an A. Besides the problem with normalization, you have the problem with all look-alike characters - people won't use them consistently. Even if this got adopted, 99% of time you looked at hexadecimal numbers, they would be in plain old ASCII, so you don't really gain anything but confusion. It's a no-go. [Tom Finch] I looked at the code chart and there are many 16 character sequences empty. That is true enough -- but the more appropriate place to look is the BMP roadmap: http://www.unicode.org/roadmaps/bmp-3-6.html where you can see that many of those empty columns are already accounted for by roadmapped allocations for living minority scripts. The BMP is rather tight now for allocation, and it is unlikely that the committees are going to look kindly on miscellaneous collections of dubious stuff for encoding there. Of course there is plenty of space in Plane 1 for just about everything, but... That said, David Starner has this one right. There really is no good reason to create clones of 0..9, A..F to represent hexadecimal digits. The existing characters do that just fine, and represent an overwhelming legacy data representation precedent that any proposal such as Tom Finch's would have to cope with. Introducing new characters for these would just introduce confusion and would be unlikely to be implemented in any useful way. --Ken
Re: Chess symbols, ZWJ, Opentype and holly type ornaments.
In view of the fact that some people are unwilling to let my ideas be discussed in this forum upon their academic merit but simply use an ad hominem attack almost every time I post (before many people can have the chance to sit down and, if they wish, have a serious read of my ideas), when it seems that their objection is really about the Unicode Consortium having included the word published in section 13.5 of chapter 13 of the Unicode specification, ... Speaking here as an editor of the Unicode Standard, I do not find the word published in section 13.5 of the book. Perhaps William was thinking of the subheader Promotion of Private-Use Characters. Since -- despite the explicit text that follows in that section -- some people seem to be getting the wrong idea about private-use character assignments as a step towards standardization, it is quite likely that the editorial committee will be rewriting that section for Unicode 4.0, to provide further clarification for users. I feel that the fact that I am trying to use the Unicode specification as it exists rather than on some nudge nudge wink wink understanding of how some people feel that it should be interpreted is at the root of the problem. If parts of the Unicode Standard are unclear and are leading to misinterpretations or incompatible interpretations of how characters should be used -- including private-use agreements for private-use characters, then airing those issues is certainly germane to this discussion list. I think what a number of people on the list have been hinting -- or openly stating -- is that prolixity is not a virtue on an email list when trying to convey one's ideas. --Ken
Re: Hexadecimal characters.
Tom Finch said: Hmm, so representing Devanagari digits is more important than hexadecimal, which is used almost more than decimal on the web?

I think you may be misconstruing the purpose of the character encoding here. If I want to represent the hexadecimal numbers 0x60DB 0x618A in email, or in HTML hexadecimal NCRs, or whatever, guess what -- I can use ASCII (or Latin-1 or Unicode) characters: 6 0 D B 6 1 8 A -- and that is what everyone does. It is also what is *required* by the HTML and XML standards for the representation of hexadecimal NCRs on the web, by the way. If I want to represent Devanagari digits, on the other hand, I don't have an ASCII representation to hand -- those *require* separate encoding, since Devanagari characters are not the same as Latin characters or Arabic digits. So Devanagari digits were encoded in Unicode. Simple.

I know inertia is a law of the universe, but this is ridiculous. Hexadecimal is very important and deserves to be in Plane 0.

Umm. It *is* in Plane 0: U+0030..U+0039, U+0041..U+0046 (and U+0061..U+0066), to be exact.

I see a good spot in misc technical (23D -- oh look, hexadecimal again).

Nobody has any quarrel with the notion that hexadecimal notation is very important in computer science -- and vital for character encoding discussions. The issue is whether we need any separate characters to represent hexadecimal digits, when the digits everybody has been using for decades are already encoded.

--Ken
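[For reference, a short Python sketch of the point about hexadecimal NCRs, using the two code points mentioned in the message (illustrative, not from the original exchange): the NCRs are themselves pure ASCII.]

```python
# HTML/XML numeric character references in hexadecimal form are written
# entirely with ASCII characters -- e.g. for U+60DB and U+618A:
for cp in (0x60DB, 0x618A):
    ncr = "&#x{:X};".format(cp)
    assert ncr.isascii()            # nothing beyond plain ASCII is needed
    print(ncr)                      # &#x60DB;  then  &#x618A;
```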
Re: Chess symbols, ZWJ, Opentype and holly type ornaments.
IOW, brevity's wit's soul.

Well-spoken, dear Polonius. But better to
Adorn the soul of wit so briefly put to us.

My liege, and madam, to expostulate
What majesty should be, what duty is,
Why day is day, night is night, and time is time,
Were nothing but to waste night, day, and time.
Therefore, since brevity is the soul of wit,
And tediousness the limbs and outward flourishes,
I will be brief. Your noble son is mad.

--the Bard
Re: Q: How many enumerated characters in Unicode?
Adam asked: How many characters does the current version of the Unicode Standard enumerate?

95,156.

BTW: I think this information would be useful if it were always included in the summary of each revision.

Agreed. The total was listed in Unicode 3.1 (94,140), and you could get the number for Unicode 3.2 by adding the 1,016 additions to that, but it was an oversight not to actually list the total in the text of Unicode 3.2.

--Ken
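[The arithmetic Ken describes, as a one-line check (illustrative only):]

```python
# Unicode 3.1 enumerated 94,140 characters; Unicode 3.2 added 1,016 more.
unicode_3_1_total = 94_140
additions_in_3_2 = 1_016
print(unicode_3_1_total + additions_in_3_2)  # 95156, the Unicode 3.2 total
```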
Re: Fixed position combining classes (Was: Combining class for Thaicharacters)
Peter,

On 06/02/2002 05:40:05 AM, Samphan Raruenrom wrote:

My opinion is that they should have been simplified, but that setting the bulk of them to 0 was a mistake and creates some significant problems (which go a step beyond the questions you raise here). Can you elaborate on this?

Given the characters:

0E35;THAI CHARACTER SARA II;Mn;0
0E39;THAI CHARACTER SARA UU;Mn;103

consider the sequences <0E35, 0E39> vs. <0E39, 0E35>. I'm guessing your first reaction will be to say that these cannot co-occur. That is true for the Thai language, but may not be true for other languages written with the Thai script.

The problem, of course, is that not all eventualities could be foreseen at the time the decisions had to be made -- when normalization and Unicode 3.0 were looming. It might have been possible to marginally improve on the assignments that eventually were made -- but both the original assignment to fixed position classes, and the later simplification of the fixed position classes, had to be made *prior* to the accumulation of experience based on normalization being locked down in the standard. So hindsight is 20/20. But at the time, the editors and participants in the UTC couldn't get experts to pay enough attention to the potential implications for Thai and other Southeast Asian scripts, so now we are stuck with a few anomalies that people will just have to program around, I am afraid.

Now, the problem with the sequences above is that they are visually indistinguishable, meaning that they could not possibly be used by users for a semantically relevant distinction. From the user's perspective, they are identical. Moreover, it would fit a user's expectations to have string comparisons equate them (e.g. a search for <0E35, 0E39> should find a match if the data contains <0E39, 0E35>). They are both canonically-ordered sequences, however, since U+0E35 has a combining class of 0.
The result is that string comparisons relying on normalisation into any one of the existing Unicode normalisation forms (NFD, NFC, NFKD, NFKC) will fail to consider these as equal.

I think you are missing a point here. It is true that if you just take the two strings, normalize them, and then compare them binarily, they will compare unequal. But for most users' expectations of equivalent string comparison, simply comparing normalized strings byte for byte is insufficient anyway. There may be embedded (invisible) format control characters (ZWJ and its ilk) which should be ignored in comparison -- but a simple binary compare won't do that. The presence of a ZWSP might or might not be considered indicative of a string difference by a user, but it would definitely cause the strings to compare unequal without any corresponding visual difference. On the other hand, the presence of some types of visible punctuation might be considered insignificant by a user, and thus to be ignored, even though it causes a visual difference.

The ordinary way to deal with this is to enhance the comparisons, often in language-specific ways, to match user expectations of what should and should not compare equal under various circumstances. A commonly used technology for that is one form or another of collation tailoring for culturally expected string comparison. If such technology is being used to provide better results, there is no particular reason why the language-specific tailorings for it cannot also take into account the few anomalous cases resulting from canonical ordering of dependent vowels in Brahmi-derived scripts in Southeast Asia, so that, under those circumstances, <0E35, 0E39> vs. <0E39, 0E35> *would* compare equal.

IMO, it would be best if we could change that. But apart from that, it would still be useful to note what is right or wrong, rather than say nothing about it. After all, this happens to other (Indic) scripts too, right?

There are some similar problems in at least Lao, Khmer and Myanmar.
I don't recall for certain, but there may also be similar problems in Hebrew. Each of these cases is fairly limited and amenable to the same kinds of solutions, script by script and language by language. In any case, I think one is going to need some rather specific string comparison extensions to get Khmer and Myanmar string orderings and matchings to behave appropriately. And the people who need to make those extensions aren't going to be particularly misled by the few anomalous instances of above- or below-base vowel signs having zero combining classes, which make it technically possible to have non-canonically-equivalent spellings of visually similar combinations.

--Ken
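[The Thai anomaly discussed above can be verified with Python's standard unicodedata module -- a minimal sketch for illustration, not part of the original exchange:]

```python
import unicodedata

# U+0E35 SARA II has combining class 0, while U+0E39 SARA UU has class 103,
# so both orderings below already count as canonically ordered, and no
# normalization form will equate them -- even though they render identically.
a = "\u0E35\u0E39"
b = "\u0E39\u0E35"

assert unicodedata.combining("\u0E35") == 0
assert unicodedata.combining("\u0E39") == 103

for form in ("NFD", "NFC", "NFKD", "NFKC"):
    # The visually indistinguishable sequences still compare unequal.
    assert unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
print("not canonically equivalent")
```

Equating such sequences therefore has to happen in the comparison layer (e.g. a collation tailoring), exactly as Ken suggests.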
RE: How is UTF8, UTF16 and UTF32 encoded?
Rick Cameron asked: The Unicode Standard 2.0 had a table in Appendix A that is, I think, just what you're asking for. I can't find this table in the online version of TUS 3.0 (it's not very useful that the online index gives page numbers when there's no way to map a page number to the appropriate chapter!). Does anyone know whether this table (A-3 on page A-7) is available online somewhere?

Table A-3 from Unicode 2.0 moved into Chapter 3 in Unicode 3.0, since UTF-8 was itself formally incorporated into Unicode conformance at that point. See Table 3-1 on page 47 of Unicode 3.0. (Unfortunately, access to the table was not clearly indicated under the UTF-8 entry in the index to Unicode 3.0 -- an oversight that will definitely be fixed for Unicode 4.0.) You can find it online in Chapter 3 of the online text of Unicode 3.0 at: http://www.unicode.org/unicode/uni2book/u2.html

The surrounding text for Table 3-1 was modified for Unicode 3.1, so you can find the table online again in Unicode 3.1: http://www.unicode.org/unicode/reports/tr27/ (See Article III, Conformance, in that UAX.)

And finally, Unicode 3.1 added a subsidiary table of Legal UTF-8 Byte Sequences. That table was modified slightly for Unicode 3.2, so the most up-to-date version online can be found in Unicode 3.2: http://www.unicode.org/unicode/reports/tr28/ (See Article III, Conformance, in that UAX.)

--Ken
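[On the thread's subject line itself, the three encoding forms can be compared directly in Python -- an illustrative sketch, not from the original message:]

```python
# Encode one BMP character (U+00E9) and one supplementary character
# (U+10400) in each of the three Unicode encoding forms.
for ch in ("\u00E9", "\U00010400"):
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print("U+%04X" % ord(ch), enc, ch.encode(enc).hex(" "))
# U+10400 takes four bytes in UTF-8, a surrogate pair (D801 DC00) in
# UTF-16, and a single 32-bit code unit in UTF-32.
```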