Re: Four characters from Greek Extended block missing?
On Fri, 16 Feb 2001, Otto Stolz wrote:

> So the questions are:
> - are the above-mentioned lower-case upsilon composites useless,
>   and entered Unicode only by an oversight, or
> - are their upper-case equivalents missing by an oversight, or
> - is there indeed a rationale for this anomaly?

The Upsilons with smooth breathings are unacceptable word-initially in Attic; the only exception I found in Liddell-Scott-Jones was the old name of the letter itself, U)=. The lowercase glyph is acceptable in Attic, because it can occur as the second letter of an initial diphthong; in old typographies where all-caps words had accents, this can also occur with capital upsilon. Upsilon with smooth breathing can additionally occur word-initially in other dialects, but these two cases are rare enough that no standard has rushed to include it.

In our corpus of 76 million words of Greek, initial capital upsilon with a smooth breathing occurs 37 times; lower-case upsilon with a smooth breathing occurs 373 times. With epigraphical data, this will obviously be more frequent.

--
Nick Nicholas. TLG, UCI, USA. [EMAIL PROTECTED]; www.tlg.uci.edu/~opoudjis
Many among their proselytes had sold their lands and houses to increase the public riches of the sect --- at the expense, indeed, of their unfortunate children, who found themselves beggars because their parents had been saints. (Edward Gibbon, _Decline and Fall_.)
Unicode character encoding statistics
BTW, if anyone was wondering where I came up with the figure 880,325 reserved unassigned code points for Unicode 3.1, here are the complete statistics for Unicode 3.0 and Unicode 3.1:

Unicode:                   U 3.0     U 3.1

BMP Alphas/Symbols         10236     10238
Suppl Alphas/Symbols                  1691
Han (URO)                  20902     20902
Han (Ext A)                 6582      6582
Han (Ext B)                          42711
Han Compat                   302       302
Suppl Han Compat                       542
Hangul Syllables           11172     11172
Subtotal                   49194     94140

BMP Private Use             6400      6400
Suppl Private Use         131068    131068
Surrogate Code Points       2048      2048
Controls                      65        65
BMP Noncharacters              2        34
Suppl Noncharacters           32        32
BMP Reserved                7827      7793
Suppl Reserved            917476    872532

The total number of code points accounted for here is 1,114,112 (= 17 x 64K), i.e. U+0000..U+10FFFF.

--Ken
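The arithmetic behind these figures can be checked mechanically. The sketch below transcribes the rows of the table into a Python dictionary (the name `ROWS` and the helper `column_total` are illustrative only, and a blank cell is assumed to mean zero), then confirms that each column adds up to 17 x 64K and that the two Unicode 3.1 "Reserved" rows give the 880,325 figure.

```python
# Rows transcribed from the table as (Unicode 3.0, Unicode 3.1) counts.
# Assumption: a blank cell in the table means 0. The Subtotal row is
# deliberately left out so that nothing is counted twice.
ROWS = {
    "BMP Alphas/Symbols": (10236, 10238),
    "Suppl Alphas/Symbols": (0, 1691),
    "Han (URO)": (20902, 20902),
    "Han (Ext A)": (6582, 6582),
    "Han (Ext B)": (0, 42711),
    "Han Compat": (302, 302),
    "Suppl Han Compat": (0, 542),
    "Hangul Syllables": (11172, 11172),
    "BMP Private Use": (6400, 6400),
    "Suppl Private Use": (131068, 131068),
    "Surrogate Code Points": (2048, 2048),
    "Controls": (65, 65),
    "BMP Noncharacters": (2, 34),
    "Suppl Noncharacters": (32, 32),
    "BMP Reserved": (7827, 7793),
    "Suppl Reserved": (917476, 872532),
}


def column_total(index):
    """Sum one column: index 0 is Unicode 3.0, index 1 is Unicode 3.1."""
    return sum(counts[index] for counts in ROWS.values())


# Reserved, unassigned code points in Unicode 3.1 (BMP + supplementary).
reserved_31 = ROWS["BMP Reserved"][1] + ROWS["Suppl Reserved"][1]

print(column_total(0), column_total(1), reserved_31)
```

Both columns should come out to 17 * 65536, which is the size of the code space U+0000..U+10FFFF.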
Re: Four characters from Greek Extended block missing?
Otto Stolz asked: > in the Greek Extended block, five of the lower-case characters > do not have upper-case equivalents, viz. > U+1FE4 GREEK SMALL LETTER RHO WITH PSILI > U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI > U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA > U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA > U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI > > However, the missing upsilon variants escape my understanding: > - word-initial upsilon (both lower-case and upper-case) must take > a breathing mark, > - medial and final upsilons do not take breathing marks. > So, you will either need both sorts of marks on both cases, > or you will need only dasia on both cases (I do not remember any > word starting with psili-upsilon, but my Greek is rather rusty). > > So the questions are: > - are the above-mentioned lower-case upsilon composites useless, > and entered Unicode only by an oversight, or No. Initial upsilon with PSILI (smooth breathing) is exceedingly rare in classical Greek, but it does occur. I find exactly two instances in my copy of the intermediate Greek-English Lexicon (Liddell and Scott): One entry showing 1F54 ~ 1F56 meaning "sound to imitate a person snuffing a feast" [sic]. And one head entry in caps showing ,'YRXA meaning "a jar, for pickles". Clearly these are both "funny" words. The first is onomatopoetic, and the second is probably a borrowing of some sort from a non-Greek language. The vast preponderance of upsilon-initial words in classical Greek have rough breathings. No doubt someone with access to more extensive classical and Byzantine Greek lexica might turn up a few other instances, including, I am guessing, instances of 1F50 and 1F52. > - are their upper-case equivalents missing by an oversight, or I don't think so. > - is there indeed a rationale for this anomaly? 
The entire 1FXX set was provided by ELOT, the Greek national body, and they had prescriptive, as well as descriptive intent in choosing the set that they did. I suspect that they thought that uppercase initial upsilon with a smooth breathing would not fit their orthographic rules for polytonic Greek (although there are instances of it in print, as in the uppercase head entry in Liddell and Scott for "pickle jar"). And in any case, by use of the spacing breathing/accent combinations U+1FCE, etc., plus regular uppercase upsilon, you can represent any of the missing letters, anyway. (As I have done above for the all caps pickle jar entry.) > Note that the code-points where you would expect these upper-case > upsilon compositions, viz. U+1F58 U+1F5A U+1F5C U+1F5E, are left > unassigned (reserved). > > Can anybody shade some light on this anomaly: either explain the > underlying rationale, or acknowledge the oversight? The Unicode take on this is that the entire block U+1F00..U+1FFE of precomposed polytonic Greek is unnecessary, since it is all decomposable into the regular Greek alphabet and a small number of accents. There clearly would be no benefit at this point in adding in the 4 (or 5) "missing" polytonic Greek characters, since in *all* Unicode normalization forms they would end up being decomposed into the already existing combining character sequences that can be used to represent them now without any character additions. --Ken
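The decomposition argument can be illustrated with Python's standard `unicodedata` module. This is only a sketch: it uses whatever Unicode version ships with the interpreter, not Unicode 3.0, and it uses the combining marks U+0313 and U+0314 rather than the spacing forms such as U+1FCE mentioned above. The variable names are illustrative.

```python
import unicodedata


def code_points(text):
    """Render a string as a list of U+XXXX labels."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# The precomposed lower-case letter decomposes to upsilon + combining psili.
small_psili_nfd = unicodedata.normalize("NFD", "\u1F50")

# Capital upsilon + combining psili has no precomposed form,
# so NFC leaves the two-character sequence as it is.
capital_psili_nfc = unicodedata.normalize("NFC", "\u03A5\u0313")

# Capital upsilon + combining dasia does have one (U+1F59), so NFC composes it.
capital_dasia_nfc = unicodedata.normalize("NFC", "\u03A5\u0314")

# General category "Cn" means the code point is not assigned.
slot_1f58_category = unicodedata.category("\u1F58")

print(code_points(small_psili_nfd))
print(code_points(capital_psili_nfc))
print(code_points(capital_dasia_nfc))
print(slot_1f58_category)
```

The capital letter with smooth breathing is therefore representable today as a combining sequence, which is the point made above about there being no benefit in adding the "missing" characters.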
Re: [very OT] Documentation: beyond 65,536 ; misc Semitic ?s
Elaine Keown asked: > Within the book, Unicode 3.0, is there somewhere a long section I > missed about all the stuff that happens beyond the "first 65,536," > in addition to surrogate stuff? No. > Is there other documentation somewhere? Yes -- in the next version of the standard. See: http://www.unicode.org/unicode/reports/tr27/ and http://www.unicode.org/charts/draftunicode31/ > > Today are there still 7,827 unused code values? Actually, there are 880,325 reserved unassigned code points (7,793 on the BMP and 872,532 on the supplementary planes). > Will they be unassigned until version 4.0 gels? No. Unicode 3.1 has already been approved, and is in the last stages of publication. After that, Unicode 3.2 will appear, adding over 1000 more characters to the BMP. Unicode Version 4.0 is beyond that, and will, no doubt, add another collection of characters. > > Also, is there a linguistic index to Unicode character > database files, saying which mention Semitic languages? No. But simple tools like grep enable you to pull out all instances of ARABIC, HEBREW, or SYRIAC characters, if you want. > > And finally, is there documentation somewhere on whether 3.0 > has complete symbols for the 18 languages written in Arabic > script that are mentioned in the book? I presume you are talking about letters and points, rather than symbols per se. The consortium doesn't have any explicit language-by-language listing of Arabic alphabets and their correlation with the encoded characters. However, the UTC does consider the current encoding to be complete for the languages that are explicitly mentioned, as well as for many others written with the Arabic script that are not explicitly mentioned. --Ken
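As a rough equivalent of the grep suggestion, the following sketch filters character names by a script keyword. It is an assumption-laden stand-in: it reads the character database bundled with Python (via `unicodedata`) instead of the UnicodeData.txt file, it only scans the BMP by default, and the function name `characters_named` is made up for this example.

```python
import unicodedata


def characters_named(keyword, limit=0x10000):
    """Return (code point, name) pairs whose character name contains keyword.

    Mirrors a plain substring grep over the name field; code points
    without a name (unassigned, surrogates, controls) are skipped.
    """
    hits = []
    for cp in range(limit):
        name = unicodedata.name(chr(cp), "")
        if keyword in name:
            hits.append((cp, name))
    return hits


hebrew = characters_named("HEBREW")
arabic = characters_named("ARABIC")
syriac = characters_named("SYRIAC")

for cp, name in hebrew[:5]:
    print(f"U+{cp:04X} {name}")
```

The counts returned depend on the Unicode version of the interpreter, so they will not match the Unicode 3.0 repertoire discussed here.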
Four characters from Greek Extended block missing?
Hello,

in the Greek Extended block, five of the lower-case characters do not have upper-case equivalents, viz.

  U+1FE4 GREEK SMALL LETTER RHO WITH PSILI
  U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI
  U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
  U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
  U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI

The Rho with psili is indeed only needed in lower-case:
- word-initial rho (upper-case or lower-case) takes a dasia,
- a double-rho within a word can be adorned with a psili and a dasia, which is not done in upper-case typing,
- no other medial or final rho takes a breathing mark.

However, the missing upsilon variants escape my understanding:
- word-initial upsilon (both lower-case and upper-case) must take a breathing mark,
- medial and final upsilons do not take breathing marks.

So, you will either need both sorts of marks on both cases, or you will need only dasia on both cases (I do not remember any word starting with psili-upsilon, but my Greek is rather rusty).

So the questions are:
- are the above-mentioned lower-case upsilon composites useless, and entered Unicode only by an oversight, or
- are their upper-case equivalents missing by an oversight, or
- is there indeed a rationale for this anomaly?

Note that the code-points where you would expect these upper-case upsilon compositions, viz. U+1F58 U+1F5A U+1F5C U+1F5E, are left unassigned (reserved).

Can anybody shed some light on this anomaly: either explain the underlying rationale, or acknowledge the oversight?

Best wishes,
Otto Stolz
Re: Surrogate space in Unicode
Tom Lord asked:

> > It has proven difficult to come up with convenient terms for
> > the Unicode characters encoded at U+10000 and beyond.
> > []
> > 2. A 'basic' code point, which may represent a 'basic
> > character', can range from U+0000 through U+FFFF.
> >
> > For what purpose is such a distinction needed?
>

And Doug Ewell answered:

> It is needed because of UTF-16, which requires two 16-bit code points to
> represent a character with a value of U+10000 or higher (a supplementary
> character) but only one 16-bit code point to represent a basic character.

This is correct, except that it is two 16-bit code *units* required to represent supplementary characters.

For the UTF-32 encoding form, there is nothing special about supplementary characters (characters whose Unicode scalar value, i.e. code point, is between 0x10000 and 0x10FFFF), except that they've only recently started to be standardized.

For the UTF-8 encoding form, supplementary characters are represented in 4 bytes, while basic characters are represented in 1, 2, or 3 bytes. This could have an implication for an implementation, although proper UTF-8 implementations should already be handling them correctly. The big issue is for UTF-8 implementations that *incorrectly* handle supplementary characters as sequences of two 3-byte representations of surrogate code points. In order to talk meaningfully about those issues, a terminological distinction between basic and supplementary characters is useful.

For the UTF-16 encoding form, as Doug pointed out, the difference is between 1 code unit versus 2 code units for representation of a code point. That distinction is rather significant for many Unicode implementations, and again a terminological distinction is useful.

Finally, for comparison to ISO/IEC 10646, it is also useful to have a terminological distinction that lines up with the international standard.
10646 has settled on the term "supplementary planes" to refer to Planes 1 through 16, so the use of the term "supplementary character" in Unicode to refer to characters encoded on the supplementary planes makes it easier to understand what is intended, no matter which of the two standards you are coming from. > > Many descriptions on the Web erroneously claim that Unicode contains only the > first 64K characters of ISO 10646. Even the Unicode Standard Version 3.0 > states, "Plain Unicode text consists of sequences of 16-bit character codes." > To me this sentence is very misleading and requires that special attention > be paid to the nature of supplementary characters, those to be assigned in > Unicode 3.1 and those to be assigned in future versions. That sentence will be updated eventually. The critical piece of text in the standard is conformance clause C1 on page 37, which currently reads: "C1 A process shall interpret Unicode code values as 16-bit quantities. * Unicode values can be stored in native 16-bit machine words." In Unicode 3.1, about to be published in UAX #27, that wording is being changed to: "C1 A process shall interpret the Unicode code units in accordance with the Unicode Transformation Format used. * The Unicode Standard defines code points (scalar values) that can be encoded in any of three transformation formats (encoding forms): UTF-8, UTF-16, or UTF-32." The PDUTR #27 text currently accessible on the website does not yet show this change, which was just accepted at the recent UTC meeting, but expect an updated text for what will eventually become UAX #27 to show up on the site in approximately a week. --Ken
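The per-encoding-form differences described in this message can be seen with Python's built-in codecs. This is a minimal sketch, not anything from the standard itself: U+10000 is picked arbitrarily as a sample supplementary character, and the "incorrect" six-byte form is produced on purpose with the `surrogatepass` error handler purely to show what such output looks like.

```python
# One supplementary character, encoded in each of the three encoding forms.
ch = "\U00010000"

utf8 = ch.encode("utf-8")        # one 4-byte sequence
utf16 = ch.encode("utf-16-be")   # two 16-bit code units (a surrogate pair)
utf32 = ch.encode("utf-32-be")   # one 32-bit code unit

utf16_units = len(utf16) // 2
utf32_units = len(utf32) // 4

# Basic characters take 1, 2, or 3 bytes in UTF-8.
basic_lengths = [len(c.encode("utf-8")) for c in "A\u00E9\u20AC"]

# The *incorrect* handling mentioned above: each surrogate code unit is
# converted separately into its own 3-byte sequence, giving 6 bytes.
surrogates = [int.from_bytes(utf16[i:i + 2], "big") for i in (0, 2)]
incorrect_utf8 = b"".join(
    chr(s).encode("utf-8", "surrogatepass") for s in surrogates
)

print(utf8.hex(" "), utf16.hex(" "), utf32.hex(" "))
print(basic_lengths, incorrect_utf8.hex(" "))
```

A conforming decoder rejects the six-byte form, so `incorrect_utf8.decode("utf-8")` raises `UnicodeDecodeError` in Python.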
RE: Unicode Transcriptions
Thomas Chan noted: > Yes, you are right about this. I don't know why TUS3.0 p. 278 says "The > character U+3127 BOPOMOFO LETTER I is usually written as a vertical > stroke when Bopomofo text is set vertically.", which is *wrong*. This is a x/y axis dyslexia that set in when a text correction was misapplied to the text. I am reporting it to errata. --Ken
Re: Surrogate space in Unicode
In a message dated 2001-02-16 0:19:01 Pacific Standard Time, [EMAIL PROTECTED] writes:

> Because of the widespread belief that Unicode stops at U+FFFF,
> many fonts and applications that claim to support Unicode can
> only handle basic characters, not supplementary characters.
>
> Right. (Is it really a widespread belief? That's something I've
> been wondering.)

Well, [EMAIL PROTECTED] seems to think so:

> > Many descriptions on the Web erroneously claim that Unicode contains only the
> > first 64K characters of ISO 10646.
>
> Well, AFAICT it's true.
>
> At some point in the future I suppose it will cease to be true, but if you
> say "is" you should be talking about the present.

Unicode has been defined as ranging from U+0000 to U+10FFFF for several years now. The fact that no characters have been assigned beyond U+FFFF before Unicode 3.1 (which is still in beta) does not change this.

> > Because of the widespread belief that Unicode stops at U+FFFF, many fonts and
> > applications that claim to support Unicode can only handle basic characters,
> > not supplementary characters.
>
> The code I wrote is like that, and it'll remain like that for as long as
> that's all that can be tested and used in real life.

You can already test private-use characters in the U+F0000 and U+100000 ranges. Saying that your code shouldn't have to work with characters beyond U+FFFF because no such characters have been assigned yet is like saying it shouldn't have to support U+20B0 through U+20CF. You know characters will be assigned to that range some day, possibly sooner than you think.

Back to [EMAIL PROTECTED]:

> So using the plain english term "basic" to describe that subset
> of Unicode is misleading.
>
> I agree with you that the language in the standard needs updating.

I think that has been tried already, and 'basic' was the best anyone could do. Terms involving 'planes', such as 'BMP' and 'supplementary planes', are discouraged because planes per se are not part of Unicode, only ISO/IEC 10646.
I personally don't like 'basic' and 'supplementary' because they seem to imply that the first 64K code points are better in some way, but the most important thing is that the terminology remain consistent, even if flawed. -Doug Ewell Fullerton, California
Re: Surrogate space in Unicode
In a message dated 2001-02-16 7:56:12 Pacific Standard Time, [EMAIL PROTECTED] writes:

> It's clearer, but misses what I understand to be the absolutely crucial
> distinction between a code point (correctly defined) and a code unit
> (mentioned by Mark but not by Doug). For what a code unit is, see
> http://www.unicode.org/unicode/reports/tr17

I didn't mention code units because, embarrassingly, I am still having a hard time telling the difference between code points and code units. I have read UTR #17 many times and am still somewhat confused. I'll try again.

> I would question whether 'surrogate code points' are really code points. In
> the sense that they are a subset of 'code points' as defined, I guess they
> are; but they are not only unlike every other code point in that they "do
> not directly represent characters", they are explicitly and inexorably
> disqualified from so doing, being reserved for use, in pairs, as UTF-16 code
> units. (Which is what Mark said, of course.)

I think they would still be code points, just like 0xFFFE and 0xFFFF (and now others) which are guaranteed never to be characters, for a different reason.

> Looked at in this way, surely it makes it clearer that the transcoding of a
> surrogate (code point) into UTF-8 is an abomination.
>
> Simplification is all very well, but it can be taken too far, as when
> important distinctions are lost.

Yes, that is true. I might have known better than to respond to a "cut the mumbo-jumbo" post. Einstein said, "Everything should be made as simple as possible, but not one bit simpler," and I think that is especially true when working with standards and specifications, where precise and unambiguous wording is crucial.

-Doug Ewell
Fullerton, California
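One way to see that surrogates and noncharacters are both still code points, just code points of particular kinds, is to look at their General Category. The sketch below uses Python's `unicodedata` module; the helper `describe` and its wording are invented for this illustration, and the data comes from the interpreter's Unicode version rather than Unicode 3.1.

```python
import unicodedata

LABELS = {
    "Cs": "surrogate code point",
    "Co": "private-use code point",
    "Cn": "unassigned code point or noncharacter",
}


def describe(cp):
    """Classify a code point by its General Category (illustrative helper)."""
    category = unicodedata.category(chr(cp))
    return LABELS.get(category, f"assigned character ({category})")


for cp in (0x0041, 0xD800, 0xDFFF, 0xFFFE, 0xFFFF, 0xF0000):
    print(f"U+{cp:04X}: {describe(cp)}")
```

Every value passed to `describe` is a valid code point in the range 0 to 0x10FFFF; only some of them can ever represent characters.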
Myanmar questions
Hi folks,

I am looking at Burmese, beg your pardon, Myanmar, and I can find answers to most of my questions in my available sources; however, I still have some unanswered ones.

1) For the "au" dependent vowel, I believe (extrapolating from the one for "o") the correct encoding is U+1031 U+102C U+1039. However, the use of the virama inside of a "matra" part looks surprising to me (and it creates problems for my renderer).

2) There appears to exist a special vowel usually named "ui", which looks like a combination of i (above) and u (below). How is it supposed to be encoded in Unicode? u before i (as pronounced)? i before u (as usual with Unicode, above before below)?

3) The vowel bearer (1021) is reported to be the one to use in initial position when there is no consonant, along with the appropriate vowel sign. However, Unicode also encodes separate characters for the independent vowels, which do not look like the bearer plus the vowel sign. I.e. there is no Long A (a space is available at 1022), so I understand I have to encode it as U+1021 U+102C. However, for short i, I can use either U+1021 U+102D, or U+1023. What is the preference?

4) I have in my references another glyph, which looks like 4 but with a straight leg; it is the same as the first part of U+104E, "aforementioned". I do not know the name of the symbol ("leng"?), nor its real use (I guess it is used only as part of the U+104E abbreviation). However, what is the recommended translation for such a symbol if we encounter it in the wild?

5) I can't figure out what "kywe" looks like. Is it base_ka + wa_below + ya_to_the_right (but then what is the difference with "*kwye"?), or is it base_ka + ya_to_the_right + a_special_wa_deep_below, the latter being under the "arch" of the ya?
Since a drawing always is easier to understand, here are my ideas:

[ASCII-art sketch of the candidate "kywe" shapes relative to the baseline; layout not recoverable]

Thanks in advance for your answer.

Antoine
[very OT] Documentation: beyond 65,536 ; misc Semitic ?s
Hello,

Within the book, Unicode 3.0, is there somewhere a long section I missed about all the stuff that happens beyond the "first 65,536," in addition to surrogate stuff? Is there other documentation somewhere?

Today are there still 7,827 unused code values? Will they be unassigned until version 4.0 gels?

Also, is there a linguistic index to Unicode character database files, saying which mention Semitic languages?

And finally, is there documentation somewhere on whether 3.0 has complete symbols for the 18 languages written in Arabic script that are mentioned in the book?

Thanks
Elaine
Re: Surrogate space in Unicode
See end ->

- Original Message -
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Friday, February 16, 2001 6:05 AM
Subject: Re: Surrogate space in Unicode

> In a message dated 2001-02-15 15:26:55 Pacific Standard Time, [EMAIL PROTECTED]
> writes:
>
> > > At 2001-02-06 07:48:29 -0800 Mark Davis wrote:
> > >> At 2001-02-06 01:51 "nikita k" <[EMAIL PROTECTED]> wrote:
> > >> What is surrogate space in unicode?
> >
> > (Mark defines various terms relating to 'supplementary' and 'surrogate')
> >
> > So, I guess it's safe to say that a surrogate code point is
> > a surrogate code point... which is a surrogate for a supplementary
> > code point, which is a code point between something and something
> > else.
> >
> > Someone needs to take a break from the bureaucrateze and learn
> > again how to communicate clearly. Is that not a part of the
> > goal, here?
>
> I thought Mark's definitions were both accurate and clear, unlike John's
> rejoinder, which was neither.
>
> It has proven difficult to come up with convenient terms for the Unicode
> characters encoded at U+10000 and beyond. The term 'surrogate' has been
> misused in an attempt to do this. It is important to use consistent terms
> that demonstrate an understanding of what is going on.
>
> I am not a member of the Consortium, and certainly would not consider myself
> a bureaucrat, so I will take a stab at this in the plainest English I can find
> that does not sacrifice accuracy.
>
> 1. A Unicode 'code point' is a number between 0 and 1,114,111 inclusive,
> usually expressed in hexadecimal (U+0000 through U+10FFFF). Not every code
> point necessarily represents a valid character, although most do. For
> example, there is no character encoded at U+FFFF.
>
> 2. A 'basic' code point, which may represent a 'basic character', can range
> from U+0000 through U+FFFF. The remaining code points (U+10000 through
> U+10FFFF) are 'supplementary' code points, each of which may represent a
> 'supplementary character'.
>
> 3. 'Surrogate' code points range from U+D800 through U+DFFF (not U+DC00).
> They do not directly represent characters (so there is no such thing as a
> 'surrogate character'), but two of them may be used together according to the
> rules of UTF-16 to represent a supplementary character. The two surrogate
> code points used for this purpose would be called a 'surrogate pair'. Don't
> separate them.
>
> Is that better?

It's clearer, but misses what I understand to be the absolutely crucial distinction between a code point (correctly defined) and a code unit (mentioned by Mark but not by Doug). For what a code unit is, see http://www.unicode.org/unicode/reports/tr17

I would question whether 'surrogate code points' are really code points. In the sense that they are a subset of 'code points' as defined, I guess they are; but they are not only unlike every other code point in that they "do not directly represent characters", they are explicitly and inexorably disqualified from so doing, being reserved for use, in pairs, as UTF-16 code units. (Which is what Mark said, of course.)

Looked at in this way, surely it makes it clearer that the transcoding of a surrogate (code point) into UTF-8 is an abomination.

Simplification is all very well, but it can be taken too far, as when important distinctions are lost.

For what it's worth,

Mike.

***
J M Sykes
Email: [EMAIL PROTECTED]
97 Oakdale Drive Heald Green CHEADLE Cheshire SK8 3SN UK
Tel: (44) 161 437 5413
***
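The relationship between a supplementary code point and the pair of UTF-16 code units that represents it is plain arithmetic, sketched below. The function names are invented for this illustration; the constants are the standard UTF-16 ones. The last few lines show that Python's strict UTF-8 encoder refuses to transcode a lone surrogate, in line with the objection raised above.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                 # 20 significant bits
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# A lone surrogate is not a character, so the strict encoder rejects it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}", lone_surrogate_rejected)
```

The two halves only have meaning together, which is why the quoted advice is not to separate them.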
RE: Unicode Transcriptions
On Fri, 16 Feb 2001, Marco Cimarosti wrote: > 2) Which Chinese dialect to adopt for transliterating. Mandarin would be the most likely. > Notice the particularities of Bopomofo spelling: > > - the sound (spelled "ong" in pinyin) is spelled "u-eng"; > - there is no "y" in "yi"; > - there is no sign to indicate the 1st tone. [snip] > Also notice that you may have a few typographical problems in producing the > picture: > > a) In most fonts, the glyph for vowel i is a horizontal line. This is only > valid for vertical texts: in horizontal writing it should be vertical. > (Suggestion: you may substitute it with an uppercase I from a sans-serif > font). Yes, you are right about this. I don't know why TUS3.0 p. 278 says "The character U+3127 BOPOMOFO LETTER I is usually written as a vertical stroke when Bopomofo text is set vertically.", which is *wrong*. > b) The glyph for the "combining breve" (3rd tone) is normally designed to > fit on western lowercase vowels. (Suggestion: if you use a bigger size for > the combining marks, you might get a correct result). I've made two .gif files demonstrating Bopomofo typography: http://deall.ohio-state.edu/grads/chan.200/misc/biaozhunwanguoma.gif http://deall.ohio-state.edu/grads/chan.200/misc/tongyima.gif Both depict left-to-right Han character text, and each character is annotated on its right side with top-to-bottom Bopomofo text. (Alternatively, I could have created versions where the Han character text runs top-to-bottom, and each character is annotated on its right side with top-to-bottom Bopomofo text, but I didn't.) Note the place of the tone diacritics, which is "stacked" even more to the right than the Bopomofo consonants and vowels. Thomas Chan [EMAIL PROTECTED]
RE: Unicode Transcriptions
Subject "RE: Unicode Transcriptions" (I am resending this message because the first version contained too many errors even for my standards:-) Mark Davis wrote: > I am still missing Bopomofo, > [...] > Also, Ken suggested that the Bopomofo should be a Bopomofo transcription of > the Chinese for Unicode, not a transliteration from English. Can anyone > supply that? Once you accept Ken's suggestion, you have two more decisions to make: 1) Which Chinese name to use (you have two on your page, one of which is in both simplified and traditional characters); 2) Which Chinese dialect to adopt for transliterating. Assuming that (1) you want to use the 3-syllable name ("統一碼", which is also used in http://www.unicode.org/unicode/standard/WhatIsUnicode.html) and that (2) you want the official Putonghua (Mandarin) pronunciation, here is what it would be: Chinese:統一碼 Pinyin: tŏngyīmă Bopomofo: ㄊㄨㄥ̆ ㄧ ㄇㄚ̆ Codepoints: 310A 3128 3125 0306 0020 3127 0020 3107 311A 0306 Notice the particularities of Bopomofo spelling: - the sound [uŋ] (spelled "ong" in pinyin) is spelled "u-eng"; - there is no "y" in "yi"; - there is no sign to indicate the 1st tone. Also notice that you may have a few typographical problems in producing the picture: a) In most fonts, the glyph for vowel i is a horizontal line. This is only valid for vertical texts: in horizontal writing it should be vertical. (Suggestion: you may substitute it with an uppercase I from a sans-serif font). b) The glyph for the "combining breve" (3rd tone) is normally designed to fit on western lowercase vowels. (Suggestion: if you use a bigger size for the combining marks, you might get a correct result). Ciao. Marco
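The code point list in this message can be turned back into text and checked against the character names, which also makes the "u-eng" spelling visible. A small sketch using Python's `unicodedata` module (the variable names are illustrative, and the names come from the interpreter's Unicode data):

```python
import unicodedata

# The sequence exactly as listed in the message.
CODE_POINTS = "310A 3128 3125 0306 0020 3127 0020 3107 311A 0306"

text = "".join(chr(int(h, 16)) for h in CODE_POINTS.split())
names = [unicodedata.name(ch) for ch in text]

for h, name in zip(CODE_POINTS.split(), names):
    print(f"U+{h} {name}")
```

The first syllable comes out as T, U, ENG plus a combining mark, the second as the single letter I with no tone sign, and the third as M, A plus a combining mark, matching the three spelling notes in the message.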
re: Unicode Transcriptions
Hi Mark. You wrote: > I am still missing Bopomofo, > [...] > Also, Ken suggested that the Bopomofo should be a Bopomofo transcription of > the Chinese for Unicode, not a transliteration from English. Can anyone > supply that? Once you accept Ken's suggestion, you have two more decisions to make: 1) Which Chinese name to use (you have two on your page); 2) Which Chinese dialect to adopt for transliterating. Assuming that (1) you want to use the 3-syllable name ("統一碼", which is also used in ) and that (2) you want the official Putonghua (Mandarin) pronunciation, here is what it would be: Chinese:統一碼 Pinyin: tŏngyīmă Bopomofo: ㄊㄨㄥ̆ ㄧ ㄇㄚ̆ Unicodes: 310A 3128 3125 0306 0020 3127 0020 3107 311A 0306 Notice the particularities of Bopomofo spelling: - the sound [uŋ] ("ong" in pinyin) is spelled "u-eng"; - there is no "y" in "yi"; - there is no sign to indicate the 1st tone. Also notice that you may have a few typographical problems in producing the picture: a) In most fonts, the glyph for vowel i is a horizontal line. This is only valid for vertical texts: in horizontal spelling it should be vertical. (Suggestion: you may substitute it with an uppercase I from a sans-serif font). b) The glyph for the "combining breve" (3rd tone) is normally designed to fit on western lowercase vowels. (suggestion: if you may a bigger size for the combining marks, you might get the good result). Ciao. Marco
Re: Surrogate space in Unicode
    Because of the widespread belief that Unicode stops at U+FFFF,
    many fonts and applications that claim to support Unicode can
    only handle basic characters, not supplementary characters.

Right. (Is it really a widespread belief? That's something I've been wondering.)

So using the plain English term "basic" to describe that subset of Unicode is misleading.

I agree with you that the language in the standard needs updating.

-t