Re: Four characters from Greek Extended block missing?

2001-02-16 Thread Nick NICHOLAS

On Fri, 16 Feb 2001, Otto Stolz wrote:

> So the questions are:
> - are the above-mentioned lower-case upsilon composites useless,
>   and entered Unicode only by an oversight, or
> - are their upper-case equivalents missing by an oversight, or
> - is there indeed a rationale for this anomaly?

The Upsilons with smooth breathings are unacceptable word-initially in
Attic; the only exception I found in Liddell-Scott-Jones was the old name
of the letter itself, U)=. The lowercase glyph is acceptable in Attic,
because it can occur as the second letter of an initial diphthong; in
old typographies where all-caps words had accents, this can also occur
with capital upsilon. Upsilon with smooth breathing can additionally
occur word-initially in other dialects, but these two cases are rare
enough that no standard has rushed to include them.

In our corpus of 76 million words of Greek, initial capital upsilon with a
smooth breathing occurs 37 times; lowercase upsilon with a smooth breathing
occurs 373 times. With epigraphical data, this will obviously be more
frequent.

-- 
Nick Nicholas. TLG, UCI, USA. [EMAIL PROTECTED]; www.tlg.uci.edu/~opoudjis
 Many among their proselytes had sold their lands and houses to increase
  the public riches of the sect --- at the expense, indeed, of their
  unfortunate children, who found themselves beggars because their
  parents had been saints. (Edward Gibbon, _Decline and Fall_.)




Unicode character encoding statistics

2001-02-16 Thread Kenneth Whistler

BTW, if anyone was wondering where I came up with the
figure 880,325 reserved unassigned code points for Unicode
3.1, here are the complete statistics for Unicode 3.0 and
Unicode 3.1:

Unicode:                  U 3.0     U 3.1

BMP Alphas/Symbols        10236     10238
Suppl Alphas/Symbols          -      1691
Han (URO)                 20902     20902
Han (Ext A)                6582      6582
Han (Ext B)                   -     42711
Han Compat                  302       302
Suppl Han Compat              -       542
Hangul Syllables          11172     11172

Subtotal                  49194     94140

BMP Private Use            6400      6400
Suppl Private Use        131068    131068
Surrogate Code Points      2048      2048
Controls                     65        65
BMP Noncharacters             2        34
Suppl Noncharacters          32        32
BMP Reserved               7827      7793
Suppl Reserved           917476    872532

The total number of code points accounted for
here is 1,114,112 (= 17 x 64K), i.e.
U+0000..U+10FFFF.
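
The arithmetic can be cross-checked mechanically. What follows is a minimal Python sketch with the counts transcribed from the table above; the names `COUNTS`, `column_total` and `reserved` are illustrative only, and the blank Unicode 3.0 entries are assumed to mean zero.

```python
# Code point counts transcribed from the table: (Unicode 3.0, Unicode 3.1).
# Assumption: a blank entry in the 3.0 column means zero.
COUNTS = {
    "BMP Alphas/Symbols":    (10236, 10238),
    "Suppl Alphas/Symbols":  (0, 1691),
    "Han (URO)":             (20902, 20902),
    "Han (Ext A)":           (6582, 6582),
    "Han (Ext B)":           (0, 42711),
    "Han Compat":            (302, 302),
    "Suppl Han Compat":      (0, 542),
    "Hangul Syllables":      (11172, 11172),
    "BMP Private Use":       (6400, 6400),
    "Suppl Private Use":     (131068, 131068),
    "Surrogate Code Points": (2048, 2048),
    "Controls":              (65, 65),
    "BMP Noncharacters":     (2, 34),
    "Suppl Noncharacters":   (32, 32),
    "BMP Reserved":          (7827, 7793),
    "Suppl Reserved":        (917476, 872532),
}

CODESPACE = 17 * 0x10000  # 17 planes of 64K code points each


def column_total(version_index):
    """Sum one column: 0 for Unicode 3.0, 1 for Unicode 3.1."""
    return sum(row[version_index] for row in COUNTS.values())


def reserved(version_index):
    """Reserved (unassigned) code points on the BMP plus the supplementary planes."""
    return (COUNTS["BMP Reserved"][version_index]
            + COUNTS["Suppl Reserved"][version_index])


print(column_total(0) == CODESPACE, column_total(1) == CODESPACE)
print(reserved(1))
```

Both columns add up to the full codespace, and the Unicode 3.1 reserved figure is the sum of the two "Reserved" rows.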

--Ken



Re: Four characters from Greek Extended block missing?

2001-02-16 Thread Kenneth Whistler

Otto Stolz asked:

> in the Greek Extended block, five of the lower-case characters
> do not have upper-case equivalents, viz.
>   U+1FE4 GREEK SMALL LETTER RHO WITH PSILI
>   U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI
>   U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
>   U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
>   U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
> 

> However, the missing upsilon variants escape my understanding:
> - word-initial upsilon (both lower-case and upper-case) must take
>   a breathing mark,
> - medial and final upsilons do not take breathing marks.
> So, you will either need both sorts of marks on both cases,
> or you will need only dasia on both cases (I do not remember any
> word starting with psili-upsilon, but my Greek is rather rusty).
> 
> So the questions are:
> - are the above-mentioned lower-case upsilon composites useless,
>   and entered Unicode only by an oversight, or

No. Initial upsilon with PSILI (smooth breathing) is exceedingly
rare in classical Greek, but it does occur. I find exactly two
instances in my copy of the intermediate Greek-English Lexicon
(Liddell and Scott):

One entry showing 1F54 ~ 1F56 meaning "sound to imitate a person
snuffing a feast" [sic].

And one head entry in caps showing ῎ΥΡΧΑ meaning "a jar, for
pickles".

Clearly these are both "funny" words. The first is onomatopoetic,
and the second is probably a borrowing of some sort from a non-Greek
language. The vast preponderance of upsilon-initial words in classical
Greek have rough breathings.

No doubt someone with access to more extensive classical and
Byzantine Greek lexica might turn up a few other instances,
including, I am guessing, instances of 1F50 and 1F52.

> - are their upper-case equivalents missing by an oversight, or

I don't think so.

> - is there indeed a rationale for this anomaly?

The entire 1FXX set was provided by ELOT,
the Greek national body, and they had prescriptive, as well
as descriptive intent in choosing the set that they did. I suspect
that they thought that uppercase initial upsilon with a smooth
breathing would not fit their orthographic rules for polytonic
Greek (although there are instances of it in print, as in the
uppercase head entry in Liddell and Scott for "pickle jar").

And in any case, by use of the spacing breathing/accent
combinations U+1FCE, etc., plus regular uppercase upsilon,
you can represent any of the missing letters, anyway. (As I
have done above for the all caps pickle jar entry.)

> Note that the code-points where you would expect these upper-case
> upsilon compositions, viz. U+1F58 U+1F5A U+1F5C U+1F5E, are left
> unassigned (reserved).
> 
> Can anybody shed some light on this anomaly: either explain the
> underlying rationale, or acknowledge the oversight?

The Unicode take on this is that the entire block U+1F00..U+1FFE
of precomposed polytonic Greek is unnecessary, since it is
all decomposable into the regular Greek alphabet and a small
number of accents.

There clearly would be no benefit at this point in adding in
the 4 (or 5) "missing" polytonic Greek characters, since in *all*
Unicode normalization forms they would end up being decomposed into
the already existing combining character sequences that can be
used to represent them now without any character additions.
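
The decomposition behaviour of the existing characters can be inspected with Python's standard `unicodedata` module. This is only a sketch, assuming the Unicode tables bundled with the interpreter; the helper `codepoints` is a made-up name, and the example looks at characters that already exist, not at hypothetical additions.

```python
import unicodedata


def codepoints(text):
    """Return the code points of `text` as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The four precomposed lowercase upsilon-with-psili characters from the thread.
for ch in "\u1f50\u1f52\u1f54\u1f56":
    nfd = unicodedata.normalize("NFD", ch)
    print(unicodedata.name(ch), "->", codepoints(nfd))

# Capital upsilon followed by the combining psili (U+0313) has no precomposed
# form, so NFC leaves it as a two-code-point sequence.
capital = "\u03a5\u0313"
print(codepoints(unicodedata.normalize("NFC", capital)))

# The lowercase sequence, by contrast, composes to U+1F50 under NFC.
print(codepoints(unicodedata.normalize("NFC", "\u03c5\u0313")))
```

The uppercase letters are thus representable today as a base letter plus combining marks, and that sequence is stable under normalization.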

--Ken




Re: [very OT] Documentation: beyond 65,536 ; misc Semitic ?s

2001-02-16 Thread Kenneth Whistler

Elaine Keown asked:

> Within the book,  Unicode 3.0, is there somewhere a long section I 
> missed about all the stuff that happens beyond the "first 65,536," 
> in addition to surrogate stuff?  

No.

> Is there other documentation somewhere?

Yes -- in the next version of the standard. See:

http://www.unicode.org/unicode/reports/tr27/

and

http://www.unicode.org/charts/draftunicode31/
  
> 
> Today are there still 7,827 unused code values? 

Actually, there are 880,325 reserved unassigned code points
(7,793 on the BMP and 872,532 on the supplementary planes).

> Will they be unassigned until version 4.0 gels?

No. Unicode 3.1 has already been approved, and is in the
last stages of publication. After that, Unicode 3.2 will
appear, adding over 1000 more characters to the BMP. Unicode
Version 4.0 is beyond that, and will, no doubt, add another
collection of characters.

> 
> Also, is there a linguistic index to Unicode character 
> database files, saying which mention Semitic languages?

No. But simple tools like grep enable you to pull out
all instances of ARABIC, HEBREW, or SYRIAC characters, if
you want.
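
As a rough equivalent of the grep approach, the sketch below filters on character names using Python's bundled `unicodedata` module in place of the UnicodeData file itself. The function name `characters_named` and the restriction to the BMP are choices made for this example, and the counts depend on the Unicode version shipped with the interpreter.

```python
import unicodedata


def characters_named(script_word, limit=0x10000):
    """Return characters below `limit` whose name starts with `script_word`."""
    found = []
    for cp in range(limit):
        # Unnamed code points (unassigned, surrogates, controls) yield "".
        name = unicodedata.name(chr(cp), "")
        if name.startswith(script_word + " "):
            found.append(chr(cp))
    return found


for word in ("ARABIC", "HEBREW", "SYRIAC"):
    chars = characters_named(word)
    print(word, len(chars))
```

A name-based filter like this says which characters carry a script name; it does not say which languages they serve.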

> 
> And finally, is there documentation somewhere on whether 3.0 
> has complete symbols for the 18 languages written in Arabic 
> script that are mentioned in the book?

I presume you are talking about letters and points, rather
than symbols per se. The consortium doesn't have any explicit
language-by-language listing of Arabic alphabets and their
correlation with the encoded characters. However, the UTC does
consider the current encoding to be complete for the languages
that are explicitly mentioned, as well as for many others written
with the Arabic script that are not explicitly mentioned.

--Ken
  



Four characters from Greek Extended block missing?

2001-02-16 Thread Otto Stolz

Hello,

in the Greek Extended block, five of the lower-case characters
do not have upper-case equivalents, viz.
  U+1FE4 GREEK SMALL LETTER RHO WITH PSILI
  U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI
  U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
  U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
  U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI

The Rho with psili is indeed only needed in lower-case:
- word-initial rho (upper-case or lower-case) takes a dasia,
- a double-rho within a word can be adorned with a psili and
  a dasia, which is not done in upper-case typing,
- no other medial or final rho takes a breathing mark.

However, the missing upsilon variants escape my understanding:
- word-initial upsilon (both lower-case and upper-case) must take
  a breathing mark,
- medial and final upsilons do not take breathing marks.
So, you will either need both sorts of marks on both cases,
or you will need only dasia on both cases (I do not remember any
word starting with psili-upsilon, but my Greek is rather rusty).

So the questions are:
- are the above-mentioned lower-case upsilon composites useless,
  and entered Unicode only by an oversight, or
- are their upper-case equivalents missing by an oversight, or
- is there indeed a rationale for this anomaly?
Note that the code-points where you would expect these upper-case
upsilon compositions, viz. U+1F58 U+1F5A U+1F5C U+1F5E, are left
unassigned (reserved).

Can anybody shed some light on this anomaly: either explain the
underlying rationale, or acknowledge the oversight?
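
The gap can be observed with Python's standard `unicodedata` module. This is a sketch that assumes the Unicode tables bundled with the interpreter; `EXPECTED_SLOTS` is an illustrative name, not anything defined by the standard.

```python
import unicodedata

# The four code points where the uppercase forms would be expected.
EXPECTED_SLOTS = [0x1F58, 0x1F5A, 0x1F5C, 0x1F5E]

for cp in EXPECTED_SLOTS:
    # General category "Cn" means the code point is unassigned.
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)))

# Uppercasing the lowercase letters yields a base letter plus a combining
# mark rather than a single precomposed character.
for ch in "\u1f50\u1fe4":
    upper = ch.upper()
    print(unicodedata.name(ch), "->", [f"U+{ord(c):04X}" for c in upper])
```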

Best wishes,
  Otto Stolz



Re: Surrogate space in Unicode

2001-02-16 Thread Kenneth Whistler

Tom Lord asked:

> >   It has proven difficult to come up with convenient terms for
> >   the Unicode characters encoded at U+10000 and beyond.
> >   []
> >   2.  A 'basic' code point, which may represent a 'basic
> >   character', can range from U+0000 through U+FFFF.
> >  
> >  For what purpose is such a distinction needed?  
> 

And Doug Ewell answered:

> It is needed because of UTF-16, which requires two 16-bit code points to 
> represent a character with a value of U+10000 or higher (a supplementary 
> character) but only one 16-bit code point to represent a basic character.

This is correct, except that it is two 16-bit code *units* required to
represent supplementary characters.

For the UTF-32 encoding form, there is nothing special about supplementary
characters (characters whose Unicode scalar value, i.e. code point, is
between 0x10000 and 0x10FFFF), except that they've only recently started
to be standardized.

For the UTF-8 encoding form, supplementary characters are represented in
4 bytes, while basic characters are represented in 1, 2, or 3 bytes. This
could have an implication for an implementation, although proper UTF-8
implementations should already be handling them correctly. The big issue
is for UTF-8 implementations that *incorrectly* handle supplementary
characters as sequences of two 3-byte representations of surrogate code
points. In order to talk meaningfully about those issues, a terminological
distinction between basic and supplementary characters is useful.
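
The difference between the correct 4-byte form and the incorrect 6-byte form can be seen with Python's built-in codecs. This is a small sketch; the variable names are illustrative, and `surrogatepass` is the Python error handler that has to be requested explicitly before the encoder will emit surrogates at all.

```python
basic = "\u1f50"              # a basic (BMP) character
supplementary = "\U00010000"  # the first supplementary code point

print(len(basic.encode("utf-8")))          # 3
print(len(supplementary.encode("utf-8")))  # 4

# Incorrect form: each UTF-16 surrogate encoded separately as 3 bytes.
pair = "\ud800\udc00"
wrong = pair.encode("utf-8", "surrogatepass")
print(wrong.hex(), len(wrong))

# A strict UTF-8 encoder refuses surrogate code points outright.
try:
    pair.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 encoder rejects surrogate code points")
```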

For the UTF-16 encoding form, as Doug pointed out, the difference is between
1 code unit versus 2 code units for representation of a code point. 
That distinction is rather significant for many Unicode implementations,
and again a terminological distinction is useful.
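
The 1-versus-2 code unit split follows directly from the UTF-16 arithmetic. The function below is a minimal sketch of that arithmetic; `utf16_code_units` is a name invented for this example, and the result is compared against Python's own `utf-16-be` codec.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as integers) for one Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if code_point < 0x10000:
        return [code_point]                  # basic character: one code unit
    offset = code_point - 0x10000            # 20 bits, split 10 + 10
    return [0xD800 + (offset >> 10),         # high (leading) surrogate
            0xDC00 + (offset & 0x3FF)]       # low (trailing) surrogate


# Cross-check against the built-in codec for a supplementary character.
encoded = "\U00010400".encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]
print([hex(u) for u in utf16_code_units(0x10400)], [hex(u) for u in units])
```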

Finally, for comparison to ISO/IEC 10646, it is also useful to have a
terminological distinction that lines up with the international standard.
10646 has settled on the term "supplementary planes" to refer to Planes
1 through 16, so the use of the term "supplementary character" in Unicode
to refer to characters encoded on the supplementary planes makes it easier
to understand what is intended, no matter which of the two standards you
are coming from.

> 
> Many descriptions on the Web erroneously claim that Unicode contains only the 
> first 64K characters of ISO 10646.  Even the Unicode Standard Version 3.0 
> states, "Plain Unicode text consists of sequences of 16-bit character codes." 
>  To me this sentence is very misleading and requires that special attention 
> be paid to the nature of supplementary characters, those to be assigned in 
> Unicode 3.1 and those to be assigned in future versions.

That sentence will be updated eventually.

The critical piece of text in the standard is conformance clause C1 on
page 37, which currently reads:

"C1 A process shall interpret Unicode code values as 16-bit quantities.

* Unicode values can be stored in native 16-bit machine words."

In Unicode 3.1, about to be published in UAX #27, that wording is being
changed to:

"C1 A process shall interpret the Unicode code units in accordance with
the Unicode Transformation Format used.

* The Unicode Standard defines code points (scalar values) that can
be encoded in any of three transformation formats (encoding forms):
UTF-8, UTF-16, or UTF-32."

The PDUTR #27 text currently accessible on the website does not yet
show this change, which was just accepted at the recent UTC meeting,
but expect an updated text for what will eventually become UAX #27
to show up on the site in approximately a week.

--Ken



RE: Unicode Transcriptions

2001-02-16 Thread Kenneth Whistler

Thomas Chan noted:

> Yes, you are right about this.  I don't know why TUS3.0 p. 278 says "The 
> character U+3127 BOPOMOFO LETTER I is usually written as a vertical
> stroke when Bopomofo text is set vertically.", which is *wrong*.

This is a x/y axis dyslexia that set in when a text correction
was misapplied to the text. I am reporting it to errata.

--Ken



Re: Surrogate space in Unicode

2001-02-16 Thread DougEwell2

In a message dated 2001-02-16 0:19:01 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

>   Because of the widespread belief that Unicode stops at U+FFFF,
>   many fonts and applications that claim to support Unicode can
>   only handle basic characters, not supplementary characters.
>  
>  Right.  (Is it really a widespread belief?  That's something I've
>  been wondering.)

Well, [EMAIL PROTECTED] seems to think so:

>  > Many descriptions on the Web erroneously claim that Unicode contains
>  > only the first 64K characters of ISO 10646.
>
>  Well, AFAICT it's true.
>
>  At some point in the future I suppose it will cease to be true, but if you
>  say "is" you should be talking about the present.

Unicode has been defined as ranging from U+0000 to U+10FFFF for several years 
now.  The fact that no characters have been assigned beyond U+FFFF before 
Unicode 3.1 (which is still in beta) does not change this.

>  > Because of the widespread belief that Unicode stops at U+FFFF, many
>  > fonts and applications that claim to support Unicode can only handle
>  > basic characters, not supplementary characters.
>
>  The code I wrote is like that, and it'll remain like that for as long as
>  that's all that can be tested and used in real life.

You can already test private-use characters in the U+Fxxxx and U+10xxxx 
ranges.  Saying that your code shouldn't have to work with characters beyond 
U+FFFF because no such characters have been assigned yet is like saying it 
shouldn't have to support U+20B0 through U+20CF.  You know characters will be 
assigned to that range some day, possibly sooner than you think.
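
For instance, the supplementary private-use planes (planes 15 and 16) can be exercised today. The sketch below assumes Python's bundled `unicodedata` tables; `PRIVATE_USE_RANGES` and `round_trips` are names made up for this example.

```python
import unicodedata

# Planes 15 and 16 are private use, except the last two code points of each.
PRIVATE_USE_RANGES = [(0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def round_trips(ch):
    """True if `ch` survives encoding and decoding in UTF-8 and UTF-16."""
    return all(ch.encode(enc).decode(enc) == ch for enc in ("utf-8", "utf-16-be"))


for first, last in PRIVATE_USE_RANGES:
    print(f"U+{first:X}..U+{last:X}",
          unicodedata.category(chr(first)),   # "Co" = private use
          unicodedata.category(chr(last)),
          round_trips(chr(first)), round_trips(chr(last)))

# Number of supplementary private-use code points.
total = sum(last - first + 1 for first, last in PRIVATE_USE_RANGES)
print(total)
```

An application that handles these code points correctly has already exercised the code paths it will need for assigned supplementary characters.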

Back to [EMAIL PROTECTED]:

>  So using the plain english term "basic" to describe that subset
>  of Unicode is misleading.
>  
>  I agree with you that the language in the standard needs updating.

I think that has been tried already, and 'basic' was the best anyone could 
do.  Terms involving 'planes', such as 'BMP' and 'supplementary planes', are 
discouraged because planes per se are not part of Unicode, only ISO/IEC 10646.

I personally don't like 'basic' and 'supplementary' because they seem to 
imply that the first 64K code points are better in some way, but the most 
important thing is that the terminology remain consistent, even if flawed.

-Doug Ewell
 Fullerton, California



Re: Surrogate space in Unicode

2001-02-16 Thread DougEwell2

In a message dated 2001-02-16 7:56:12 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

>  It's clearer, but misses what I understand to be the absolutely crucial
>  distinction between a code point (correctly defined) and a code unit
>  (mentioned by Mark but not by Doug). For what a code unit is, see
>  http://www.unicode.org/unicode/reports/tr17

I didn't mention code units because, embarrassingly, I am still having a hard 
time telling the difference between code points and code units.  I have read 
UTR #17 many times and am still somewhat confused.  I'll try again.

>  I would question whether 'surrogate code points' are really code points. In
>  the sense that they are a subset of 'code points' as defined, I guess they
>  are; but they are not only unlike every other code point in that they "do
>  not directly represent characters", they are explicitly and inexorably
>  disqualified from so doing, being reserved for use, in pairs, as UTF-16
>  code units. (Which is what Mark said, of course.)

I think they would still be code points, just like 0xFFFE and 0xFFFF (and now 
others) which are guaranteed never to be characters, for a different reason.

>  Looked at in this way, surely it makes it clearer that the transcoding of a
>  surrogate (code point) into UTF-8 is an abomination.
>  
>  Simplification is all very well, but it can be taken too far, as when
>  important distinctions are lost.

Yes, that is true.  I might have known better than to respond to a "cut the 
mumbo-jumbo" post.  Einstein said, "Everything should be made as simple as 
possible, but not one bit simpler," and I think that is especially true when 
working with standards and specifications, where precise and unambiguous 
wording is crucial.

-Doug Ewell
 Fullerton, California



Myanmar questions

2001-02-16 Thread Antoine Leca

Hi folks,

I am looking at Burmese (beg your pardon, Myanmar), and I can find answers
in my available sources to most of my questions; however, I still have some
unanswered ones.

1) For the "au" dependent vowel, I believe (extrapolating from the one for "o")
the correct encoding is U+1031 U+102C U+1039. However, the use of the virama
inside a "matra" part looks surprising to me (and it creates a problem for my
renderer).

2) There appears to exist a special vowel usually named "ui", which looks like
a combination of i (above) and u (below). How is it supposed to be encoded
in Unicode? u before i (as pronounced)? i before u (as usual with Unicode,
above before below)?

3) The vowel bearer (1021) is reported to be the one to use initially when
there is no consonant, along with the appropriate vowel sign. However, Unicode
also encodes individual glyphs for the independent vowels, which do not look
like the bearer plus the vowel sign. For example, there is no Long A (a slot is
available at 1022), so I understand I have to encode it as U+1021 U+102C.
However, for short i, I can use either U+1021 U+102D or U+1023. What is the
preference?

4) I have in my references another glyph, which looks like 4 but with a straight
leg; it is the same as the first part of U+104E, "aforementioned". I do not know
the name of the symbol ("leng"?), nor its real use (I guess it is used only as
part of the U+104E abbreviation). However, what is the recommended translation
for such a symbol if we encounter it in the wild?

5) I can't figure out what "kywe" looks like. Is it base_ka + wa_below + ya_to_the_right
(but then what is the difference with "*kwye"?), or is it base_ka + ya_to_the_right
+ a_special_wa_deep_below, the latter being under the "arch" of the ya?

Since a drawing always is easier to understand, here are my ideas:

[ASCII sketch of the two candidate "kywe" layouts relative to the baseline;
the drawing did not survive archiving.]



Thanks in advance for your answer.

Antoine



[very OT] Documentation: beyond 65,536 ; misc Semitic ?s

2001-02-16 Thread Elaine Keown

Hello, 

Within the book,  Unicode 3.0, is there somewhere a long section I missed about all 
the stuff that happens beyond the "first 65,536," in addition to surrogate stuff?  Is 
there other documentation somewhere?  

Today are there still 7,827 unused code values?  Will they be unassigned until version 
4.0 gels?

Also, is there a linguistic index to Unicode character database files, saying which 
mention Semitic languages?

And finally, is there documentation somewhere on whether 3.0 has complete symbols for 
the 18 languages written in Arabic script that are mentioned in the book?  

Thanks
Elaine




Re: Surrogate space in Unicode

2001-02-16 Thread J M Sykes

See end ->

- Original Message -
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Friday, February 16, 2001 6:05 AM
Subject: Re: Surrogate space in Unicode


> In a message dated 2001-02-15 15:26:55 Pacific Standard Time,
> [EMAIL PROTECTED] writes:
>
> > > At 2001-02-06 07:48:29 -0800 Mark Davis wrote:
> >  >> At 2001-02-06 01:51 "nikita k" <[EMAIL PROTECTED]> wrote:
> >  >> What is surrogate space in unicode?
> >
> >  (Mark defines various terms relating to 'supplementary' and
> >  'surrogate')
> >
> >  So, I guess it's safe to say that a surrogate code point is
> >  a surrogate code point... which is a surrogate for a supplementary
> >  code point, which is a code point between something and something
> >  else.
> >
> >  Someone needs to take a break from the bureaucrateze and learn
> >  again how to communicate clearly.  Is that not a part of the
> >  goal, here?
>
> I thought Mark's definitions were both accurate and clear, unlike John's
> rejoinder, which was neither.
>
> It has proven difficult to come up with convenient terms for the Unicode
> characters encoded at U+10000 and beyond.  The term 'surrogate' has been
> misused in an attempt to do this.  It is important to use consistent terms
> that demonstrate an understanding of what is going on.
>
> I am not a member of the Consortium, and certainly would not consider
> myself a bureaucrat, so I will take a stab at this in the plainest English
> I can find that does not sacrifice accuracy.
>
> 1.  A Unicode 'code point' is a number between 0 and 1,114,111 inclusive,
> usually expressed in hexadecimal (U+0000 through U+10FFFF).  Not every
> code point necessarily represents a valid character, although most do.
> For example, there is no character encoded at U+FFFF.
>
> 2.  A 'basic' code point, which may represent a 'basic character', can
> range from U+0000 through U+FFFF.  The remaining code points (U+10000
> through U+10FFFF) are 'supplementary' code points, each of which may
> represent a 'supplementary character'.
>
> 3.  'Surrogate' code points range from U+D800 through U+DFFF (not U+DC00).
> They do not directly represent characters (so there is no such thing as a
> 'surrogate character'), but two of them may be used together according to
> the rules of UTF-16 to represent a supplementary character.  The two
> surrogate code points used for this purpose would be called a 'surrogate
> pair'.  Don't separate them.
>
> Is that better?
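
The three definitions quoted above can be written down as a small classifier. This is a sketch only: `classify` is an invented name, and reporting surrogates as their own class (although they lie numerically inside the basic range) is a choice made for this example.

```python
def classify(code_point):
    """Classify a code point as 'surrogate', 'basic' or 'supplementary'."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        return "surrogate"        # never a character on its own
    if code_point <= 0xFFFF:
        return "basic"            # first 64K code points
    return "supplementary"        # U+10000 through U+10FFFF


for cp in (0x0041, 0xD800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}", classify(cp))
```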

It's clearer, but misses what I understand to be the absolutely crucial
distinction between a code point (correctly defined) and a code unit
(mentioned by Mark but not by Doug). For what a code unit is, see
http://www.unicode.org/unicode/reports/tr17

I would question whether 'surrogate code points' are really code points. In
the sense that they are a subset of 'code points' as defined, I guess they
are; but they are not only unlike every other code point in that they "do
not directly represent characters", they are explicitly and inexorably
disqualified from so doing, being reserved for use, in pairs, as UTF-16 code
units. (Which is what Mark said, of course.)

Looked at in this way, surely it makes it clearer that the transcoding of a
surrogate (code point) into UTF-8 is an abomination.

Simplification is all very well, but it can be taken too far, as when
important distinctions are lost.

For what it's worth,

Mike.

***

J M Sykes  Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UKTel: (44) 161 437 5413

***





RE: Unicode Transcriptions

2001-02-16 Thread Thomas Chan

On Fri, 16 Feb 2001, Marco Cimarosti wrote:

> 2) Which Chinese dialect to adopt for transliterating.

Mandarin would be the most likely.


> Notice the particularities of Bopomofo spelling:
> 
> - the sound (spelled "ong" in pinyin) is spelled "u-eng";
> - there is no "y" in "yi";
> - there is no sign to indicate the 1st tone.

[snip]
 
> Also notice that you may have a few typographical problems in producing the
> picture:
> 
> a) In most fonts, the glyph for vowel i is a horizontal line. This is only
> valid for vertical texts: in horizontal writing it should be vertical.
> (Suggestion: you may substitute it with an uppercase I from a sans-serif
> font).

Yes, you are right about this.  I don't know why TUS3.0 p. 278 says "The 
character U+3127 BOPOMOFO LETTER I is usually written as a vertical
stroke when Bopomofo text is set vertically.", which is *wrong*.

 
> b) The glyph for the "combining breve" (3rd tone) is normally designed to
> fit on western lowercase vowels. (Suggestion: if you use a bigger size for
> the combining marks, you might get a correct result).

I've made two .gif files demonstrating Bopomofo typography:

  http://deall.ohio-state.edu/grads/chan.200/misc/biaozhunwanguoma.gif
  http://deall.ohio-state.edu/grads/chan.200/misc/tongyima.gif

Both depict left-to-right Han character text, and each character is
annotated on its right side with top-to-bottom Bopomofo text.

(Alternatively, I could have created versions where the Han character text
runs top-to-bottom, and each character is annotated on its right side with
top-to-bottom Bopomofo text, but I didn't.)

Note the place of the tone diacritics, which is "stacked" even more to the
right than the Bopomofo consonants and vowels.


Thomas Chan
[EMAIL PROTECTED]




RE: Unicode Transcriptions

2001-02-16 Thread Marco Cimarosti

(I am resending this message because the first version contained too many
errors even for my standards:-)

Mark Davis wrote:
> I am still missing Bopomofo,
> [...]
> Also, Ken suggested that the Bopomofo should be a Bopomofo transcription
> of the Chinese for Unicode, not a transliteration from English. Can anyone
> supply that?
> supply that?

Once you accept Ken's suggestion, you have two more decisions to make:

1) Which Chinese name to use (you have two on your page, one of which is in
both simplified and traditional characters);

2) Which Chinese dialect to adopt for transliterating.

Assuming that (1) you want to use the 3-syllable name ("統一碼", which is also
used in http://www.unicode.org/unicode/standard/WhatIsUnicode.html) and that
(2) you want the official Putonghua (Mandarin) pronunciation, here is what
it would be:

Chinese:統一碼
Pinyin: tŏngyīmă
Bopomofo:   ㄊㄨㄥ̆ ㄧ ㄇㄚ̆
Codepoints: 310A 3128 3125 0306 0020 3127 0020 3107 311A 0306
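
The code point list can be turned back into text and checked against the character names. This is a quick Python sketch using the standard `unicodedata` module; `CODEPOINTS` is simply the list above copied into a string.

```python
import unicodedata

# Hexadecimal code points copied from the "Codepoints" line above.
CODEPOINTS = "310A 3128 3125 0306 0020 3127 0020 3107 311A 0306"

text = "".join(chr(int(cp, 16)) for cp in CODEPOINTS.split())

# List each code point with its character name.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The third-tone mark in this sequence is U+0306 COMBINING BREVE, which is the mark that typographical note (b) is concerned with.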

Notice the particularities of Bopomofo spelling:

- the sound [uŋ] (spelled "ong" in pinyin) is spelled "u-eng";
- there is no "y" in "yi";
- there is no sign to indicate the 1st tone.

Also notice that you may have a few typographical problems in producing the
picture:

a) In most fonts, the glyph for vowel i is a horizontal line. This is only
valid for vertical texts: in horizontal writing it should be vertical.
(Suggestion: you may substitute it with an uppercase I from a sans-serif
font).

b) The glyph for the "combining breve" (3rd tone) is normally designed to
fit on western lowercase vowels. (Suggestion: if you use a bigger size for
the combining marks, you might get a correct result).

Ciao.
Marco



re: Unicode Transcriptions

2001-02-16 Thread Marco Cimarosti

Hi Mark.

You wrote:
> I am still missing Bopomofo,
> [...]
> Also, Ken suggested that the Bopomofo should be a Bopomofo transcription
> of the Chinese for Unicode, not a transliteration from English. Can anyone
> supply that?
> supply that?

Once you accept Ken's suggestion, you have two more decisions to make:

1) Which Chinese name to use (you have two on your page);

2) Which Chinese dialect to adopt for transliterating.

Assuming that (1) you want to use the 3-syllable name ("統一碼", which is also
used in ) and that (2) you want the official Putonghua (Mandarin)
pronunciation, here is what it would be:

Chinese:統一碼
Pinyin: tŏngyīmă
Bopomofo:   ㄊㄨㄥ̆ ㄧ ㄇㄚ̆
Unicodes:   310A 3128 3125 0306 0020 3127 0020 3107 311A 0306

Notice the particularities of Bopomofo spelling:

- the sound [uŋ] ("ong" in pinyin) is spelled "u-eng";
- there is no "y" in "yi";
- there is no sign to indicate the 1st tone.

Also notice that you may have a few typographical problems in producing the
picture:

a) In most fonts, the glyph for vowel i is a horizontal line. This is only
valid for vertical texts: in horizontal spelling it should be vertical.
(Suggestion: you may substitute it with an uppercase I from a sans-serif
font).

b) The glyph for the "combining breve" (3rd tone) is normally designed to
fit on western lowercase vowels. (Suggestion: if you use a bigger size for
the combining marks, you might get a good result).

Ciao.
Marco



Re: Surrogate space in Unicode

2001-02-16 Thread Tom Lord


Because of the widespread belief that Unicode stops at U+FFFF,
many fonts and applications that claim to support Unicode can
only handle basic characters, not supplementary characters.

Right.  (Is it really a widespread belief?  That's something I've
been wondering.)

So using the plain english term "basic" to describe that subset
of Unicode is misleading.

I agree with you that the language in the standard needs updating.

-t