Re: statistics
On 10/11/2010 9:49 PM, Janusz S. Bień wrote:
> On Mon, 11 Oct 2010 announceme...@unicode.org wrote:
>> The newly finalized Unicode Version 6.0 adds 2,088 characters,
>
> What is the current total? Is other statistical information available somewhere?

The announcement gives a link to click through. There you will find more statistics.

A./

> Best regards
> JSB
Re: statistics
On Mon, 11 Oct 2010 Asmus Freytag asm...@ix.netcom.com wrote:
> On 10/11/2010 9:49 PM, Janusz S. Bień wrote:
>> On Mon, 11 Oct 2010 announceme...@unicode.org wrote:
>>> The newly finalized Unicode Version 6.0 adds 2,088 characters,
>>
>> What is the current total? Is other statistical information available somewhere?
>
> The announcement gives a link to click through. There you will find more statistics.

I guess you mean the Character Assignment Overview at http://www.unicode.org/versions/Unicode6.0.0/

However, it does not provide a precise answer to my primary question, which is not purely arithmetic but depends on the definition of a character. In particular, do noncharacters count as characters?

Regards
JSB

--
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: statistics
2010/10/12 Janusz S. Bień jsb...@mimuw.edu.pl:
>> The newly finalized Unicode Version 6.0 adds 2,088 characters,
>
> What is the current total? Is other statistical information available somewhere?
>
> However, it does not provide a precise answer to my primary question, which is not purely arithmetic but depends on the definition of a character. In particular, do noncharacters count as characters?

The Wikipedia article on Unicode gives the current total, and explains what the various categories of characters are:
http://en.wikipedia.org/wiki/Unicode

I give a detailed breakdown of character statistics by Unicode version (from 1.0.0 to 6.0) at:
http://babelstone.blogspot.com/2005/11/how-many-unicode-characters-are-there.html

Andrew
FW: statistics
FW to Unicode ml

From: ernestvandenbooga...@hotmail.com
To: jsb...@mimuw.edu.pl
Subject: RE: statistics
Date: Tue, 12 Oct 2010 10:13:17 +0200

In 5.2, Section 2.4, Table 2-3 lists which General Categories count as characters. Excluded are surrogates, private use, noncharacters and reserved code points. Note that format characters (Cf) are included as characters. The code points with formatting aspects in C0 and C1 are controls (Cc), so they are excluded.

The total number of characters in 6.0 is 109,242 (graphic) + 142 (format) = 109,384.

Regards,
Ernest van den Boogaard

From: jsb...@mimuw.edu.pl
To: asm...@ix.netcom.com
CC: unicode@unicode.org
Subject: Re: statistics
Date: Tue, 12 Oct 2010 09:14:21 +0200

On Mon, 11 Oct 2010 Asmus Freytag asm...@ix.netcom.com wrote:
> On 10/11/2010 9:49 PM, Janusz S. Bień wrote:
>> On Mon, 11 Oct 2010 announceme...@unicode.org wrote:
>>> The newly finalized Unicode Version 6.0 adds 2,088 characters,
>>
>> What is the current total? Is other statistical information available somewhere?
>
> The announcement gives a link to click through. There you will find more statistics.

I guess you mean the Character Assignment Overview at http://www.unicode.org/versions/Unicode6.0.0/

However, it does not provide a precise answer to my primary question, which is not purely arithmetic but depends on the definition of a character. In particular, do noncharacters count as characters?

Regards
JSB

--
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
Re: statistics
Ernest van den Boogaard wrote:
> In 5.2, Section 2.4, Table 2-3 lists which General Categories count as characters. Excluded are surrogates, private use, noncharacters and reserved code points. Note that format characters (Cf) are included as characters. The code points with formatting aspects in C0 and C1 are controls (Cc), so they are excluded.

I don't understand why any control characters would be excluded from a count of characters.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
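The two counting conventions in this exchange differ only in whether the control codes (Cc) are counted. As a rough sketch, the tally can be reproduced with Python's standard `unicodedata` module. The function names below are made up for illustration, and the figures printed reflect whatever Unicode version the running Python ships with, so they will not match the 6.0 numbers quoted above unless that happens to be the bundled version.

```python
import sys
import unicodedata
from collections import Counter


def tally_general_categories():
    """Count code points per General Category in this Python's Unicode data."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        counts[unicodedata.category(chr(cp))] += 1
    return counts


def character_totals(counts):
    """Return (total without controls, total with controls).

    Surrogates (Cs), private use (Co) and unassigned code points including
    noncharacters (Cn) are left out under both conventions.
    """
    excluded = {"Cs", "Co", "Cn"}
    with_controls = sum(n for cat, n in counts.items() if cat not in excluded)
    without_controls = with_controls - counts["Cc"]
    return without_controls, with_controls


counts = tally_general_categories()
narrow, broad = character_totals(counts)
print("Unicode data version:", unicodedata.unidata_version)
print("graphic + format:", narrow)
print("graphic + format + control:", broad)
```

Whichever convention is used, the difference between the two totals is the fixed set of 65 C0/C1 control codes.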
statistics (was: Unicode Version 6.0: Support for Popular Symbols in Asia)
On Mon, 11 Oct 2010 announceme...@unicode.org wrote:
> The newly finalized Unicode Version 6.0 adds 2,088 characters,

What is the current total? Is other statistical information available somewhere?

Best regards
JSB

--
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
OFF-TOPIC character set usage statistics ???
I seem to remember that someone recently posted a link to some statistics on character set usage, but I can't seem to find it in my old messages. Can anyone help? John. -- -- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/ -- Translate your technical documents and web pages- http://www.tradoc.fr/
RE: Some Char. to Glyph Statistics, Pan/Single Font
So does my Rurouni Kensin album go under R or under ru? Maybe ru is better because few words start with ru.

★じゅういっちゃん★
"AIS TSXQ QDOO TD AISC TDQMIG, HYCTDL, ZIC HIIUPLB XSHM GDOPHPISX CYTDL."
"QMD XDHCDQ, AIS XDD, PX QMDCD'X LI CDHPWD. P VSXQ WSQ RMYQ P MYED KA TA YCT PL."
Some Char. to Glyph Statistics, Pan/Single Font
The problem with your glyph statistics is that they are based on mould counts employed by the Monotype hot metal typesetters. The Monotype system was capable of extensive kerning, and therefore many glyphs were constructed from the elements provided by the moulds at the time of composition. The Monotype list of elements therefore comprises:

1. Full characters which are either basic or could not be composed satisfactorily by the system for whatever reason. These might properly be described as glyphs.
2. Elements which were combined either with the first set, or with one another, to create glyphs, or approximations to glyphs, at the time of casting. These cannot really be considered to be glyphs, as such.

However, if one allows that these elements are glyphs, the real number of glyphs employed by Monotype was limited by the matrix case: before 1962 to 225 sorts, and subsequently to 272 sorts. Although additional sorts might be available, they could only be used by substitution with another sort prior to any actual typesetting. More recent Monotype code pages for Bengali seem to be around 450 elements, which are combined with floating elements to create text.

To date all Indic script composition has been pretty much limited by technology. Taking Bengali as an example, Figgins, around 1826, employed 370 sorts, many of which are kerning versions of other sorts, allowing the composition either of consonant-vowel combinations or approximations to complex conjuncts which were insufficiently common to warrant the creation of separate punches. But again, a number of his sorts exist only to allow the incorporation of combinations which could not be produced by the technology of the time.

Our recent revision of the Linotype Bengali code page extends to a font of some 980 elements. 136 of these are differently spaced floating elements, such as vowel signs and chandrabindus, which have no meaning separate from the main characters to which they may be attached, and which would be omitted from an OpenType version. It also includes 146 characters which duplicate the Unicode encoded Bengali characters, which are required for current technological reasons: Microsoft's Office XP does not allow the display of Unicode encoded Bengali characters in the font, or at the size, which is expected. So the "real" number of elements is 698. (I may also add that we have had to produce alternative versions of the same fonts in which non-spacing elements actually space quite considerably, because of the very strange behaviour of Microsoft's Internet Explorer 5.5, so the glyph count is larger than the 980, another case of technology determining counts.)

Turning to Devanagari, our researches indicate that the total number of script units (in Unicode terms, combinations of consonants, halants, vowel signs and other signs), excluding the Unicode characters in the range 0951 to 0954, in use is around the 5550 mark. It is actually greater than this, since there are a number of characters relating to Sanskrit sandhi for which we do not have any conjunct-vowel statistics. In principle, all these should be regarded as glyphs, though few fonts are likely to implement them all (the slaves in this context needing to be human beings, since the issue of the spacing and modification of a smaller number of base elements to produce all these glyphs is an aesthetic rather than a mechanical problem). I have also not included in the count the many variant forms of glyphs which occur as a result of differences in formulation for particular combinations.
(I have also excluded the rather large number of glyphs which are to be found in the Mangal font supplied by Microsoft, but which seem to be there purely because of a rather strange and literal interpretation of the Unicode Devanagari shaping rules, on the grounds that these glyphs exist only in the font, and would never be used in text.)
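For anyone checking the Bengali figures in this message, the "real" element count is just the font total less the two groups that exist for technological reasons. A minimal sketch of that arithmetic follows; the variable names are illustrative only, and the numbers are the ones quoted above.

```python
# Figures quoted in the message for the revised Linotype Bengali code page.
total_elements = 980        # everything in the font
floating_variants = 136     # differently spaced floating elements
unicode_duplicates = 146    # duplicates of the Unicode-encoded Bengali characters

# Elements left once the technology-driven extras are set aside.
real_elements = total_elements - floating_variants - unicode_duplicates
print(real_elements)
```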
RE: Some Char. to Glyph Statistics, Pan/Single Font
Hi. Well, it can be said to be above the minimum :-) depending on how you look at things. If you're a developer of embedded device with a really stringent requirement in memory footprint (for font and others), you may just go with 1:1 ratios for all three groups of Jamos (consonants and vowels) as found in old (mechanical) Hangul typewriters. However, as you can guess, the result is not pleasing to most eyes. Of course. If the requirements are even more stringent (e.g., the user is blind) you can even represent the letters with a 2x3 matrix of pixels. Similarly, when I was a child, the first companies that started using electronic brains to bill customers sent notes printed in all capital letters and with no apostrophes. The minimal model that I have in mind is slightly less minimal: the least quality that won't sacrifice the normal orthographic rules of a language. Ciao. Marco
RE: Some Char. to Glyph Statistics, Pan/Single Font
Mike Meir wrote:
> The problem with your glyph statistics is that they are based on mould counts employed by the Monotype hot metal typesetters.

I agree: no one will ever come up with *the* correct count. Such general evaluations simply depend on too many things to be useful. E.g.: which language(s) are targeted, what degree of typographic excellence is required, and (as Mike explained very well) the kind of technology involved and its limitations.

The simple fact that software fonts can overlay glyphs can cause a great factor of reduction, compared to lead type. Similarly, the fact that a software font technology has the capability of kerning glyphs vertically can dramatically reduce the inventory of glyphs needed for certain scripts.

Moreover, different technologies may have totally different meanings for the word glyph. E.g., I have heard of Arabic fonts that analyze the Arabic script well under the level of a grapheme: segments of lines and individual dots were stored separately and assembled at display time. Comparing the number of glyphs in such a font with the inventory of a more traditional font is what we call summing apples and pears.

> Turning to Devanagari, our researches indicate that the total number of script units (In Unicode terms, combinations of consonants, halants, vowel signs and other signs), excluding the Unicode characters in the range 0951 to 0954, in use is around the 5550 mark. It is actually greater than this, since there are a number of characters relating to Sanskrit sandhi for which we do not have any conjunct-vowel statistics.

As an opposite example for Devanagari, I did a little research on my own on a minimal rendering scheme for Unicode Indic scripts. The scenario behind this evaluation was low-resolution displays or printers and simple bitmapped fonts. For Devanagari's 77 characters (non-decomposable L and M characters) my set of glyphs was just 82 pieces.
Of course, such a ratio (about 1:1.06) requires dropping any typographical gracefulness: of all the complexity of Devanagari, just a handful of half-consonants and ligatures was preserved. Neither your 5550 nor my 82 are of much use to anyone who has even slightly different requirements. However, the contrast between these two figures perhaps says something about the difficulty of such a count. _ Marco
RE: Some Char. to Glyph Statistics, Pan/Single Font
Jungshik Shin wrote: I think I know how you counted (initial consonants: two for syllables with and without final consonants, three for three kinds of vowel position/shape, vowels: two for syll. with/without final consonants) and think you got it right. You caught me with hands in jam: that was exactly my way of thinking. While I see that this is clearly too naive to be right, I would not be able to improve it any further myself. I welcome any refinement. Especially, I was curious about the other ratios (DOS 1:8,1:4,1:4; X11win 1:10,1:3,1:4; TrueType 1:~30) that you mentioned on your previous message. _ Marco
RE: Some Char. to Glyph Statistics, Pan/Single Font
Thursday, May 31, 2001

My goal was never to give a specific number of glyphs needed to display a particular Indian or other script. As others have pointed out, this depends, among other things, on the particular display device and its font processing software, possibly including the operating system. My goals were to point out that Arabic and South and Southeast Asian scripts require:

1. Many more glyphs than character codes and,
2. As important, software to render character codes legibly from the available glyphs.

Discussions of a single Unicode font that do not mention such software seem pointless, or worse, managers might believe them.

I wonder if we could usefully define levels of legibility for displaying a language or writing system, or is it too subjective? Is evoking a lam alef ligature when alef follows a lam the minimal level for any language using Arabic script? For languages using Devanagari script, is transposing the short i matra (U+093F) to precede the consonant(s) it follows the minimum?

Regards,
Jim Agenbroad (disclaimer and address at bottom)

On Thu, 31 May 2001, Marco Cimarosti wrote:
> Mike Meir wrote:
>> The problem with your glyph statistics is that they are based on mould counts employed by the Monotype hot metal typesetters.
>
> I agree: no one will ever come up with *the* correct count. Such general evaluations simply depend on too many things to be useful. E.g.: which language(s) are targeted, what degree of typographic excellence is required, and (as Mike explained very well) the kind of technology involved and its limitations.
>
> The simple fact that software fonts can overlay glyphs can cause a great factor of reduction, compared to lead type. Similarly, the fact that a software font technology has the capability of kerning glyphs vertically can dramatically reduce the inventory of glyphs needed for certain scripts.
>
> Moreover, different technologies may have totally different meanings for the word glyph. E.g., I have heard of Arabic fonts that analyze the Arabic script well under the level of a grapheme: segments of lines and individual dots were stored separately and assembled at display time. Comparing the number of glyphs in such a font with the inventory of a more traditional font is what we call summing apples and pears.
>
>> Turning to Devanagari, our researches indicate that the total number of script units (In Unicode terms, combinations of consonants, halants, vowel signs and other signs), excluding the Unicode characters in the range 0951 to 0954, in use is around the 5550 mark. It is actually greater than this, since there are a number of characters relating to Sanskrit sandhi for which we do not have any conjunct-vowel statistics.
>
> As an opposite example for Devanagari, I did a little research on my own on a minimal rendering scheme for Unicode Indic scripts. The scenario behind this evaluation was low-resolution displays or printers and simple bitmapped fonts. For Devanagari's 77 characters (non-decomposable L and M characters) my set of glyphs was just 82 pieces. Of course, such a ratio (about 1:1.06) requires dropping any typographical gracefulness: of all the complexity of Devanagari, just a handful of half-consonants and ligatures was preserved.
>
> Neither your 5550 nor my 82 are of much use to anyone who has even slightly different requirements. However, the contrast between these two figures perhaps says something about the difficulty of such a count.
>
> _ Marco

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )
The above are purely personal opinions, not necessarily the official views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
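To make the question about the short i matra concrete, here is a rough sketch of the transposition Jim describes: U+093F is stored after the consonant(s) it follows but displayed before them. This is only an illustration under simplifying assumptions. The function names are hypothetical, a cluster is taken to be a consonant optionally preceded by consonant + virama pairs, nukta and other signs are ignored, and real rendering engines reorder glyphs rather than characters.

```python
I_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA (halant)


def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def visual_order(text):
    """Move each short i matra in front of the consonant cluster it follows."""
    out = []
    for ch in text:
        if ch != I_MATRA:
            out.append(ch)
            continue
        # Walk back over the cluster: consonant (virama consonant)*
        i = len(out)
        if i > 0 and is_consonant(out[i - 1]):
            i -= 1
            while i >= 2 and out[i - 1] == VIRAMA and is_consonant(out[i - 2]):
                i -= 2
        out.insert(i, ch)
    return "".join(out)


# KA + I: the matra moves in front of the consonant.
print([hex(ord(c)) for c in visual_order("\u0915\u093F")])
```

Even this minimal level needs some code between the character string and the font, which is the point being made about rendering software.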
RE: Some Char. to Glyph Statistics, Pan/Single Font
At 5:35 PM +0200 5/31/01, Marco Cimarosti wrote: Jungshik Shin wrote: I think I know how you counted (initial consonants: two for syllables with and without final consonants, three for three kinds of vowel position/shape, vowels: two for syll. with/without final consonants) and think you got it right. You caught me with hands in jam: that was exactly my way of thinking. While I see that this is clearly too naive to be right, I would not be able to improve it any further myself. I welcome any refinement. Especially, I was curious about the other ratios (DOS 1:8,1:4,1:4; X11win 1:10,1:3,1:4; TrueType 1:~30) that you mentioned on your previous message. _ Marco A quick look at the Hangul syllable table starting on page 744 of TOS3 shows a much greater variation. If you look at the pages slightly cross-eyed so that each glyph aligns with a neighbor, and wink each eye alternately, you can get the effect of a blink comparator of the type used in astronomy before computer image processing became practical. If you can't keep the alignment while winking, just look for the fuzzy letters where the glyphs don't match up. Or we could ask a typographer. :-) -- Edward Cherlin Generalist A knot! exclaimed Alice. Oh, do let me help to undo it. Alice in Wonderland
RE: Some Char. to Glyph Statistics, Pan/Single Font
At 5:12 PM +0200 5/31/01, Marco Cimarosti wrote: Hi. Well, it can be said to be above the minimum :-) depending on how you look at things. If you're a developer of embedded device with a really stringent requirement in memory footprint (for font and others), you may just go with 1:1 ratios for all three groups of Jamos (consonants and vowels) as found in old (mechanical) Hangul typewriters. However, as you can guess, the result is not pleasing to most eyes. The manual Hangul typewriter I learned on had multiple forms for initial consonants, supplied by means of an extra shift level. (Yes! A mechanical buckybit!! %-[ ) The really minimal level was *linear* Hangul produced by the telegraph system. [snip] The minimal model that I have in mind is slightly less minimal: the least quality that won't sacrifice the normal orthographic rules of a language. Which rules are the normal ones? Every publisher I've had anything to do with has used different sets of rules, over quite a wide range. We can't even agree whether ligatures are required in English, or whether an ASCII-sorted index is sufficiently human-readable. Ciao. Marco -- Edward Cherlin Generalist A knot! exclaimed Alice. Oh, do let me help to undo it. Alice in Wonderland
Some Char. to Glyph Statistics, Pan/Single Font
Wednesday, May 30, 2001

Attached is a note I wrote in September 1993 about the ratio of characters to glyphs in several Indic scripts. Much has changed on the Unicode front since then, but I think the need for rendering software to decide which of many glyphs to use to represent a given sequence of codes is still with us. A similar situation obtains with Arabic--unless one requires the use of Arabic presentation forms. If one excludes the combining characters at U+0300 to U+0362, European scripts tend to have a 1:1 character to glyph ratio; Chinese, Japanese and (maybe) Korean scripts also tend to have a 1:1 character to glyph ratio. But most scripts between Europe and the Far East--Arabic, South and Southeast Asian ones--do not. Unless the rendering software and the fonts are in sync the results will be unsatisfactory. A few postings on the 'single font' discussion have mentioned this but I hope some data may be helpful.

The story goes that back in Ancient Greece (I think) someone was describing Utopia and a listener asked, "But who will do the work?" and the reply was, "Oh, we will have slaves." The computer now can be an effective slave when given explicit instructions, but without consistent instructions the results will not be satisfactory. This may be beyond the scope of Unicode, which aims to unambiguously encode text for the computer (and succeeds) but does not dwell on details of its input or output--rendering it legible for humans to read.

Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )
The above are purely personal opinions, not necessarily the official views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.

-- Forwarded message --
Date: Fri, 10 Sep 93 14:12:07 -0400
From: jage (James E. Agenbroad)
To: [EMAIL PROTECTED]
Cc: jage@seq1
Subject: Some Character to Glyph Statistics

Friday, September 10, 1993

Glenn,

Recent Internet discussions about fonts for ISO10646/Unicode prompted me to do some counting. The data are suggestive rather than definitive, at least in part because the counts of glyphs are based on only a single source and it may not be up to date. They do suggest that for various writing systems of South (and maybe Southeast) Asia based on Indic scripts the ratio of coded characters to glyphs is not 1:1 but 1:2 or even 1:3. I'm sure this is no surprise to you but the Internet discussions make no mention of it so I thought I would. When a writing system has more glyphs than characters I think there must be software to decide when which glyph is wanted. (This software may also need to know something about the target device but that's not an issue I can shed any light on.)

As a preliminary assessment I have counted the number of character codes ISO 10646 assigns for several writing systems and the number of glyphs from synopses of the same writing systems as found in Specimen book of 'Monotype' non-latin faces, issued loose-leaf by Monotype Corporation. I give the number and date of each sheet. In counting I have omitted western style punctuation and numerals.

Writing System   Sheet no., date    10646 chars   Mono. glyphs   Rough ratio
Bengali          470, 5/65                   89            331   1:3
Burmese          558, 5/64                   76            213   1:3
Devanagari       155, 8/75                  104            248   1:2.5
Gujarati         460, 7/71                   75            232   1:3
Gurmukhi         601, 9/74                   74            146   1:2
Kannada          588, 9/69                   80            236   1:3
Malayalam        590, 7/75                   78            590   1:7
Oriya            706, 3/70                   78            371   1:4
Sinhalese        557, 1/64                   90            348   1:3.5
Tamil            280, 1/64                   61            171   1:3
Telugu           626, 3/71                   80            312   1:4
Thai             577, 4/74                   92            208   1:2
Tibetan          (Van Osterman)              80            158   1:2

For Sinhalese and Tibetan (not in 10646 yet) the count is from Unicode Technical Report no. 2. For Devanagari and Gurmukhi the book has a note: "A special mould is required for these matrices." The relation of these fonts to current systems is unclear. As noted, my Monotype book does not include Tibetan; the glyphs are from George Van Ostermann's Manual of foreign languages, 4th ed., 1952--I counted the letters, ligatures, numerals, vowel signs and punctuation.

I would also like to express my agreement with the man from New South Wales who said libraries will need to display lots of different characters. I do not know if this means one large font or many, so long as they are all available when needed to display a string of character codes--without the recipient knowing what will be needed and taking extra measures to load the proper font. The fonts for such purposes would not need to have extremely
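The "rough ratio" column in the 1993 note can be recomputed from the two count columns. The sketch below does so in Python. It assumes the reading of the flattened table in which fused cells such as "5/6589" split into a sheet date (5/65) and a character count (89); the dictionary and function names are illustrative only.

```python
# (ISO 10646 characters, Monotype glyphs) per writing system, as read from
# the table in the note. The split of fused cells is an assumption.
COUNTS = {
    "Bengali": (89, 331),
    "Burmese": (76, 213),
    "Devanagari": (104, 248),
    "Gujarati": (75, 232),
    "Gurmukhi": (74, 146),
    "Kannada": (80, 236),
    "Malayalam": (78, 590),
    "Oriya": (78, 371),
    "Sinhalese": (90, 348),
    "Tamil": (61, 171),
    "Telugu": (80, 312),
    "Thai": (92, 208),
    "Tibetan": (80, 158),
}


def glyphs_per_character(chars, glyphs):
    """Glyph count divided by character count (the n in a 1:n ratio)."""
    return glyphs / chars


for name, (chars, glyphs) in COUNTS.items():
    print(f"{name:<11} 1:{glyphs_per_character(chars, glyphs):.2f}")
```

The computed values are somewhat finer-grained than the rounded ratios in the note, which were explicitly offered as rough.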
Re: Some Char. to Glyph Statistics, Pan/Single Font
You may be interested in "Creating and supporting OpenType fonts for Indic scripts" and "Creating and supporting OpenType fonts for Arabic scripts", both available at http://www.microsoft.com/typography/tt/tt.htm.

To give a little bit of context, the OpenType architecture separates shaping into two parts: the part that is script-dependent but font-independent (embodied on Windows in the Uniscribe engine), and the part that is font-dependent (embodied in GSUB/GPOS/GDEF/BASE tables in fonts). The GPOS/GSUB tables are best conceived as shaping subprograms stored in the font, and those subprograms are called by Uniscribe. The documents above describe the API between the shaping engine and the fonts.

I am not aware of similar material for AAT fonts, but that's another place to look.

Also, Latin cursive fonts tend to have a large number of glyphs: ligatures to simulate the connectivity of the individual letters, and variants to simulate the randomness of handwriting. This is not much different from Arabic fonts, not surprisingly.

Eric.
Re: Some Char. to Glyph Statistics, Pan/Single Font
On Wed, 30 May 2001, James E. Agenbroad wrote:

Thank you for the interesting piece of information.

> Wednesday, May 30, 2001
> Attached is a note I wrote in September 1993 about the ratio of characters to glyphs in several Indic scripts. Much has changed on the Unicode front since then, but I think the need for rendering software to decide
> character to glyph ratio; Chinese, Japanese and (maybe) Korean scripts also tend to have a 1:1 character to glyph ratio. But most scripts

In the case of Korean Hangul, your 'maybe' can be justified because the situation is not so simple. If you only consider the pre-composed syllable block beginning at U+AC00 and have fonts with pre-composed glyphs for all of those syllables, it could be 1:1. However, if you turn your eyes to the U+1100 Hangul Consonant/Vowel block and want to have full-fledged support of medieval Korean, the ratio can be anybody's guess from 1:1 (poor quality, unconventional shape) to 1:n to m to n (where n can be a few tens if not more). In the 1980s, typical MS-DOS based programs (or Hangul rendering libraries/engines) used something like 1:8, 1:4, 1:4 for initial consonants, medial vowels, and final consonants, respectively. A Korean variant of xterm (a terminal emulator for the X11 window system) has been using fonts with a 1:10, 1:3, 1:4 ratio. Some high quality TrueType fonts for Hangul these days (internally) have 1:n (n ~ 30), I believe.

> -- Forwarded message --
> Date: Fri, 10 Sep 93 14:12:07 -0400
> From: jage (James E. Agenbroad)
> Subject: Some Character to Glyph Statistics
>
> Recent Internet discussions about fonts for ISO10646/Unicode prompted me to do some counting. The data are suggestive rather than definitive, at least in part because the counts of glyphs are based on only a single source and it may not be up to date. They do suggest that for various writing systems of South (and maybe Southeast) Asia based on Indic scripts the ratio of coded characters to glyphs is not 1:1 but 1:2 or even 1:3.

I thought (without any basis or hard data; that is, it was just my wild guess) the ratio would be much higher than 1:3 for Indic scripts. With the ratio being only 1:3 or so, I guess Indic scripts are in much better shape to be supported than medieval (and some elements of modern) Korean. Projects like Pango (http://www.pango.org) have already begun to support Indic and Thai scripts, not to mention other commercial and non-commercial implementations (Uniscribe, AAT, Graphite, ...). Therefore, the eight years since your original message haven't been wasted, I think :-)

Jungshik Shin
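To put the per-jamo ratios in perspective, here is a back-of-the-envelope sketch of how many component glyphs each scheme implies, compared with one precomposed glyph per syllable. It assumes the modern jamo inventory only (19 initial consonants, 21 medial vowels, 27 final consonants), which is an assumption added here and not something stated in the message; the medieval jamo in the U+1100 block would raise every figure. The names are illustrative.

```python
# Modern Hangul jamo inventory (assumption: medieval letters are not counted).
INITIALS, MEDIALS, FINALS = 19, 21, 27


def component_glyphs(initial_forms, medial_forms, final_forms):
    """Glyphs needed when each jamo has the given number of positional forms."""
    return (INITIALS * initial_forms
            + MEDIALS * medial_forms
            + FINALS * final_forms)


# One glyph per precomposed syllable; a syllable may have no final consonant.
precomposed = INITIALS * MEDIALS * (FINALS + 1)

schemes = {
    "1:1, 1:1, 1:1 (typewriter-like)": component_glyphs(1, 1, 1),
    "1:8, 1:4, 1:4 (MS-DOS era)": component_glyphs(8, 4, 4),
    "1:10, 1:3, 1:4 (xterm variant)": component_glyphs(10, 3, 4),
}
for label, n in schemes.items():
    print(f"{label:<34} {n:>5} glyphs")
print(f"{'precomposed syllables':<34} {precomposed:>5} glyphs")
```

Under these assumptions the component schemes stay in the low hundreds of glyphs, against 11,172 for a fully precomposed font.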
Unicode character encoding statistics
BTW, if anyone was wondering where I came up with the figure 880,325 reserved unassigned code points for Unicode 3.1, here are the complete statistics for Unicode 3.0 and Unicode 3.1:

Unicode:                    U 3.0     U 3.1

BMP Alphas/Symbols          10236     10238
Suppl Alphas/Symbols                   1691
Han (URO)                   20902     20902
Han (Ext A)                  6582      6582
Han (Ext B)                           42711
Han Compat                    302       302
Suppl Han Compat                        542
Hangul Syllables            11172     11172
Subtotal                    49194     94140

BMP Private Use              6400      6400
Suppl Private Use          131068    131068
Surrogate Code Points        2048      2048
Controls                       65        65
BMP Noncharacters               2        34
Suppl Noncharacters            32        32
BMP Reserved                 7827      7793
Suppl Reserved             917476    872532

The total number of code points accounted for here is 1,114,112 (= 17 x 64K), i.e. U+0000..U+10FFFF.

--Ken
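As a cross-check on Ken's table, the columns can be summed mechanically. The sketch below uses the figures exactly as listed, with 0 standing in for the blank Unicode 3.0 cells; the dictionary and function names are illustrative only.

```python
# Each row: (Unicode 3.0, Unicode 3.1); 0 stands for a blank cell.
ASSIGNED = {
    "BMP Alphas/Symbols": (10236, 10238),
    "Suppl Alphas/Symbols": (0, 1691),
    "Han (URO)": (20902, 20902),
    "Han (Ext A)": (6582, 6582),
    "Han (Ext B)": (0, 42711),
    "Han Compat": (302, 302),
    "Suppl Han Compat": (0, 542),
    "Hangul Syllables": (11172, 11172),
}
OTHER = {
    "BMP Private Use": (6400, 6400),
    "Suppl Private Use": (131068, 131068),
    "Surrogate Code Points": (2048, 2048),
    "Controls": (65, 65),
    "BMP Noncharacters": (2, 34),
    "Suppl Noncharacters": (32, 32),
    "BMP Reserved": (7827, 7793),
    "Suppl Reserved": (917476, 872532),
}


def column_total(rows, col):
    """Sum one version column (0 = Unicode 3.0, 1 = Unicode 3.1)."""
    return sum(values[col] for values in rows.values())


for col, version in enumerate(("3.0", "3.1")):
    subtotal = column_total(ASSIGNED, col)
    total = subtotal + column_total(OTHER, col)
    reserved = OTHER["BMP Reserved"][col] + OTHER["Suppl Reserved"][col]
    print(f"Unicode {version}: subtotal {subtotal}, reserved {reserved}, total {total}")
```

Both columns come to 17 x 65,536 code points, and the two reserved rows for 3.1 add up to the 880,325 figure mentioned at the top of the message.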