Re: statistics

2010-10-12 Thread Asmus Freytag

 On 10/11/2010 9:49 PM, Janusz S. Bień wrote:

On Mon, 11 Oct 2010  announceme...@unicode.org wrote:


  The newly finalized Unicode Version 6.0 adds 2,088 characters,

What is the current total? Are other statistic informations available
somewhere?

The announcement gives a link to click through.

There you will find more statistics.

A./

Best regards

JSB






Re: statistics

2010-10-12 Thread Janusz S. Bień
On Mon, 11 Oct 2010  Asmus Freytag asm...@ix.netcom.com wrote:

   On 10/11/2010 9:49 PM, Janusz S. Bień wrote:
 On Mon, 11 Oct 2010  announceme...@unicode.org wrote:

   The newly finalized Unicode Version 6.0 adds 2,088 characters,
 What is the current total? Are other statistic informations available
 somewhere?
 The announcement gives a link to click through.

 There you will find more statistics.

I guess you mean Character Assignment Overview at

  http://www.unicode.org/versions/Unicode6.0.0/

However it does not provide the precise answer to my primary question,
which is not purely arithmetic but depends on the definition of the
character. In particular, do noncharacters belong to characters?

Regards

JSB

-- 
 ,   
dr hab. Janusz S. Bien, prof. UW -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/




Re: statistics

2010-10-12 Thread Andrew West
2010/10/12 Janusz S. Bień jsb...@mimuw.edu.pl:

   The newly finalized Unicode Version 6.0 adds 2,088 characters,
 What is the current total? Are other statistic informations available
 somewhere?

 However it does not provide the precise answer to my primary question,
 which is not purely arithmetic but depends on the definition of the
 character. In particular, do noncharacters belong to characters?

The Wikipedia article on Unicode gives the current total, and explains
what the various categories of characters are:

http://en.wikipedia.org/wiki/Unicode

I give a detailed break down of character statistics by Unicode
version (from 1.0.0 to 6.0) at:

http://babelstone.blogspot.com/2005/11/how-many-unicode-characters-are-there.html

Andrew




FW: statistics

2010-10-12 Thread Ernest van den Boogaard

FW to Unicode ml

From: ernestvandenbooga...@hotmail.com
To: jsb...@mimuw.edu.pl
Subject: RE: statistics
Date: Tue, 12 Oct 2010 10:13:17 +0200








In 5.2, Chapter 2.4 table 2-3 is listed which General Categories are 
characters. Out are: Surrogates, Private Use, Non-characters and Reserved 
code points. Note that Format characters (Cf) are included as characters. The 
code points with formatting aspects in C0 and C1 are Controls (Cc), so 
excluded.

Total number of characters in 6.0 is 109,242+142=109,384.
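
The definition above can be checked mechanically. What follows is a minimal sketch, assuming Python's standard unicodedata module. The tally reflects whichever Unicode version the interpreter bundles (reported by unicodedata.unidata_version), which is not necessarily 6.0, and Python reports noncharacters under Cn together with reserved code points.

```python
import unicodedata
from collections import Counter

def count_by_category():
    """Tally every code point U+0000..U+10FFFF by its General Category."""
    return Counter(unicodedata.category(chr(cp)) for cp in range(0x110000))

counts = count_by_category()

# Excluded under the definition quoted above: controls (Cc), surrogates (Cs),
# private use (Co), and unassigned code points, noncharacters included (Cn).
EXCLUDED = {"Cc", "Cs", "Co", "Cn"}
characters = sum(n for cat, n in counts.items() if cat not in EXCLUDED)

print("Unicode data version:", unicodedata.unidata_version)
print("graphic + format characters:", characters)
print("controls (Cc):", counts["Cc"])
```

Moving "Cc" out of EXCLUDED gives the count that includes control characters.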

Regards,
Ernest van den Boogaard


Re: statistics

2010-10-12 Thread Doug Ewell

Ernest van den Boogaard wrote:

In 5.2, Chapter 2.4 table 2-3 is listed which General Categories are 
characters. Out are: Surrogates, Private Use, Non-characters and 
Reserved code points. Note that Format characters (Cf) are included as 
characters. The code points with formatting aspects in C0 and C1 are 
Controls (Cc), so excluded.


I don't understand why any control characters would be excluded from a 
count of characters.


--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s




statistics (was: Unicode Version 6.0: Support for Popular Symbols in Asia)

2010-10-11 Thread Janusz S. Bień
On Mon, 11 Oct 2010  announceme...@unicode.org wrote:

  The newly finalized Unicode Version 6.0 adds 2,088 characters, 

What is the current total? Are other statistic informations available
somewhere?

Best regards

JSB

-- 
 ,   
dr hab. Janusz S. Bien, prof. UW -  Uniwersytet Warszawski (Katedra Lingwistyki 
Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
jsb...@uw.edu.pl, jsb...@mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/



OFF-TOPIC character set usage statistics ???

2001-08-01 Thread John Wilcock

I seem to remember that someone recently posted a link to some
statistics on character set usage, but I can't seem to find it in my
old messages. Can anyone help? 

John.

-- 
-- Over 1500 webcams from ski resorts around the world - http://www.snoweye.com/
-- Translate your technical documents and web pages- http://www.tradoc.fr/




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-06-01 Thread てんどうりゅうじ
So does my Rurouni Kensin album go under R or under ru?
Maybe ru is better because few words start with ru.


★じゅういっちゃん★

"AIS TSXQ QDOO TD AISC TDQMIG, HYCTDL,
ZIC HIIUPLB XSHM GDOPHPISX CYTDL."
"QMD XDHCDQ, AIS XDD,
PX QMDCD'X LI CDHPWD.
P VSXQ WSQ RMYQ P MYED KA TA YCT PL."


Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Mike Meir



The problem with your glyph statistics is that they 
are based on mould counts employed by the Monotype hot metal 
typesetters.
The Monotype system was capable of extensive 
kerning, and therefore many glyphs were constructed from the elements provided 
by the moulds at the time of composition. The Monotype list of elements 
therefore comprises:

  1. Full characters which are either basic or could not be composed
     satisfactorily by the system for whatever reason. These might properly
     be described as glyphs.
  2. Elements which were combined either with the first set, or with one
     another, to create glyphs, or approximations to glyphs, at the time of
     casting. These cannot really be considered to be glyphs, as such.

However, if one allows that these elements are 
glyphs, the real number of glyphs employed by Monotype was limited by the matrix 
case: before 1962 to 225 sorts, and subsequently to 272 sorts. Although 
additional sorts might be available, they could only be used by substitution 
with another sort prior to any actual typesetting.

More recent Monotype code pages for Bengali seem to 
be around 450 elements, which are combined with floating elements to create 
text.

To date all Indic script composition has been 
pretty much limited by technology. Taking Bengali as an example, Figgins, around 
1826, employed 370 sorts, many of which are kerning versions of other sorts, 
allowing the composition either of consonant-vowel combinations or 
approximations to complex conjuncts which were insufficiently common to warrant 
the creation of separate punches. But again, a number of his sorts exist only to 
allow the incorporation of combinations which could not be produced by the 
technology of the time.

Our recent revision of the Linotype Bengali code 
page extends to a font of some 980 elements. 136 of these are differently spaced 
floating elements, such as vowel signs and chandrabindus, which have no 
meaning separate from the main characters to which they may be attached, and 
which would be omitted from an OpenType version. It also includes 146 
characters which duplicate the Unicode-encoded Bengali characters, which is 
required for current technological reasons - Microsoft's Office XP does not 
allow the display of Unicode-encoded Bengali characters in the font, or at the 
size, which is expected. So the "real" number of elements is 698. (I may 
also add that we have had to produce alternative versions of the same fonts in 
which non-spacing elements actually space quite considerably, because 
of the very strange behaviour of Microsoft's Internet Explorer 5.5, so the 
glyph count is larger than the 980 - another case of technology determining 
counts.)

Turning to Devanagari, our researches indicate that 
the total number of script units (in Unicode terms, combinations of 
consonants, halants, vowel signs and other signs), excluding the Unicode 
characters in the range 0951 to 0954, in use is around the 5550 mark. It is 
actually greater than this, since there are a number of characters relating to 
Sanskrit sandhi for which we do not have any conjunct-vowel 
statistics.

In principle, all these should be regarded 
as glyphs, though few fonts are likely to implement them all (the 
slaves in this context needing to be human beings, since the issue of the 
spacing and modification of a smaller number of base elements to produce all 
these glyphs is an aesthetic rather than a mechanical problem).

I have also not included in the count the many 
variant forms of glyphs which occur as a result of differences in formulation for 
particular combinations.

(I have also excluded the rather large number of 
glyphs which are to be found in the Mangal font supplied by Microsoft, but which 
seem to be there purely because of a rather strange and literal interpretation 
of the Unicode Devanagari shaping rules, on the grounds that these glyphs exist 
only in the font, and would never be used in text.)







RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Marco Cimarosti

Hi.

Well, it can be said to be above  the minimum :-) depending on
 how you look at things. If you're a developer of embedded 
 device with a
 really stringent requirement in memory footprint (for font 
 and others),
 you may just go with 1:1 ratios for all three groups of Jamos 
 (consonants
 and vowels) as found in old (mechanical) Hangul typewriters. However,
 as you can guess, the result is not pleasing to most eyes.

Of course. If the requirements are even more stringent (e.g., the user is
blind) you can even represent the letters with a 2x3 matrix of pixels.

Similarly, when I was a child, the first companies that started using
electronic brains to bill customers sent notes printed in all capital
letters and with no apostrophes.

The minimal model that I have in mind is slightly less minimal: the least
quality that won't sacrifice the normal orthographic rules of a language.

Ciao.
Marco




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Marco Cimarosti

Mike Meir wrote:
 The problem with your glyph statistics is that they are based 
 on mould counts employed by the Monotype hot metal typesetters.

I agree: no one will ever come up with *the* correct count.

Such general evaluations simply depend on too many things to be useful.
E.g.: which language(s) are targeted, what degree of typographic excellence
is required, and (as Mike explained very well) the kind of technology
involved and its limitations.

The simple fact that software fonts can overlay glyphs can cause a great
factor of reduction,  compared to lead type. Similarly, the fact that a
software font technology has the capability of kerning glyphs vertically can
reduce dramatically the inventory of glyphs needed for certain scripts.

Moreover, different technologies may have totally different meanings for the
word glyph. E.g., I have heard of Arabic fonts that analyze the Arabic
script well under the level of a grapheme: segments of lines and
individual dots were stored separately and assembled at display time.
Comparing the number of glyphs in such a font with the inventory of a
more traditional font is what we call "summing up apples and pears".

 Turning to Devanagari, our researches indicate that the total 
 number of script units (In Unicode terms, combinations of 
 consonants, halants, vowel signs and other signs),  excluding 
 the Unicode characters in the range 0951 to 0954, in use is 
 around the 5550 mark. It is actually greater than this, since 
 there are a number of characters relating to Sanskrit sandhi 
 for which we do not have any conjunct-vowel statistics.

As an opposite example for Devanagari, I did a little research on my own on
a minimal rendering scheme for Unicode Indic scripts. The scenario behind
this evaluation was low-resolution displays or printers and simple bitmapped
fonts.

For Devanagari's 77 characters (non-decomposable L and M characters) my
set of glyphs was just 82 pieces. Of course, such a ratio (about 1:1.06)
requires dropping any typographical gracefulness: of all the complexity of
Devanagari, just a handful of half-consonants and ligatures was preserved.

Neither your 5550 nor my 82 are of much use to anyone who has even
slightly different requirements. However, the contrast between these two
figures perhaps says something about the difficulty of such a count.

_ Marco




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Marco Cimarosti

Jungshik Shin wrote:
   I think I know how you counted (initial consonants:
 two for syllables with and without final consonants, three for three
 kinds of vowel position/shape, vowels: two for syll. 
 with/without final consonants) and think you got it right.

You caught me with hands in jam: that was exactly my way of thinking. While
I see that this is clearly too naive to be right, I would not be able to
improve it any further myself.

I welcome any refinement. Especially, I was curious about the other ratios
(DOS 1:8,1:4,1:4; X11win 1:10,1:3,1:4; TrueType 1:~30) that you mentioned on
your previous message.

_ Marco




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread James E. Agenbroad

   Thursday, May 31, 2001
My goal was never to give a specific number of glyphs needed to display a
particular Indian or other script.  As others have pointed out, this
depends, among other things, on the particular display device and its font
processing software, possibly including the operating system.  My goals
were to point out that Arabic and South and Southeast Asian scripts require:
1. Many more glyphs than character codes and, 2. As important, software to
render character codes legibly from the available glyphs. Discussions of a
single Unicode font that do not mention such software seem pointless, or
worse, managers might believe them.  I wonder if we could usefully define
levels of legibility for displaying a language or writing system, or is it
too subjective?  Is evoking a lam alef ligature when alef follows a lam the
minimal level for any language using Arabic script?  For languages using
Devanagari script is transposing the short i matra (U+093F) to precede the
consonant(s) it follows the minimum?
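
Taking the Devanagari question as an example, the transposition itself is easy to sketch. The code below is an illustration only, not how any particular rendering engine works: it reorders code points rather than glyphs, treats only U+0915..U+0939 as consonants, and ignores the nukta. The function names are hypothetical.

```python
VIRAMA = "\u094D"    # halant
SHORT_I = "\u093F"   # DEVANAGARI VOWEL SIGN I

def is_consonant(ch):
    # Assumption: only the basic consonants KA..HA are considered.
    return "\u0915" <= ch <= "\u0939"

def cluster_start(chars, end):
    """Index where the consonant cluster ending just before `end` begins."""
    # With no consonant directly before the matra, leave it where it is.
    if end == 0 or not is_consonant(chars[end - 1]):
        return end
    i = end - 1
    # Extend left over each "consonant + virama" pair forming a conjunct.
    while i >= 2 and chars[i - 1] == VIRAMA and is_consonant(chars[i - 2]):
        i -= 2
    return i

def to_visual_order(text):
    """Move each short-i matra in front of the consonant cluster it follows."""
    out = []
    for ch in text:
        if ch == SHORT_I:
            out.insert(cluster_start(out, len(out)), ch)
        else:
            out.append(ch)
    return "".join(out)

# KA + short i is stored in that order but displayed with the matra first.
print([f"U+{ord(c):04X}" for c in to_visual_order("\u0915\u093F")])
```

Even this minimal step needs script-specific knowledge (which code points are consonants, what a virama does), which is the point about software being needed alongside the font.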
 Regards,
  Jim Agenbroad (disclaimer and address at bottom)

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Edward Cherlin

At 5:35 PM +0200 5/31/01, Marco Cimarosti wrote:
Jungshik Shin wrote:
I think I know how you counted (initial consonants:
  two for syllables with and without final consonants, three for three
  kinds of vowel position/shape, vowels: two for syll.
  with/without final consonants) and think you got it right.

You caught me with hands in jam: that was exactly my way of thinking. While
I see that this is clearly too naive to be right, I would not be able to
improve it any further myself.

I welcome any refinement. Especially, I was curious about the other ratios
(DOS 1:8,1:4,1:4; X11win 1:10,1:3,1:4; TrueType 1:~30) that you mentioned on
your previous message.

_ Marco

A quick look at the Hangul syllable table starting on page 744 of 
TOS3 shows a much greater variation. If you look at the pages 
slightly cross-eyed so that each glyph aligns with a neighbor, and 
wink each eye alternately, you can get the effect of a blink 
comparator of the type used in astronomy before computer image 
processing became practical. If you can't keep the alignment while 
winking, just look for the fuzzy letters where the glyphs don't match 
up.

Or we could ask a typographer.  :-)
-- 

Edward Cherlin
Generalist
A knot! exclaimed Alice. Oh, do let me help to undo it.
Alice in Wonderland




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Edward Cherlin

At 5:12 PM +0200 5/31/01, Marco Cimarosti wrote:
Hi.

 Well, it can be said to be above  the minimum :-) depending on
  how you look at things. If you're a developer of embedded
  device with a
  really stringent requirement in memory footprint (for font
  and others),
  you may just go with 1:1 ratios for all three groups of Jamos
  (consonants
  and vowels) as found in old (mechanical) Hangul typewriters. However,
  as you can guess, the result is not pleasing to most eyes.

The manual Hangul typewriter I learned on had multiple forms for 
initial consonants, supplied by means of an extra shift level. (Yes! 
A mechanical buckybit!!  %-[ )

The really minimal level was *linear* Hangul produced by the telegraph system.

[snip]

The minimal model that I have in mind is slightly less minimal: the least
quality that won't sacrifice the normal orthographic rules of a language.

Which rules are the normal ones? Every publisher I've had anything to 
do with has used different sets of rules, over quite a wide range. We 
can't even agree whether ligatures are required in English, or 
whether an ASCII-sorted index is sufficiently human-readable.

Ciao.
Marco

-- 

Edward Cherlin
Generalist
A knot! exclaimed Alice. Oh, do let me help to undo it.
Alice in Wonderland




Some Char. to Glyph Statistics, Pan/Single Font

2001-05-30 Thread James E. Agenbroad

 Wednesday, May 30, 2001
Attached is a note I wrote in September 1993 about the ratio of characters
to glyphs in several Indic scripts.  Much has changed on the Unicode
front since then, but I think the need for rendering software to decide
which of many glyphs to use to represent a given sequence of codes is
still with us.  A similar situation obtains with Arabic--unless one
requires the use of Arabic presentation forms.  If one excludes the
combining characters at U+0300 to 0362 European scripts tend to have a 1:1
character to glyph ratio; Chinese, Japanese and (maybe Korean) scripts
also tend to have a 1:1 character to glyph ratio.  But most scripts
between Europe and the Far East--Arabic, South and Southeast Asian ones do
not.  Unless the rendering software and the fonts are in synch the results
will be unsatisfactory.  A few postings in the 'single font' discussion
have mentioned this but I hope some data may be helpful.
 The story goes that back in Ancient Greece (I think) someone was
describing Utopia and a listener asked, "But who will do the work?" and
the reply was, "Oh, we will have slaves."  The computer now can be an
effective slave when given explicit instructions, but without consistent
instructions the results will not be satisfactory.
 This may be beyond the scope of Unicode which aims to unambiguously
encode text for the computer (and succeeds) but does not dwell on details
of its input or output--rendering it legible for humans to read.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  

-- Forwarded message --
Date: Fri, 10 Sep 93 14:12:07 -0400
From: jage (James E. Agenbroad)
To: [EMAIL PROTECTED]
Cc: jage@seq1
Subject: Some Character to Glyph Statistics

Friday, September 10, 1993
Glenn,
 Recent Internet discussions about fonts for ISO10646/Unicode prompted
me to do some counting.  The data are suggestive rather than definitive
at least in part because the counts of glyphs are based on only a single
source and it may not be up to date.  They do suggest that for various
writing systems of South (and maybe Southeast) Asia based on Indic scripts
the ratio of coded characters to glyphs is not 1:1 but 1:2 or even 1:3.
I'm sure this is no surprise to you but the Internet discussions make no
mention of it so I thought I would.  When a writing system has more glyphs
than characters I think there must be software to decide when which glyph
is wanted.  (This software may also need to know something about the
target device but that's not an issue I can shed any light on.)  
 As a preliminary assessment I have counted the number of character
codes ISO 10646 assigns for several writing systems and the number of
glyphs from synopses of the same writing systems as found in Specimen
book of 'Monotype' non-latin faces, issued loose-leaf by Monotype
Corporation.  I give the number and date of each sheet.  In counting
I have omitted western style punctuation and numerals.  

Writing system   Sheet no., date    10646   Mono.    Rough
                                    chars   glyphs   ratio

Bengali          470, 5/65             89     331    1:3
Burmese          558, 5/64             76     213    1:3
Devanagari       155, 8/75            104     248    1:2.5
Gujarati         460, 7/71             75     232    1:3
Gurmukhi         601, 9/74             74     146    1:2
Kannada          588, 9/69             80     236    1:3
Malayalam        590, 7/75             78     590    1:7
Oriya            706, 3/70             78     371    1:4
Sinhalese        557, 1/64             90     348    1:3.5
Tamil            280, 1/64             61     171    1:3
Telugu           626, 3/71             80     312    1:4
Thai             577, 4/74             92     208    1:2
Tibetan          (Van Ostermann)       80     158    1:2

For Sinhalese and Tibetan (not in 10646 yet) the count is from Unicode
Technical Report no. 2.  For Devanagari and Gurmukhi the sheet has a note: "A
special mould is required for these matrices."  The relation of these
fonts to current systems is unclear.  As noted, my Monotype book does
not include Tibetan; the glyphs are from George Van Ostermann's
Manual of foreign languages, 4th ed., 1952--I counted the letters, ligatures,
numerals, vowel signs and punctuation.
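
The rough ratios can be recomputed from the two count columns. This is a minimal sketch; the split of the run-together figures in the table (for example, Bengali read as sheet 470, dated 5/65, 89 characters, 331 glyphs) is an assumption, chosen because it reproduces the rough ratios given.

```python
# (writing system, ISO 10646 character count, Monotype glyph count)
# Values as read from the table above; the column split is an assumption.
COUNTS = [
    ("Bengali", 89, 331),
    ("Burmese", 76, 213),
    ("Devanagari", 104, 248),
    ("Gujarati", 75, 232),
    ("Gurmukhi", 74, 146),
    ("Kannada", 80, 236),
    ("Malayalam", 78, 590),
    ("Oriya", 78, 371),
    ("Sinhalese", 90, 348),
    ("Tamil", 61, 171),
    ("Telugu", 80, 312),
    ("Thai", 92, 208),
    ("Tibetan", 80, 158),
]

def glyphs_per_character(chars, glyphs):
    """Number of glyphs per coded character (the n in a 1:n ratio)."""
    return glyphs / chars

for name, chars, glyphs in COUNTS:
    ratio = glyphs_per_character(chars, glyphs)
    print(f"{name:<12} {chars:>4} {glyphs:>4}   1:{ratio:.1f}")
```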

I would also like to express my agreement with the man from New South Wales
who said libraries will need to display lots of different characters.  I
do not know if this means one large font or many, so long as they are
all available when needed to display a string of character codes--without
the recipient knowing what will be needed and taking extra measures to
load the proper font.  The fonts for such purposes would not need to have
extremely

Re: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-30 Thread Eric Muller

You may be interested in "Creating and supporting OpenType fonts for Indic
scripts" and "Creating and supporting OpenType fonts for Arabic scripts", both
available at http://www.microsoft.com/typography/tt/tt.htm.

To give a little bit of context, the OpenType architecture separates shaping in
two parts: the part that is script-dependent but font-independent (embodied on
Windows in the Uniscribe engine), and the part that is font-dependent (embodied in
GSUB/GPOS/GDEF/BASE tables in fonts). The GPOS/GSUB tables are best conceived as
shaping subprograms stored in the font, and those subprograms are called by
Uniscribe. The documents above describe the API between the shaping engine and the
fonts.

I am not aware of similar material for AAT fonts, but that's another place to look
at.

Also, Latin cursive fonts tend to have a large number of glyphs: ligatures to
simulate the connectivity of the individual letters, and variants to simulate the
randomness of hand writing. This is not much different from Arabic fonts, not
surprisingly.

Eric.






Re: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-30 Thread Jungshik Shin

On Wed, 30 May 2001, James E. Agenbroad wrote:

  Thank you for the interesting piece of information.


  Wednesday, May 30, 2001
 Attached is a note I wrote in September 1993 about the ratio of characters
 to glyphs in several Indic scripts.  Much has changed on the Unicode
 front since then, but I think the need for rendering software to decide

 character to glyph ratio; Chinese, Japanese and (maybe Korean) scripts
 also tend to have a 1:1 character to glyph ratio.  But most scripts

  In the case of Korean Hangul, your 'maybe' can be justified because
the situation is not so simple. If you only consider the pre-composed syllable
block beginning at U+AC00 and have fonts with pre-composed glyphs for all
of those syllables, it could be 1:1. However, if you turn your eyes
to the U+1100 Hangul Consonant/Vowel block and want to have full-fledged
support of medieval Korean, the ratio can be anybody's guess from 1:1
(poor quality, unconventional shape) to 1:n to m to n (where n can be
a few tens if not more). In the 1980s, typical MS-DOS based programs (or
Hangul rendering libraries/engines) used something like 1:8, 1:4, 1:4 for
initial consonants, medial vowels, and final consonants, respectively. A
Korean variant of xterm (a terminal emulator for the X11 window system) has
been using fonts with 1:10, 1:3, 1:4 ratios. Some high quality TrueType
fonts for Hangul these days (internally) have 1:n (n ~ 30), I believe.
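
The relation between the two blocks is purely arithmetic for modern Hangul, which a short sketch can show. It uses the standard syllable decomposition arithmetic (19 initial, 21 medial, 27 final jamos plus "no final"). The glyph_budget helper is hypothetical and assumes the per-jamo ratios apply to the modern jamos only, which may not match what those DOS or X11 fonts actually covered.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no final consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # number of precomposed syllables

def decompose(syllable):
    """Split one precomposed Hangul syllable into its conjoining jamos."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    jamos = chr(L_BASE + l) + chr(V_BASE + v)
    if t:                                # t == 0 means no final consonant
        jamos += chr(T_BASE + t)
    return jamos

def glyph_budget(initial_forms, medial_forms, final_forms):
    """Glyphs needed if every modern jamo gets the given number of shapes."""
    return (L_COUNT * initial_forms
            + V_COUNT * medial_forms
            + (T_COUNT - 1) * final_forms)

print("precomposed syllables:", S_COUNT)
print("U+D55C ->", [f"U+{ord(c):04X}" for c in decompose("\uD55C")])
print("1:8, 1:4, 1:4 glyphs:", glyph_budget(8, 4, 4))
print("1:10, 1:3, 1:4 glyphs:", glyph_budget(10, 3, 4))
```

Under these assumptions a few hundred jamo shapes stand in for a glyph per precomposed syllable, at the cost of software that picks the right shape for each position.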


 -- Forwarded message --
 Date: Fri, 10 Sep 93 14:12:07 -0400
 From: jage (James E. Agenbroad)
 Subject: Some Character to Glyph Statistics

  Recent Internet discussions about fonts for ISO10646/Unicode prompted
 me to do some counting.  The data are suggestive rather than definitive
 at least in part because the counts of glyphs are based on only a single
 source and it may not be up to date.  They do suggest that for various
 writing systems of South (and maybe Southeast) Asia based on Indic scripts
 the ratio of coded characters to glyphs is not 1:1 but 1:2 or even 1:3.

 I thought (without any basis or hard data; that is, it was just my wild
guess) the ratio would be much higher than 1:3 for Indic scripts.
With the ratio being only 1:3 or so, I guess Indic scripts are in much
better shape to be supported than medieval (and some elements of
modern) Korean. Projects like Pango (http://www.pango.org) have already
begun to support Indic and Thai scripts, let alone other commercial and
non-commercial implementations (Uniscribe, AAT, Graphite, ...). Therefore,
the eight years since your original message haven't been wasted, I think :-)

  Jungshik Shin





Unicode character encoding statistics

2001-02-16 Thread Kenneth Whistler

BTW, if anyone was wondering where I came up with the
figure 880,325 reserved unassigned code points for Unicode
3.1, here are the complete statistics for Unicode 3.0 and
Unicode 3.1:

Unicode:                   U 3.0     U 3.1

BMP Alphas/Symbols         10236     10238
Suppl Alphas/Symbols                  1691
Han (URO)                  20902     20902
Han (Ext A)                 6582      6582
Han (Ext B)                          42711
Han Compat                   302       302
Suppl Han Compat                       542
Hangul Syllables           11172     11172

Subtotal                   49194     94140

BMP Private Use             6400      6400
Suppl Private Use         131068    131068
Surrogate Code Points       2048      2048
Controls                      65        65
BMP Noncharacters              2        34
Suppl Noncharacters           32        32
BMP Reserved                7827      7793
Suppl Reserved            917476    872532
The total number of code points accounted for
here is 1,114,112 (= 17 x 64K), i.e.
U+0000..U+10FFFF.
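
The arithmetic can be re-checked with a short script. This is a sketch with the two columns read as below: a count that appears only once in a row is taken to belong to the 3.1 column, with 0 for 3.0. That reading is an assumption, though the stated subtotals add up only that way.

```python
# category: (Unicode 3.0, Unicode 3.1) code point counts from the table above.
# A count missing from the 3.0 column is read as 0 (an assumption).
ASSIGNED = {
    "BMP Alphas/Symbols": (10236, 10238),
    "Suppl Alphas/Symbols": (0, 1691),
    "Han (URO)": (20902, 20902),
    "Han (Ext A)": (6582, 6582),
    "Han (Ext B)": (0, 42711),
    "Han Compat": (302, 302),
    "Suppl Han Compat": (0, 542),
    "Hangul Syllables": (11172, 11172),
}
OTHER = {
    "BMP Private Use": (6400, 6400),
    "Suppl Private Use": (131068, 131068),
    "Surrogate Code Points": (2048, 2048),
    "Controls": (65, 65),
    "BMP Noncharacters": (2, 34),
    "Suppl Noncharacters": (32, 32),
    "BMP Reserved": (7827, 7793),
    "Suppl Reserved": (917476, 872532),
}

CODESPACE = 17 * 0x10000  # 17 planes of 64K code points each

def column_sum(table, column):
    """Sum one version's column (0 for 3.0, 1 for 3.1) over a table."""
    return sum(row[column] for row in table.values())

def reserved(column):
    """Reserved (unassigned) code points for one version."""
    return OTHER["BMP Reserved"][column] + OTHER["Suppl Reserved"][column]

for column, version in enumerate(("3.0", "3.1")):
    subtotal = column_sum(ASSIGNED, column)
    total = subtotal + column_sum(OTHER, column)
    print(f"Unicode {version}: subtotal={subtotal} "
          f"reserved={reserved(column)} total={total}")
```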

--Ken