RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread Edward Cherlin

At 9:21 AM -0700 5/30/01, Carl W. Brown wrote:
Sorry,

Han or Hanzi is not adequate to cover Korean.  If you want to get 
picky, I am sure that most people are aware that there are Chinese 
minority languages, for example, that use other fonts.  Typically the 
term CJK works for most of us.  Those who don't understand the term 
are generally not familiar with the issues.

With Unicode you don't have the MBCS issues.  What is left are more 
subtle issues.  You could call them East Asian fonts as long as you 
distinguished them from Southeast Asian fonts which, except for 
Vietnamese, are more like Indic fonts.

Carl

[sigh] There is no such thing as the correct names for anything. If 
people agree to use names in the same way, we have achieved 
something, and if the names reflect the structure of the things in 
question even a little, we have achieved a lot.

The names Europe and Asia are accidents of Greek history and 
culture passed down for more than two millennia, not real geographic 
divisions, and certainly not linguistic divisions. Europe was the 
Greek territories to the west of the Bosporus (+barbarians), and Asia 
was the Greek territories to the east of the Bosporus (+barbarians).

I like to use the term Han characters to refer to the characters 
that came down to us from the Han, plus their ancestors back to the 
oracle bones and other characters created later on within the same 
tradition. This includes PRC Simplified and Vietnamese Chu Nom, but 
not other characters used in various writing systems alongside the 
Han characters: Zhuyin, Hangul, Kana, Western (Arabic/Hindu) 
numerals, punctuation, etc. I prefer not to write or speak of Han 
scripts. I am willing to use CJK or CJKV for writing systems that 
make (or used to make) essential use of Han characters, even though 
both terms are seriously inaccurate. I prefer not to use geographical 
terms for linguistic ideas, except in the rare cases, like India, 
where the geographic boundaries were drawn to match linguistic 
divisions (based in their case, on religious divisions). I do not 
expect anybody in particular to agree with me on these usages, and 
you can talk to me if you have A Better Idea[TM].

YMMV.

-Original Message-
From: [EMAIL PROTECTED] 
[mailto:[EMAIL PROTECTED]]On Behalf Of N.R.Liwal
Sent: Wednesday, May 30, 2001 11:11 AM
To: [EMAIL PROTECTED]
Subject: RECOMMENDATIONs( Term Asian is not used properly on 
Computers and NET)

TERM ASIA IN COMPUTER & INTERNET (RECOMMENDATIONS UNICODE LIST MAY 2001)

So far the recommendations are that Asian Text Fonts can be called:
-Han Fonts or Hanzi Fonts
-East Asian Unified Fonts
-East Asian Fonts

Urghh.

Chinese fonts
Korean fonts
Japanese fonts
Chu Nom fonts
etc. fonts
CJK fonts
Unicode fonts

Script Can be classified as:
-languages which use Han ideographs
-'ideographic languages' SCRIPT
-East Asian Unified SCRIPT
- East Asian SCRIPT

Urghh. Urghh.

Traditional Chinese writing system (Han with numerals, punctuation, 
etc., with or without Zhuyin)
Simplified Chinese writing system (similarly)
Korean writing system (Hangul with or without Hanja, but with numerals, etc.)
Japanese writing system (Kanji, Hiragana, Katakana, numerals, symbols, etc.)

In each case with the possibility of adding Latin alphabet (Pinyin, 
romaji) and perhaps Cyrillic and Greek.

As I said earlier, there are no correct names except possibly by agreement.

Asian geographic expressions are better:
-Southeast Asia, East Asia, CENTRAL ASIA
WEST ASIA = Arabic Countries and Neighborhood


Triple Urghh.

Have you ever heard the term granfalloon? The only association 
between location and language is *political*, and there is no nation 
without minorities.

Let us speak with moderate precision of languages usually or 
sometimes written in Arabic script or ...Indic scripts and the 
like. Please.

Thanks to all who participated in discussion:

You're certainly welcome.


N.R.Liwal
Asiaosft
http://www.liwal.net
[snip]
-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland




RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread てんどうりゅうじ
Um.
Okay, what is the font supposed to have?

Is this list correct??

1. Han
2. Kana
3. Hangul
4. Those many, many Latin letters with diacritics for Vietnamese use
5. Probably also ASCII and misc. Han punctuation and similar odds and ends

(sigh) Are you sure you want just *one* box for that? I think you want four.

ARRRGGHH


★じゅういっちゃん★

"AIS TSXQ QDOO TD AISC TDQMIG, HYCTDL,
ZIC HIIUPLB XSHM GDOPHPISX CYTDL."
"QMD XDHCDQ, AIS XDD,
PX QMDCD'X LI CDHPWD.
P VSXQ WSQ RMYQ P MYED KA TA YCT PL."


RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)

2001-05-31 Thread Marco Cimarosti

Kenneth Whistler wrote:
 Plane 14 PUA usage description tags? Naaah, nobody would suggest such
 a bizarre thing, would they?

The three words "PUA usage description" are redundant, methinks. Removing
them leaves a more concise and dramatic example of a weird proposal.

_ Marco




Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Mike Meir



The problem with your glyph statistics is that they 
are based on mould counts employed by the Monotype hot metal 
typesetters. The Monotype system was capable of extensive 
kerning, and therefore many glyphs were constructed from the elements provided 
by the moulds at the time of composition. The Monotype list of elements 
therefore comprises:

1. Full characters which are either basic or 
   could not be composed satisfactorily by the system for whatever reason. 
   These might properly be described as glyphs.
2. Elements which were combined either with the first 
   set, or with one another, to create glyphs, or approximations to glyphs at the 
   time of casting. These cannot really be considered to be glyphs, as 
   such.

However, if one allows that these elements are 
glyphs, the real number of glyphs employed by Monotype was limited by the matrix 
case: before 1962 to 225 sorts, and subsequently to 272 sorts. Although 
additional sorts might be available, they could only be used by substitution 
with another sort prior to any actual typesetting.

More recent Monotype code pages for Bengali seem to 
be around 450 elements, which are combined with floating elements to create 
text.

To date all Indic script composition has been 
pretty much limited by technology. Taking Bengali as an example, Figgins, around 
1826, employed 370 sorts, many of which are kerning versions of other sorts, 
allowing the composition either of consonant-vowel combinations or 
approximations to complex conjuncts which were insufficiently common to warrant 
the creation of separate punches. But again, a number of his sorts exist only to 
allow the incorporation of combinations which could not be produced by the 
technology of the time.

Our recent revision of the Linotype Bengali code 
page extends to a font of some 980 elements. 136 of these are differently spaced 
floating elements, such as vowel signs and chandrabindus, which have no 
meaning separate from the main characters to which they may be attached, and 
which would be omitted from an OpenType version. It also includes 146 
characters which duplicate the Unicode-encoded Bengali characters, which is 
required for current technological reasons - Microsoft's Office XP does not 
allow the display of Unicode-encoded Bengali characters in the font, or at the 
size which is expected. So the "real" number of elements is 698. (I may 
also add that we have had to produce alternative versions of the same fonts in 
which non-spacing elements actually space quite considerably, because 
of the very strange behaviour of Microsoft's Internet Explorer 5.5, so the 
glyph count is larger than the 980 - another case of technology determining 
counts.)

Turning to Devanagari, our researches indicate that 
the total number of script units (in Unicode terms, combinations of 
consonants, halants, vowel signs and other signs), excluding the Unicode 
characters in the range 0951 to 0954, in use is around the 5550 mark. It is 
actually greater than this, since there are a number of characters relating to 
Sanskrit sandhi for which we do not have any conjunct-vowel 
statistics.

In principle, all these should be regarded 
as glyphs, though few fonts are likely to implement them all (the 
slaves in this context needing to be human beings, since the issue of the 
spacing and modification of a smaller number of base elements to produce all 
these glyphs is an aesthetic rather than a mechanical problem).

I have also not included in the count the many 
variant forms of glyphs which occur as a result of differences in formulation for 
particular combinations.

(I have also excluded the rather large number of 
glyphs which are to be found in the Mangal font supplied by Microsoft, but which 
seem to be there purely because of a rather strange and literal interpretation 
of the Unicode Devanagari shaping rules, on the grounds that these glyphs exist 
only in the font, and would never be used in text.)







RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Marco Cimarosti

Hi.

Well, it can be said to be above the minimum :-) depending on
 how you look at things. If you're a developer of an embedded device with
 a really stringent requirement in memory footprint (for font and
 others), you may just go with 1:1 ratios for all three groups of Jamos
 (consonants and vowels) as found in old (mechanical) Hangul typewriters.
 However, as you can guess, the result is not pleasing to most eyes.

Of course. If the requirements are even more stringent (e.g., the user is
blind) you can even represent the letters with a 2x3 matrix of pixels.

Similarly, when I was a child, the first companies that started using
electronic brains to bill customers sent notes printed in all capital
letters and with no apostrophes.

The minimal model that I have in mind is slightly less minimal: the least
quality that won't sacrifice the normal orthographic rules of a language.

Ciao.
Marco




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Marco Cimarosti

Mike Meir wrote:
 The problem with your glyph statistics is that they are based 
 on mould counts employed by the Monotype hot metal typesetters.

I agree: no one will ever come up with *the* correct count.

Such general evaluations simply depend on too many things to be useful.
E.g.: which language(s) are targeted, what degree of typographic excellence
is required, and (as Mike explained very well) the kind of technology
involved and its limitations.

The simple fact that software fonts can overlay glyphs can yield a great
reduction factor compared to lead type. Similarly, the fact that a
software font technology has the capability of kerning glyphs vertically can
dramatically reduce the inventory of glyphs needed for certain scripts.

Moreover, different technologies may have totally different meanings for the
word glyph. E.g., I have heard of Arabic fonts that analyze the Arabic
script well below the level of a grapheme: segments of lines and
individual dots were stored separately and assembled at display time.
Comparing the number of glyphs in such a font with the inventory of a
more traditional font is what we call summing up apples and pears.

 Turning to Devanagari, our researches indicate that the total 
 number of script units (In Unicode terms, combinations of 
 consonants, halants, vowel signs and other signs),  excluding 
 the Unicode characters in the range 0951 to 0954, in use is 
 around the 5550 mark. It is actually greater than this, since 
 there are a number of characters relating to Sanskrit sandhi 
 for which we do not have any conjunct-vowel statistics.

As an opposite example for Devanagari, I did a little research on my own on
a minimal rendering scheme for Unicode Indic scripts. The scenario behind
this evaluation was low-resolution displays or printers and simple bitmapped
fonts.

For Devanagari's 77 characters (non-decomposable L and M characters) my
set of glyphs was just 82 pieces. Of course, such a ratio (about 1:1.06)
requires dropping any typographical gracefulness: of all the complexity of
Devanagari, just a handful of half-consonants and ligatures was preserved.

Neither your 5550 nor my 82 are of much use to anyone who has even
slightly different requirements. However, the contrast between these two
figures perhaps says something about the difficulty of such a count.

_ Marco




RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

2001-05-31 Thread Carl W. Brown



Simon,

I now 
see that you support both "UTF8", where surrogates are encoded as 6 bytes, and 
"AL32UTF8", where surrogates are encoded as 4 bytes. The way your 
documentation reads, many users are likely to select "UTF8" over 
"AL32UTF8". You should have users who already have UTF8 databases migrate 
to the proper UTF8 encoding rather than making them the exception to the 
rule.
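For readers unfamiliar with the difference being discussed, here is a sketch (in Python, my illustration, not from the thread) of the two byte sequences for one supplementary-plane character; the 6-byte form encodes each UTF-16 surrogate code unit separately as a 3-byte sequence:

```python
# U+10400 DESERET CAPITAL LETTER LONG I, a supplementary-plane character
ch = "\U00010400"

# Standard UTF-8 (Oracle's AL32UTF8): one 4-byte sequence
standard = ch.encode("utf-8")

# The 6-byte scheme: encode each UTF-16 surrogate code unit
# separately as an ordinary 3-byte UTF-8-style sequence
def enc3(cu):
    return bytes([0xE0 | (cu >> 12),
                  0x80 | ((cu >> 6) & 0x3F),
                  0x80 | (cu & 0x3F)])

hi, lo = 0xD801, 0xDC00          # surrogate pair for U+10400
six_byte = enc3(hi) + enc3(lo)

print(standard.hex())   # f0909080
print(six_byte.hex())   # eda081edb080
```

Note that a strict UTF-8 decoder rejects the 6-byte form, since surrogate code points are not valid in UTF-8.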

If you 
have this funny encoding please don't call it UTF8 because it is not UTF8 and 
will only confuse users. You could call it OTF8 or something like that but 
not UTF8.

Carl

  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Simon Law
  Sent: Wednesday, May 30, 2001 11:02 AM
  To: [EMAIL PROTECTED]
  Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

  Hi Folks, 
  Over the last few days, this email thread has generated many interesting 
  discussions on the proposal of UTF-8s. At the same time some speculation has 
  been generated on why Oracle is asking for this encoding form. I hope to 
  clarify some of this misinformation in this email. 
  In Oracle9i, our next database release shipping this summer, we have 
  introduced support for two new Unicode character sets. One is 'AL16UTF16', 
  which supports the UTF-16 encoding, and the other is 'AL32UTF8', which is the 
  fully compliant UTF-8 character set. Both of these conform to the Unicode 
  standard, and surrogate characters are stored strictly in 4 bytes. For more 
  information on Unicode support in Oracle9i, please check out the whitepaper 
  "The Power of Globalization Technology" on http://otn.oracle.com/products/oracle9i/content.html 

  The requests for UTF-8s came from many of our Packaged Applications 
  customers (such as PeopleSoft, SAP, etc.); the ordering of the binary sort is 
  an important requirement for these Oracle customers. We are supporting them 
  and we hope to turn this into a TR such that UTF-8s can be referenced by other 
  vendors when they need to have a compatible binary order for UTF-16 and UTF-8 
  across different platforms. 
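The binary-order requirement can be illustrated with a sketch (Python, my illustration): UTF-8 byte order matches code point order, while UTF-16 code-unit order places supplementary characters (whose lead surrogates start at 0xD800) before BMP characters in the range U+E000..U+FFFF, so the two binary orders disagree:

```python
a = "\uFFFD"      # a BMP character near the top of the plane
b = "\U00010000"  # the first supplementary character

# UTF-8 byte order agrees with code point order: a sorts before b
utf8_order = a.encode("utf-8") < b.encode("utf-8")

# UTF-16 comparison works on 16-bit code units; b begins with lead
# surrogate 0xD800, which sorts before 0xFFFD, so b comes first
def utf16_units(s):
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i+2], "big") for i in range(0, len(raw), 2)]

utf16_order = utf16_units(a) < utf16_units(b)

print(utf8_order, utf16_order)  # True False
```

The UTF-8s proposal makes the UTF-8-side byte order match the UTF-16 code-unit order, at the cost of breaking UTF-8 conformance.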
  The speculation that we are pushing for UTF-8s because we are trying to 
  minimize our code change for supporting surrogates, or because of our 
  unique database design, is totally false. Oracle has a fully 
  internationalized, extensible architecture and has introduced surrogate 
  support in Oracle9i. In fact we are probably the first database 
  vendor to support both the UTF-16 and UTF-8 encoding forms; we will continue 
  to support them and conform to future enhancements to the Unicode Standard. 
  Regards  
  Simon 
  "Carl W. Brown" wrote: 
  Ken, 
I suspect that Oracle is specifically pushing for this standard because 
of its unique data base design. In a sense Oracle almost picks it 
self up by its own bootstraps. It has always tried to minimize 
actual code. Therefore it was a natural choice to implement 
Unicode with UTF-8 because it is easy to reuse the multibyte support 
with minor changes to handle a different character length 
algorithm. This has been one of the reasons that Oracle has been 
successful. Its tinker toy like design has enabled them to quickly 
adapt and add new features. Now however, they should take the time 
do "do it right". Its UTF-8 storage creates problems for database 
designers because they can not predict field sizes. This is a 
problem with MBCS code pages but UTF-8s will make it worse. There 
will be lots of wasted storage when characters can vary in size from 1 
to 6 bytes. 
Most other database systems require specific code to support 
Unicode. As a consequence most have implemented using UCS-2. 
Their migration is obviously to use UTF-16. UTF-8s buys them 
nothing but headaches. 
Carl 
-Original Message-
From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 29, 2001 3:47 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)
Carl, 
 Ken, 

 UTF-8s is essentially a way to ignore surrogate processing. It allows a 
 company to encode UTF-16 with UCS-2 logic. 

 The problem is that by not implementing surrogate support you can 
 introduce subtle errors. For example it is common to break buffers apart 
 into segments. These segments may be reconcatenated but they may be 
 processed individually. 
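The buffer-splitting hazard described above can be sketched (my illustration, in Python): UCS-2-style logic that splits a UTF-16 buffer at an arbitrary code-unit boundary can cut a surrogate pair in half, leaving a segment that is no longer valid on its own:

```python
text = "a\U00010400b"             # a supplementary character between ASCII letters
buf = text.encode("utf-16-le")    # 8 bytes: 'a', lead surrogate, trail surrogate, 'b'

# UCS-2-style logic: split after two 16-bit units, landing mid-surrogate-pair
head, tail = buf[:4], buf[4:]

try:
    head.decode("utf-16-le")      # the segment ends with an unpaired lead surrogate
    ok = True
except UnicodeDecodeError:
    ok = False

print(ok)  # False
```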
You are preaching to the choir here. I didn't state that *I* was in 
favor of UTF-8S -- only that we have to be careful not to assume that 
UTC will obviously not support it. The proponents of UTF-8S are 
vigorously and actively campaigning for their proposal. In 
standardization committees, proposals that have committed, active 
proponents who can aim for the long haul, often have a way of getting 
adopted in one form or another, unless there are 

RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread Jonathan Rosenne

If we mean CJK why can't we say CJK?

Jony




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Marco Cimarosti

Jungshik Shin wrote:
   I think I know how you counted (initial consonants:
 two for syllables with and without final consonants, three for three
 kinds of vowel position/shape, vowels: two for syll. 
 with/without final consonants) and think you got it right.

You caught me with hands in jam: that was exactly my way of thinking. While
I see that this is clearly too naive to be right, I would not be able to
improve it any further myself.
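The counting scheme quoted above can be sketched as arithmetic (my reconstruction; the jamo counts are the modern Unicode ones, the per-jamo variant counts are the assumptions described in the quote):

```python
# Modern Hangul jamo counts: 19 initials, 21 vowels, 27 finals
INITIALS, VOWELS, FINALS = 19, 21, 27

# Variants per the quoted scheme (assumed): each initial needs
# 2 (with/without final) x 3 (vowel position/shape) shapes; each vowel
# needs 2 (with/without final); each final needs 1 shape.
glyphs = INITIALS * (2 * 3) + VOWELS * 2 + FINALS * 1   # 183

# versus the full precomposed modern syllable repertoire
syllables = INITIALS * VOWELS * (FINALS + 1)            # 11172

print(glyphs, syllables)
```

So even this naive positional-variant scheme needs under 200 glyphs, against 11,172 precomposed syllables.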

I welcome any refinement. Especially, I was curious about the other ratios
(DOS 1:8,1:4,1:4; X11win 1:10,1:3,1:4; TrueType 1:~30) that you mentioned on
your previous message.

_ Marco




Re: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread N.R.Liwal

Dear Jungshik Shin;

Thanks, good explanations. I hope those who are interested in software and
the Web for Asia will benefit.

Thanks.

Liwal

- Original Message -  On Wed, 30 May 2001, N.R.Liwal wrote:

  TERM ASIA IN COMPUTER & INTERNET (RECOMMENDATIONS UNICODE LIST MAY
2001)
 
  So far the recomendations are, that Asian Text Fonts can be called:
  -Han Fonts or Hanzi Fonts

   As already pointed out, this is not adequate to cover Korean
 and Japanese because other scripts are also used for them. Moreover,
 Japanese may not like 'Hanzi' even if you're talking about
 Hanzi/Kanji/Hanja alone. Even 'Han' (which is more neutral) could be
 balked at by some.

  -East Asian Unified Fonts
  -East Asian Fonts

   If they mean fonts for Chinese, Japanese and Korean writing
 systems, I would pick 'East Asian fonts'.


  Script Can be classified as:
  -languages which Han ideographs

 you're talking not about language(s) but about script(s) , right?

  -'ideographic languages' SCRIPT

A language cannot be ideographic as I wrote before. Has anybody else
 mentioned this term other than me? I mentioned it not because I think it's
 appropriate BUT because I think that the term (ideographic language)
 MUST NOT be used.

  -East Asian Unified SCRIPT

   What's been 'unified' is Han 'ideographs' while there ARE other
 scripts in (more predominant) use in the region (even if you only mean
 Chinese,Japanese and Korean by 'East Asian').

  - East Asian SCRIPT

   What 'script' (not 'scripts') are you talking about here?
 If you just mean 'Han ideographs', I don't think you  need to come up with
 new term(s). I think 'Han ideograph' (or CJK ideographs if it ONLY means
 Hanzi/Kanji/Hanja and nothing else)  is good enough (although certainly
 not perfect.)  On the other hand, if you're talking about all the scripts
 used in Northeast/East Asian countries (or China, Japan and Korea),
 you CANNOT use any of the above with the possible exception of the last
 (which can be used provided that they're made plural 'East Asian Scripts'
 to reflect that there are *multiple* scripts in use.)


  Asian geographic expressions are better:
  -Southeast Asia, East Asia CENRAL ASIA
  WEST ASIA = Arabic Countries and  Neighborhood

   I believe the following are widely used at least in 'geography
 textbooks' and 'encyclopedias'. Also, many US schools with regional
 studies programs use similar divisions (except for Southwest Asia, which
 appears to be referred to as 'Middle East' most of the time). This division
 is bound to be arbitrary to some degree (the Asian continent is not a circle
 or any definitive geometric shape which can be divided in an unambiguous
 way ;-) )


   East Asia/Northeast Asia : Japan, Korea, China (it's a huge country, but...);
  'Far East' (in Western media and at least in some East Asian media :-) )
   Southeast Asia : Indochina, Malaysia, Singapore, Indonesia, Thailand, Burma, ...
   South Asia : India, Pakistan, Sri Lanka, Bangladesh, Nepal, ...
   Southwest Asia : the part of Asia usually called 'Middle East'
  (in Western media and at least in some East Asian media :-) );
  Arabian peninsula, Iran, Iraq, Turkey (Near East?),
  Afghanistan (it could be put in South Asia...)
   Central Asia : Mongolia and some former republics of the USSR (now
  independent, e.g. Kazakhstan)
   North Asia (??) : Siberia?

   FYI, Mozilla uses the following:

East Asian  : Chinese, Japanese, Korean
SE & SW Asian   : Thai, Armenian*, Turkish*
Middle Eastern  : Hebrew, Arabic
Western European: ..., Greek*(why?),.
Eastern European:

I guess it's better than Office XP, which calls Chinese, Japanese, Korean
 'Asian', but it could still have done better. (Middle East and SW Asia
 overlap each other, so they had better split up SE & SW Asia, remove
 Middle East'ern', put Armenian, Turkish, Hebrew and Arabic into 'SW
 Asian' and fill up 'SE Asian' with Thai, Vietnamese, Cambodian and so
 forth when they get supported). That is, I would use the following
 for programs like web browsers and word processors.


East Asian   : Chinese, Japanese, Korean + some more
(or NE Asian)  if necessary and supported (e.g Yi)
SE Asian : Thai,Vietnamese,Lao, Khmer, etc
South Asian  : various Indic scripts (other than those included in
   SE Asian), Tibet*
SW Asian : Arabic, Hebrew, Syriac, Armenian*, Turkish*, etc
(Middle Eastern)
Central Asian: Mongolian, Khazahstan(?),   when supported

   Of course, geographic break-up has its pitfalls and some people
 for sure wouldn't like it for various reasons. For instance, Turkish
 and 

RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

2001-05-31 Thread Ayers, Mike


If you have this funny encoding please don't call it UTF8 because it is not
UTF8 and will only confuse users.  You could call it OTF8 or something like
that but not UTF8.

How about WTF-8?

Sorry - I couldn't resist.


/|/|ike




RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread Carl W. Brown

Liwal,

Such classifications are not easy.  For example, Azeri can be written in both
Latin and Cyrillic scripts.  The Latin script is much like Turkish, which has
the dotted and dotless i.  This is not necessarily a big issue for fonts,
but it requires special case-shifting logic.
What do you do about scripts that are not tied to a locale?  The Orthodox
Church uses a special Cyrillic font that is different from standard
Cyrillic.

The classifications vary not only by script but by how it affects your
specific field of interest and the implementation.  For example, Unicode
implements Ethiopic as fully formed syllabic characters.  Some
implementations use decomposed syllables.  This allows 256-character code pages
but it requires glyph composition.  This would make it similar to SE Asian
and Indic processing.  But with fully composed glyphs you would classify the
language differently, probably as a large character set language with an
input method editor like the CJK languages.

Carl



-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of N.R.Liwal
Sent: Thursday, May 31, 2001 8:52 PM
To: Jungshik Shin
Cc: [EMAIL PROTECTED]
Subject: Re: RECOMMENDATIONs( Term Asian is not used properly on
Computers and NET)


Dear Jungshik Shin;

Thanks, good explinations, I hope those who are interested in Software and
Web for Asia will be
benefited.

Thanks.

Liwal

- Original Message -  On Wed, 30 May 2001, N.R.Liwal wrote:

[snip]

RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

2001-05-31 Thread Ayers, Mike


 From: Carl W. Brown [mailto:[EMAIL PROTECTED]]
 
 I resisted calling it FTF-8 (Funky Transfer Format - 8), but 
 if you want to
 call it Weird Transfer Format - 8, I don't have any real objections.

Well, that's ONE possible translation of WTF...


/|/|ike




RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread James E. Agenbroad

  Thursday, May 31, 2001
We seem to have strayed from searching for a clearer term than Asian.  I
think part of the problem is that many language names are also national
adjectives, e.g., Chinese, Japanese and Korean.  Likewise names of scripts
(or writing systems) are also often names of languages, e.g., Arabic.
 I would hope that input methods (for Chinese or Amharic characters) remain
a separate issue: so long as it results in a Unicode encoding that can be
unambiguously shared, it should not matter what keystrokes were used.  (An
analogy might be QWERTY vs. Dvorak input not affecting ASCII.)  Input methods
are still an important issue, but a separate one. 

 On Thu, 31 May 2001, Carl W. Brown wrote:

 Liwal,
 
 Such classifications are not easy.  For example Azeri can be written in both
 Latin and Cyrillic scripts.  The Latin script is much like Turkish which has
 the dotted and dot-less i.  This is not necessarily be big issue for fonts
 but is requires special case shifting logic.
 
 What do you do about scripts that are not tied to a locale?  The Orthodox
 Church uses a special Cyrillic font that is different from standard
 Cyrillic.
 
 The classifications vary not only by script but by how it affects you
 specific field of interest and the implementation.  For example Unicode
 implements Ethiopian has fully formed syllabic characters.  Some
 implementations use decomposed syllables.  This allows 256 byte code pages
 but it requires glyph composition.  This would make is similar to SE Asian
 and Indic processing.  But with fully composed glyphs you would classify the
 language differently probably as a large characters set language with an
 input method editor like the CJK languages.
 
 Carl
 
 
 
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of N.R.Liwal
 Sent: Thursday, May 31, 2001 8:52 PM
 To: Jungshik Shin
 Cc: [EMAIL PROTECTED]
 Subject: Re: RECOMMENDATIONs( Term Asian is not used properly on
 Computers and NET)
 
 
 Dear Jungshik Shin;
 
 Thanks, good explinations, I hope those who are interested in Software and
 Web for Asia will be
 benefited.
 
 Thanks.
 
 Liwal
 
 - Original Message -  On Wed, 30 May 2001, N.R.Liwal wrote:
 
   TERM ASIA IN COMPUTER  INTERNET (RECOMMENDATIONS UNICODE LIST MAY
 2001)
  
   So far the recomendations are, that Asian Text Fonts can be called:
   -Han Fonts or Hanzi Fonts
 
As already pointed out, this is not adqueate to cover Korean
  and Japanese because other scripts are also used for them. Moreover,
  Japanese may not like 'Hanzi' even if you're talking about
  Hanzi/Kanji/Hanja alone. Even 'Han' (which is more neutral) could be
  balked at by some.
 
   -East Asian Unified Fonts
   -East Asian Fonts
 
If they mean fonts for Chinese, Japanese and Korean writing
  systems, I would pick 'East Asian fonts'.
 
 
   Script Can be classified as:
   -languages which Han ideographs
 
  you're talking not about language(s) but about script(s) , right?
 
   -'ideographic languages' SCRIPT
 
 A language cannot be ideographic as I wrote before. Has anybody else
  mentioned this term other than me? I mentioned it not because I think it's
  appropriate BUT because I think that the term (ideographic language)
  MUST NOT be used.
 
   -East Asian Unified SCRIPT
 
What's been 'unified' is Han 'ideographs' while there ARE other
  scripts in (more predominant) use in the region (even if you only mean
  Chinese,Japanese and Korean by 'East Asian').
 
   - East Asian SCRIPT
 
What 'script' (not 'scripts') are you talking about here?
  If you just mean 'Han ideographs', I don't think you  need to come up with
  new term(s). I think 'Han ideograph' (or CJK ideographs if it ONLY means
  Hanzi/Kanji/Hanja and nothing else)  is good enough (although certainly
  not perfect.)  On the other hand, if you're talking about all the scripts
  used in Northeast/East Asian countries (or China, Japan and Korea),
  you CANNOT use any of the above with the possible exception of the last
  (which can be used provided that they're made plural 'East Asian Scripts'
  to reflect that there are *multiple* scripts in use.)
 
 
   Asian geographic expressions are better:
  -Southeast Asia, East Asia, CENTRAL ASIA
   WEST ASIA = Arabic Countries and  Neighborhood
 
I believe the following are widely used, at least in geography
 textbooks and encyclopedias. Also, many US schools with regional
 studies programs use similar divisions (except for Southwest Asia, which
 appears to be referred to as the 'Middle East' most of the time). This
 division is bound to be arbitrary to some degree (the Asian continent is
 not a circle or any definitive geometric shape which can be divided in an
 unambiguous way ;-) )
 
 
East Asia/Northeast Asia : Japan, Korea, China (it's a huge country,
 but)
'Far East' (in Western 

RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread James E. Agenbroad

   Thursday, May 31, 2001
My goal was never to give a specific number of glyphs needed to display a
particular Indian or other script.  As others have pointed out, this
depends, among other things, on the particular display device and its font
processing software, possibly including the operating system.  My goals
were to point out that Arabic and South and Southeast Asian scripts require:
1. many more glyphs than character codes, and 2. just as important, software
to render character codes legibly from the available glyphs. Discussions of a
single Unicode font that do not mention such software seem pointless, or
worse, managers might believe them.  I wonder if we could usefully define
levels of legibility for displaying a language or writing system, or is it
too subjective?  Is invoking a lam-alef ligature when an alef follows a lam
the minimal level for any language using the Arabic script?  For languages
using the Devanagari script, is transposing the short i matra (U+093F) to
precede the consonant(s) it follows the minimum?
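Jim's Devanagari example can be sketched as a tiny display-time transformation. This is only an illustration of the reordering rule, not a real rendering engine: the helper `reorder_i_matra` is hypothetical and handles nothing but U+093F and virama-joined consonant clusters.

```python
# Sketch: visual reordering of the Devanagari short-i matra (U+093F).
# In logical (Unicode) order the matra FOLLOWS the consonant(s) it
# applies to; for display it must be moved BEFORE the whole consonant
# cluster (consonants optionally joined by virama, U+094D).

I_MATRA = "\u093F"
VIRAMA = "\u094D"

def reorder_i_matra(text):
    """Return text with each U+093F moved before its consonant cluster."""
    out = []
    for ch in text:
        if ch == I_MATRA and out:
            # Walk back over the cluster: consonant (VIRAMA consonant)*
            j = len(out) - 1
            while j - 2 >= 0 and out[j - 1] == VIRAMA:
                j -= 2
            out.insert(j, I_MATRA)
        else:
            out.append(ch)
    return "".join(out)

# KA (U+0915) + I-MATRA: the matra is displayed before KA.
assert reorder_i_matra("\u0915\u093F") == "\u093F\u0915"
```

A real shaping engine does much more (conjunct formation, repha, split matras), but this is the minimal behavior Jim asks about.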
 Regards,
  Jim Agenbroad (disclaimer and address at bottom)
 On Thu, 31 May 2001, Marco Cimarosti wrote:

 Mike Meir wrote:
  The problem with your glyph statistics is that they are based 
  on mould counts employed by the Monotype hot metal typesetters.
 
 I agree: no one will ever come up with *the* correct count.
 
 Such general evaluations simply depend on too many things to be useful.
 E.g.: which language(s) are targeted, what degree of typographic excellence
 is required, and (as Mike explained very well) the kind of technology
 involved and its limitations.
 
 The simple fact that software fonts can overlay glyphs can greatly reduce
 the glyph count compared to lead type. Similarly, the fact that a software
 font technology can kern glyphs vertically can dramatically reduce the
 inventory of glyphs needed for certain scripts.
 
 Moreover, different technologies may have totally different meanings for the
 word 'glyph'. E.g., I have heard of Arabic fonts that analyze the Arabic
 script well below the level of a grapheme: segments of lines and
 individual dots were stored separately and assembled at display time.
 Comparing the number of glyphs in such a font with the inventory of a
 more traditional font is what we call adding up apples and pears.
 
  Turning to Devanagari, our researches indicate that the total 
  number of script units (In Unicode terms, combinations of 
  consonants, halants, vowel signs and other signs),  excluding 
  the Unicode characters in the range 0951 to 0954, in use is 
  around the 5550 mark. It is actually greater than this, since 
  there are a number of characters relating to Sanskrit sandhi 
  for which we do not have any conjunct-vowel statistics.
 
 As an opposite example for Devanagari, I did a little research on my own on
 a minimal rendering scheme for Unicode Indic scripts. The scenario behind
 this evaluation was low-resolution displays or printers and simple bitmapped
 fonts.
 
 For Devanagari's 77 characters (non-decomposable L and M characters) my
 set of glyphs was just 82 pieces. Of course, such a ratio (about 1:1.06)
 requires dropping any typographical gracefulness: of all the complexity of
 Devanagari, just a handful of half-consonants and ligatures was preserved.
 
 Neither your 5550 nor my 82 are of much use to anyone who has even
 slightly different requirements. However, the contrast between these two
 figures perhaps says something about the difficulty of such a count.
 
 _ Marco
 
 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any government.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

2001-05-31 Thread Carl W. Brown

James,

One of the reasons for grouping CJK together is that they have similar
implementation strategies.  If we are grouping for that reason then maybe
Aramaic languages should fall into the same category.  In that case Asian
is a very poor term to use.  However Han/Hanzi does not work either.

Implementation is very important.  For example, Korean, except for occasional
Han characters, is functionally much closer to the Indic scripts.  If it were
not for the crude font handling of older systems, we probably would not
implement Korean as a fully formed character set.
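Carl's observation has a concrete basis in the standard: Unicode's precomposed ("fully formed") Hangul syllables are purely algorithmic combinations of decomposed jamo. A minimal sketch of the standard Hangul composition arithmetic (constants are those of the Unicode Hangul algorithm):

```python
# Sketch: the algorithmic mapping between decomposed Hangul jamo and
# precomposed syllables.  Constants come from the standard Unicode
# Hangul composition algorithm.
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose(lead, vowel, trail=None):
    """Compose leading jamo + vowel jamo (+ optional trailing jamo)."""
    l = ord(lead) - L_BASE
    v = ord(vowel) - V_BASE
    t = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# HIEUH (U+1112) + A (U+1161) + NIEUN (U+11AB) -> U+D55C, 'han'
syl = compose("\u1112", "\u1161", "\u11AB")
assert syl == "\uD55C"
# NFD normalization recovers the decomposed jamo sequence:
assert unicodedata.normalize("NFD", syl) == "\u1112\u1161\u11AB"
```

Either representation carries the same information; the choice between them is exactly the implementation trade-off (small glyph inventory plus composition logic vs. a large precomposed set) being discussed.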

Carl

-Original Message-
From: James E. Agenbroad [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 31, 2001 12:30 PM
To: Carl W. Brown
Cc: [EMAIL PROTECTED]
Subject: RE: RECOMMENDATIONs( Term Asian is not used properly on
Computers and NET)


  Thursday, May 31, 2001
We seem to have strayed from searching for a clearer term than Asian.  I
think part of the problem is that many language names are also national
adjectives, e.g., Chinese, Japanese and Korean.  Likewise names of scripts
(or writing systems) are also often names of languages, e.g., Arabic.
 I would hope that input methods (for Chinese or Amharic characters) remain
a separate issue: so long as the result is a Unicode encoding that can be
unambiguously shared, it should not matter what keystrokes were used.  (An
analogy might be QWERTY vs. Dvorak input not affecting ASCII.) Input methods
are still an important issue, but a separate one.

 On Thu, 31 May 2001, Carl W. Brown wrote:

 Liwal,

 Such classifications are not easy.  For example, Azeri can be written in
 both the Latin and Cyrillic scripts.  The Latin script is much like Turkish,
 which has the dotted and dot-less i.  This is not necessarily a big issue
 for fonts, but it requires special case-shifting logic.
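The case-shifting wrinkle Carl refers to can be shown in a few lines. This is a rough sketch only: `turkish_lower` is a hypothetical helper handling just the i/ı pair, not full locale-aware casing.

```python
# Sketch: Turkish/Azeri Latin casing, where I pairs with dotless
# U+0131 and dotted-capital U+0130 pairs with i.  A locale-blind
# lower() gets both wrong.

def turkish_lower(s):
    """Lowercase with the Turkish i rules applied first."""
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

assert "I".lower() == "i"               # default Unicode behavior
assert turkish_lower("I") == "\u0131"   # Turkish: I -> dotless i
assert turkish_lower("\u0130") == "i"   # Turkish: dotted I -> i
```

(Real implementations use the locale-sensitive mappings in Unicode's SpecialCasing data rather than ad-hoc replaces, but the principle, that casing can depend on language rather than script alone, is the same.)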

 What do you do about scripts that are not tied to a locale?  The Orthodox
 Church uses a special Cyrillic font that is different from standard
 Cyrillic.

 The classifications vary not only by script but by how the script affects
 your specific field of interest and the implementation.  For example,
 Unicode implements Ethiopic as fully formed syllabic characters.  Some
 implementations use decomposed syllables.  This allows 256-entry code pages
 but requires glyph composition, which would make it similar to SE Asian
 and Indic processing.  But with fully composed glyphs you would probably
 classify the language differently, as a large-character-set language with an
 input method editor, like the CJK languages.

 Carl




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Edward Cherlin

At 5:35 PM +0200 5/31/01, Marco Cimarosti wrote:
Jungshik Shin wrote:
I think I know how you counted (initial consonants:
  two for syllables with and without final consonants, three for three
  kinds of vowel position/shape, vowels: two for syll.
  with/without final consonants) and think you got it right.

You caught me with hands in jam: that was exactly my way of thinking. While
I see that this is clearly too naive to be right, I would not be able to
improve it any further myself.
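For what it is worth, the counting scheme quoted above reduces to simple arithmetic. The jamo class sizes (19 leading consonants, 21 vowels, 27 non-empty trailing consonants) are the standard modern-jamo counts; the variants-per-class figures are the reconstruction in the quoted text, so the total is only as good as that guess.

```python
# Sketch: the glyph-variant counting scheme as plain arithmetic.
# Variants per class (reconstructed above): initial consonants get
# 2 forms (with/without final) x 3 vowel position/shape classes;
# vowels get 2 forms; trailing consonants get 1.
LEADS, VOWELS, TRAILS = 19, 21, 27

component_glyphs = LEADS * (2 * 3) + VOWELS * 2 + TRAILS * 1
precomposed_syllables = LEADS * VOWELS * (TRAILS + 1)

print(component_glyphs)       # 183 component glyphs
print(precomposed_syllables)  # vs 11172 precomposed syllables
```

A few hundred component glyphs versus eleven thousand precomposed ones is the kind of contrast driving this whole thread, even if the exact figures are debatable.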

I welcome any refinement. Especially, I was curious about the other ratios
(DOS 1:8,1:4,1:4; X11win 1:10,1:3,1:4; TrueType 1:~30) that you mentioned in
your previous message.

_ Marco

A quick look at the Hangul syllable table starting on page 744 of 
TOS3 shows a much greater variation. If you look at the pages 
slightly cross-eyed so that each glyph aligns with a neighbor, and 
wink each eye alternately, you can get the effect of a blink 
comparator of the type used in astronomy before computer image 
processing became practical. If you can't keep the alignment while 
winking, just look for the fuzzy letters where the glyphs don't match 
up.

Or we could ask a typographer.  :-)
-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland




RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread Edward Cherlin

At 5:12 PM +0200 5/31/01, Marco Cimarosti wrote:
Hi.

 Well, it can be said to be above  the minimum :-) depending on
  how you look at things. If you're a developer of embedded
  device with a
  really stringent requirement in memory footprint (for font
  and others),
  you may just go with 1:1 ratios for all three groups of Jamos
  (consonants
  and vowels) as found in old (mechanical) Hangul typewriters. However,
  as you can guess, the result is not pleasing to most eyes.

The manual Hangul typewriter I learned on had multiple forms for 
initial consonants, supplied by means of an extra shift level. (Yes! 
A mechanical buckybit!!  %-[ )

The really minimal level was *linear* Hangul produced by the telegraph system.

[snip]

The minimal model that I have in mind is slightly less minimal: the least
quality that won't sacrifice the normal orthographic rules of a language.

Which rules are the normal ones? Every publisher I've had anything to 
do with has used different sets of rules, over quite a wide range. We 
can't even agree whether ligatures are required in English, or 
whether an ASCII-sorted index is sufficiently human-readable.

Ciao.
Marco

-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland