RE: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)
At 9:21 AM -0700 5/30/01, Carl W. Brown wrote: Sorry, Han or Hanzi is not adequate to cover Korean. If you want to get picky, I am sure that most people are aware that there are Chinese minority languages, for example, that use other fonts. Typically the term CJK works for most of us. Those who don't understand the term are generally not familiar with the issues. With Unicode you don't have the MBCS issues. What is left are more subtle issues. You could call them East Asian fonts as long as you distinguished them from Southeast Asian fonts, which, except for Vietnamese, are more like Indic fonts. Carl

[sigh] There is no such thing as the correct names for anything. If people agree to use names in the same way, we have achieved something, and if the names reflect the structure of the things in question even a little, we have achieved a lot. The names Europe and Asia are accidents of Greek history and culture passed down for more than two millennia, not real geographic divisions, and certainly not linguistic divisions. Europe was the Greek territories to the west of the Bosporus (+barbarians), and Asia was the Greek territories to the east of the Bosporus (+barbarians).

I like to use the term Han characters to refer to the characters that came down to us from the Han, plus their ancestors back to the oracle bones and other characters created later on within the same tradition. This includes PRC Simplified and Vietnamese Chu Nom, but not other characters used in various writing systems alongside the Han characters: Zhuyin, Hangul, Kana, Western (Arabic/Hindu) numerals, punctuation, etc. I prefer not to write or speak of Han scripts. I am willing to use CJK or CJKV for writing systems that make (or used to make) essential use of Han characters, even though both terms are seriously inaccurate.
I prefer not to use geographical terms for linguistic ideas, except in the rare cases, like India, where the geographic boundaries were drawn to match linguistic divisions (based, in their case, on religious divisions). I do not expect anybody in particular to agree with me on these usages, and you can talk to me if you have A Better Idea[TM]. YMMV.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of N.R.Liwal
Sent: Wednesday, May 30, 2001 11:11 AM
To: [EMAIL PROTECTED]
Subject: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)

TERM ASIA IN COMPUTER INTERNET (RECOMMENDATIONS, UNICODE LIST, MAY 2001)

So far the recommendations are that Asian text fonts can be called:
- Han Fonts or Hanzi Fonts
- East Asian Unified Fonts
- East Asian Fonts

Urghh. Chinese fonts, Korean fonts, Japanese fonts, Chu Nom fonts, etc.; CJK fonts; Unicode fonts.

Script can be classified as:
- languages which [use] Han ideographs
- 'ideographic languages' SCRIPT
- East Asian Unified SCRIPT
- East Asian SCRIPT

Urghh. Urghh. Traditional Chinese writing system (Han with numerals, punctuation, etc., with or without Zhuyin); Simplified Chinese writing system (similarly); Korean writing system (Hangul with or without Hanja, but with numerals, etc.); Japanese writing system (Kanji, Hiragana, Katakana, numerals, symbols, etc.). In each case with the possibility of adding the Latin alphabet (Pinyin, romaji) and perhaps Cyrillic and Greek. As I said earlier, there are no correct names except possibly by agreement.

Asian geographic expressions are better:
- Southeast Asia, East Asia
- CENTRAL ASIA
- WEST ASIA = Arabic Countries and Neighborhood

Triple Urghh. Have you ever heard the term granfalloon? The only association between location and language is *political*, and there is no nation without minorities. Let us speak with moderate precision of languages usually or sometimes written in Arabic script or ... Indic scripts and the like. Please.
Thanks to all who participated in discussion: You're certainly welcome. N.R.Liwal, Asiaosft, http://www.liwal.net [snip] -- Edward Cherlin, Generalist. "A knot!" exclaimed Alice. "Oh, do let me help to undo it." Alice in Wonderland
RE: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)
Um. Okay, what is the font supposed to have? Is this list correct?? 1. Han 2. Kana 3. Hangul 4. Those many, many Latin letters with diacritics for Vietnamese use 5. Probably also ASCII and misc. Han punctuation and similar odds and ends (sigh) Are you sure you want just *one* box for that? I think you want four. ARRRGGHH ★じゅういっちゃん★ "AIS TSXQ QDOO TD AISC TDQMIG, HYCTDL, ZIC HIIUPLB XSHM GDOPHPISX CYTDL." "QMD XDHCDQ, AIS XDD, PX QMDCD'X LI CDHPWD. P VSXQ WSQ RMYQ P MYED KA TA YCT PL."
RE: UTF-8S (was: Re: ISO vs Unicode UTF-8)
Kenneth Whistler wrote: Plane 14 PUA usage description tags? Naaah, nobody would suggest such a bizarre thing, would they? The three words PUA usage description are redundant, methinks. Removing them leaves a more concise and dramatic example of a weird proposal. _ Marco
Some Char. to Glyph Statistics, Pan/Single Font
The problem with your glyph statistics is that they are based on mould counts employed by the Monotype hot metal typesetters. The Monotype system was capable of extensive kerning, and therefore many glyphs were constructed from the elements provided by the moulds at the time of composition. The Monotype list of elements therefore comprises:
- Full characters which are either basic or could not be composed satisfactorily by the system for whatever reason. These might properly be described as glyphs.
- Elements which were combined either with the first set, or with one another, to create glyphs, or approximations to glyphs, at the time of casting. These cannot really be considered to be glyphs as such.

However, if one allows that these elements are glyphs, the real number of glyphs employed by Monotype was limited by the matrix case: before 1962 to 225 sorts, and subsequently to 272 sorts. Although additional sorts might be available, they could only be used by substitution with another sort prior to any actual typesetting. More recent Monotype code pages for Bengali seem to be around 450 elements, which are combined with floating elements to create text.

To date, all Indic script composition has been pretty much limited by technology. Taking Bengali as an example, Figgins, around 1826, employed 370 sorts, many of which are kerning versions of other sorts, allowing the composition either of consonant-vowel combinations or of approximations to complex conjuncts which were insufficiently common to warrant the creation of separate punches. But again, a number of his sorts exist only to allow the incorporation of combinations which could not be produced by the technology of the time. Our recent revision of the Linotype Bengali code page extends to a font of some 980 elements.
136 of these are differently spaced floating elements, such as vowel signs and chandrabindus, which have no meaning separate from the main characters to which they may be attached, and which would be omitted from an OpenType version. It also includes 146 characters which duplicate the Unicode-encoded Bengali characters, which is required for current technological reasons - Microsoft's Office XP does not allow the display of Unicode-encoded Bengali characters in the font, or at the size which is expected. So the "real" number of elements is 698. (I may also add that we have had to produce alternative versions of the same fonts in which non-spacing elements actually space quite considerably, because of the very strange behaviour of Microsoft's Internet Explorer 5.5, so the glyph count is larger than the 980 - another case of technology determining counts.)

Turning to Devanagari, our research indicates that the total number of script units in use (in Unicode terms, combinations of consonants, halants, vowel signs and other signs), excluding the Unicode characters in the range 0951 to 0954, is around the 5550 mark. It is actually greater than this, since there are a number of characters relating to Sanskrit sandhi for which we do not have any conjunct-vowel statistics. In principle, all these should be regarded as glyphs, though few fonts are likely to implement them all (the slaves in this context needing to be human beings, since the issue of the spacing and modification of a smaller number of base elements to produce all these glyphs is an aesthetic rather than a mechanical problem). I have also not included in the count the many variant forms of glyphs which occur as a result of differences in formulation for particular combinations.
(I have also excluded the rather large number of glyphs which are to be found in the Mangal font supplied by Microsoft, but which seem to be there purely because of a rather strange and literal interpretation of the Unicode Devanagari shaping rules, on the grounds that these glyphs exist only in the font, and would never be used in text.)
RE: Some Char. to Glyph Statistics, Pan/Single Font
Hi. Well, it can be said to be above the minimum :-) depending on how you look at things. If you're a developer of an embedded device with really stringent requirements on memory footprint (for fonts and other things), you may just go with 1:1 ratios for all three groups of jamos (consonants and vowels), as found in old (mechanical) Hangul typewriters. However, as you can guess, the result is not pleasing to most eyes.

Of course. If the requirements are even more stringent (e.g., the user is blind) you can even represent the letters with a 2x3 matrix of pixels. Similarly, when I was a child, the first companies that started using electronic brains to bill customers sent notes printed in all capital letters and with no apostrophes. The minimal model that I have in mind is slightly less minimal: the least quality that won't sacrifice the normal orthographic rules of a language. Ciao. Marco
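For reference, the decomposition such a 1:1 jamo font would rely on is the standard Hangul syllable arithmetic defined in the Unicode Standard. A minimal sketch (my own illustration, not code from the thread):

```python
# Decompose a precomposed Hangul syllable (U+AC00..U+D7A3) into jamo
# indices, using the arithmetic defined in the Unicode Standard.
# A minimal 1:1 font maps each index to one fixed glyph, with none of
# the positional variants a pleasing font would add.
S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28

def decompose(syllable: str):
    s = ord(syllable) - S_BASE
    if not 0 <= s < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l, rest = divmod(s, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    return l, v, t          # leading consonant, vowel, trailing (0 = none)

print(decompose("한"))      # -> (18, 0, 4): hieuh + a + nieun
```

Multiplying each index class by a chosen number of positional variants is exactly what produces the char-to-glyph ratios discussed in this thread.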
RE: Some Char. to Glyph Statistics, Pan/Single Font
Mike Meir wrote: The problem with your glyph statistics is that they are based on mould counts employed by the Monotype hot metal typesetters.

I agree: no one will ever come up with *the* correct count. Such general evaluations simply depend on too many things to be useful. E.g.: which language(s) are targeted, what degree of typographic excellence is required, and (as Mike explained very well) the kind of technology involved and its limitations. The simple fact that software fonts can overlay glyphs can cause a great factor of reduction, compared to lead type. Similarly, the fact that a software font technology has the capability of kerning glyphs vertically can dramatically reduce the inventory of glyphs needed for certain scripts. Moreover, different technologies may have totally different meanings for the word glyph. E.g., I have heard of Arabic fonts that analyze the Arabic script well below the level of a grapheme: segments of lines and individual dots were stored separately and assembled at display time. Comparing the number of glyphs in such a font with the inventory of a more traditional font is what we call "summing apples and pears".

Turning to Devanagari, our research indicates that the total number of script units in use (in Unicode terms, combinations of consonants, halants, vowel signs and other signs), excluding the Unicode characters in the range 0951 to 0954, is around the 5550 mark. It is actually greater than this, since there are a number of characters relating to Sanskrit sandhi for which we do not have any conjunct-vowel statistics.

As an opposite example for Devanagari, I did a little research on my own on a minimal rendering scheme for Unicode Indic scripts. The scenario behind this evaluation was low-resolution displays or printers and simple bitmapped fonts. For Devanagari's 77 characters (non-decomposable L and M characters) my set of glyphs was just 82 pieces.
Of course, such a ratio (about 1:1.06) requires dropping any typographical gracefulness: of all the complexity of Devanagari, just a handful of half-consonants and ligatures was preserved. Neither your 5550 nor my 82 are of much use to anyone who has even slightly different requirements. However, the contrast between these two figures perhaps says something about the difficulty of such a count. _ Marco
RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)
Simon, I now see that you support both "UTF8", where surrogates are encoded as 6 bytes, and "AL32UTF8", where surrogates are encoded as 4 bytes. The way your documentation reads, many users are likely to select "UTF8" over "AL32UTF8". You should have users who already have UTF8 databases migrate to the proper UTF8 encoding rather than making them the exception to the rule. If you have this funny encoding, please don't call it UTF8, because it is not UTF8 and will only confuse users. You could call it OTF8 or something like that, but not UTF8. Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Simon Law
Sent: Wednesday, May 30, 2001 11:02 AM
To: [EMAIL PROTECTED]
Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Hi Folks, Over the last few days, this email thread has generated many interesting discussions on the proposal of UTF-8s. At the same time some speculation has been generated on why Oracle is asking for this encoding form. I hope to clarify some of this misinformation in this email. In Oracle9i, our next database release shipping this summer, we have introduced support for two new Unicode character sets. One is 'AL16UTF16', which supports the UTF-16 encoding, and the other is 'AL32UTF8', which is the fully UTF-8 compliant character set. Both of these conform to the Unicode standard, and surrogate characters are stored strictly in 4 bytes. For more information on Unicode support in Oracle9i, please check out the whitepaper "The power of Globalization Technology" on http://otn.oracle.com/products/oracle9i/content.html

The requests for UTF-8s came from many of our packaged applications customers (such as PeopleSoft, SAP, etc.); the ordering of the binary sort is an important requirement for these Oracle customers.
We are supporting them and we hope to turn this into a TR such that UTF-8s can be referenced by other vendors when they need to have compatible binary order for UTF-16 and UTF-8 across different platforms. The speculation that we are pushing for UTF-8s because we are trying to minimize our code changes for supporting surrogates, or because of our unique database design, is totally false. Oracle has a fully internationalized, extensible architecture and has introduced surrogate support in Oracle9i. In fact, we are probably the first database vendor to support both the UTF-16 and UTF-8 encoding forms; we will continue to support them and conform to future enhancements to the Unicode Standard. Regards, Simon

"Carl W. Brown" wrote: Ken, I suspect that Oracle is specifically pushing for this standard because of its unique database design. In a sense, Oracle almost picks itself up by its own bootstraps. It has always tried to minimize actual code. Therefore it was a natural choice to implement Unicode with UTF-8, because it is easy to reuse the multibyte support with minor changes to handle a different character-length algorithm. This has been one of the reasons that Oracle has been successful. Its tinker-toy-like design has enabled them to quickly adapt and add new features. Now, however, they should take the time to "do it right". Its UTF-8 storage creates problems for database designers because they cannot predict field sizes. This is a problem with MBCS code pages, but UTF-8s will make it worse. There will be lots of wasted storage when characters can vary in size from 1 to 6 bytes. Most other database systems require specific code to support Unicode. As a consequence, most have implemented using UCS-2. Their migration is obviously to use UTF-16. UTF-8s buys them nothing but headaches.
Carl

-----Original Message-----
From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, May 29, 2001 3:47 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Carl, Ken, UTF-8s is essentially a way to ignore surrogate processing. It allows a company to encode UTF-16 with UCS-2 logic. The problem is that by not implementing surrogate support you can introduce subtle errors. For example, it is common to break buffers apart into segments. These segments may be reconcatenated, but they may be processed individually.

You are preaching to the choir here. I didn't state that *I* was in favor of UTF-8S -- only that we have to be careful not to assume that UTC will obviously not support it. The proponents of UTF-8S are vigorously and actively campaigning for their proposal. In standardization committees, proposals that have committed, active proponents who can aim for the long haul often have a way of getting adopted in one form or another, unless there are
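The encoding difference being argued over can be made concrete with a short sketch (mine, not Oracle's code) of how one supplementary character comes out in standard UTF-8 versus the proposed UTF-8s, and why the latter preserves UTF-16 binary order:

```python
# U+10400 (DESERET CAPITAL LETTER LONG I) as a worked example.
# UTF-8s (like CESU-8) encodes each UTF-16 surrogate code unit as a
# separate 3-byte sequence (6 bytes total); standard UTF-8 encodes
# the code point directly in 4 bytes.

def utf8s(cp: int) -> bytes:
    """Encode one code point the UTF-8s way (sketch of the proposal)."""
    if cp < 0x10000:
        return chr(cp).encode("utf-8")
    cp -= 0x10000
    hi, lo = 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)
    return (chr(hi).encode("utf-8", "surrogatepass")
            + chr(lo).encode("utf-8", "surrogatepass"))

print("\U00010400".encode("utf-8").hex())  # f0909080      (4 bytes)
print(utf8s(0x10400).hex())                # eda081edb080  (6 bytes)

# The sort-order motivation: in UTF-16, surrogate code units
# (D800..DFFF) compare lower than a BMP character such as U+FFFD, so
# supplementary characters sort before it. Standard UTF-8 puts them
# after; UTF-8s reproduces the UTF-16 order.
assert utf8s(0x10400) < "\uFFFD".encode("utf-8")                 # UTF-16-like
assert "\U00010400".encode("utf-8") > "\uFFFD".encode("utf-8")   # code-point order
```

This is also why the thread calls 1-to-6-byte character sizes a storage headache: the same character costs two extra bytes in UTF-8s.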
RE: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)
If we mean CJK why can't we say CJK? Jony
RE: Some Char. to Glyph Statistics, Pan/Single Font
Jungshik Shin wrote: I think I know how you counted (initial consonants: two for syllables with and without final consonants, three for three kinds of vowel position/shape; vowels: two for syllables with/without final consonants) and think you got it right.

You caught me with my hands in the jam: that was exactly my way of thinking. While I see that this is clearly too naive to be right, I would not be able to improve it any further myself. I welcome any refinement. Especially, I was curious about the other ratios (DOS 1:8, 1:4, 1:4; X11win 1:10, 1:3, 1:4; TrueType 1:~30) that you mentioned in your previous message. _ Marco
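For what it's worth, the arithmetic behind that way of thinking can be spelled out. The 6/2/1 variant counts below are my reading of Jungshik's description, not figures from any real font:

```python
# Modern jamo inventory: 19 initial consonants, 21 vowels, 27 trailing
# consonants (the "no trailing consonant" case needs no glyph).
# Variant counts per the scheme described above:
#   initials: 2 (with/without final) x 3 (vowel shapes) = 6
#   vowels:   2 (with/without final)
#   finals:   1
INITIALS, VOWELS, FINALS = 19, 21, 27
variants = {"initial": 2 * 3, "vowel": 2, "final": 1}

glyphs = (INITIALS * variants["initial"]
          + VOWELS * variants["vowel"]
          + FINALS * variants["final"])
jamos = INITIALS + VOWELS + FINALS

print(glyphs, jamos)             # 183 67
print(round(glyphs / jamos, 2))  # overall char-to-glyph ratio ~2.73
```

Plugging in the per-class ratios Jungshik quoted for DOS, X11 or TrueType fonts instead of 6/2/1 gives the correspondingly larger inventories.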
Re: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)
Dear Jungshik Shin; Thanks, good explanations. I hope those who are interested in software and the Web for Asia will benefit. Thanks. Liwal

----- Original Message -----
On Wed, 30 May 2001, N.R.Liwal wrote: TERM ASIA IN COMPUTER INTERNET (RECOMMENDATIONS, UNICODE LIST, MAY 2001) So far the recommendations are that Asian text fonts can be called: -Han Fonts or Hanzi Fonts

As already pointed out, this is not adequate to cover Korean and Japanese, because other scripts are also used for them. Moreover, Japanese speakers may not like 'Hanzi' even if you're talking about Hanzi/Kanji/Hanja alone. Even 'Han' (which is more neutral) could be balked at by some. -East Asian Unified Fonts -East Asian Fonts

If they mean fonts for the Chinese, Japanese and Korean writing systems, I would pick 'East Asian fonts'. Script can be classified as: -languages which [use] Han ideographs

You're talking not about language(s) but about script(s), right? -'ideographic languages' SCRIPT

A language cannot be ideographic, as I wrote before. Has anybody else mentioned this term other than me? I mentioned it not because I think it's appropriate BUT because I think that the term (ideographic language) MUST NOT be used. -East Asian Unified SCRIPT

What's been 'unified' is Han 'ideographs', while there ARE other scripts in (more predominant) use in the region (even if you only mean Chinese, Japanese and Korean by 'East Asian'). - East Asian SCRIPT

What 'script' (not 'scripts') are you talking about here? If you just mean 'Han ideographs', I don't think you need to come up with new term(s). I think 'Han ideograph' (or CJK ideographs, if it ONLY means Hanzi/Kanji/Hanja and nothing else) is good enough (although certainly not perfect).
On the other hand, if you're talking about all the scripts used in Northeast/East Asian countries (or China, Japan and Korea), you CANNOT use any of the above, with the possible exception of the last (which can be used provided that it's made plural, 'East Asian Scripts', to reflect that there are *multiple* scripts in use). Asian geographic expressions are better: -Southeast Asia, East Asia CENTRAL ASIA WEST ASIA = Arabic Countries and Neighborhood

I believe the following are widely used at least in geography textbooks and encyclopedias. Also, many US schools with regional studies programs use similar divisions (except that Southwest Asia appears to be referred to as the 'Middle East' most of the time). This division is bound to be arbitrary to some degree (the Asian continent is not a circle or any definitive geometric shape which can be divided in an unambiguous way ;-) )

East Asia/Northeast Asia: Japan, Korea, China (it's a huge country, but); 'Far East' (in Western media and at least in some East Asian media :-) )
Southeast Asia: Indochina, Malaysia, Singapore, Indonesia, Thailand, Burma, ...
South Asia: India, Pakistan, Sri Lanka, Bangladesh, Nepal, ...
Southwest Asia: the part of Asia usually called the 'Middle East' (in Western media and at least in some East Asian media :-) ): the Arabian peninsula, Iran, Iraq, Turkey (Near East?), Afghanistan (it could be put in South Asia...)
Central Asia: Mongolia and some former republics of the USSR (now independent, e.g. Kazakhstan)
North Asia (??): Siberia?

FYI, Mozilla uses the following:
East Asian: Chinese, Japanese, Korean
SE SW Asian: Thai, Armenian*, Turkish*
Middle Eastern: Hebrew, Arabic
Western European: ..., Greek* (why?), ...
Eastern European: ...

I guess it's better than Office XP, which calls Chinese, Japanese, Korean 'Asian', but it could still have done better.
(Middle East and SW Asia overlap each other, so they had better split up 'SE SW Asian', remove 'Middle Eastern', put Armenian, Turkish, Hebrew and Arabic into 'SW Asian', and fill up 'SE Asian' with Thai, Vietnamese, Cambodian and so forth when they get supported.) That is, I would use the following for programs like web browsers and word processors:

East Asian: Chinese, Japanese, Korean + some more (or NE Asian) if necessary and supported (e.g. Yi)
SE Asian: Thai, Vietnamese, Lao, Khmer, etc.
South Asian: various Indic scripts (other than those included in SE Asian), Tibetan*
SW Asian: Arabic, Hebrew, Syriac, Armenian*, Turkish*, etc. (Middle Eastern)
Central Asian: Mongolian, Kazakhstan(?), when supported

Of course, geographic break-up has its pitfalls and some people for sure wouldn't like it for various reasons. For instance, Turkish and
RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)
If you have this funny encoding please don't call it UTF8 because it is not UTF8 and will only confuse users. You could call it OTF8 or something like that but not UTF8. How about WTF-8? Sorry - I couldn't resist. /|/|ike
RE: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)
Liwal, Such classifications are not easy. For example, Azeri can be written in both Latin and Cyrillic scripts. The Latin script is much like Turkish, which has the dotted and dotless i. This is not necessarily a big issue for fonts, but it requires special case-shifting logic. What do you do about scripts that are not tied to a locale? The Orthodox Church uses a special Cyrillic font that is different from standard Cyrillic. The classifications vary not only by script but by how it affects your specific field of interest and the implementation. For example, Unicode implements Ethiopic as fully formed syllabic characters. Some implementations use decomposed syllables. This allows single-byte (256-entry) code pages, but it requires glyph composition. This would make it similar to SE Asian and Indic processing. But with fully composed glyphs you would classify the language differently, probably as a large-character-set language with an input method editor, like the CJK languages. Carl

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of N.R.Liwal
Sent: Thursday, May 31, 2001 8:52 PM
To: Jungshik Shin
Cc: [EMAIL PROTECTED]
Subject: Re: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)

[snip]
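Carl's Turkish/Azeri point about "special case shifting logic" is easy to demonstrate: in those languages, the Latin letters i and I are not a case pair. A minimal sketch (mine; a real implementation would use locale-aware tailoring such as ICU provides):

```python
# In Turkish and Azeri orthography, dotted i pairs with dotted İ
# (U+0130), and dotless ı (U+0131) pairs with dotless I, so the
# default Unicode case mappings give wrong results for these locales.
def turkic_upper(s: str) -> str:
    return s.replace("i", "\u0130").upper()          # map i -> İ first

def turkic_lower(s: str) -> str:
    return s.replace("I", "\u0131").replace("\u0130", "i").lower()

print(turkic_upper("istanbul"))  # İSTANBUL (default upper() gives ISTANBUL)
print(turkic_lower("DIŞ"))       # dış, "exterior" - not diş, "tooth"
```

The last pair shows why this is more than a cosmetic issue: picking the wrong i changes the word.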
RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)
From: Carl W. Brown [mailto:[EMAIL PROTECTED]] I resisted calling it FTF-8 (Funky Transfer Format - 8), but if you want to call it Weird Transfer Format - 8, I don't have any real objections. Well, that's ONE possible translation of WTF... /|/|ike
RE: RECOMMENDATIONs (Term Asian is not used properly on Computers and NET)
Thursday, May 31, 2001. We seem to have strayed from searching for a clearer term than Asian. I think part of the problem is that many language names are also national adjectives, e.g., Chinese, Japanese and Korean. Likewise, names of scripts (or writing systems) are also often names of languages, e.g., Arabic. I would hope that input methods (for Chinese or Amharic characters) remain a separate issue: so long as the result is a Unicode encoding that can be unambiguously shared, it should not matter what keystrokes were used. (An analogy might be QWERTY vs. Dvorak input not affecting ASCII.) Input methods are still an important issue, but a separate one.

On Thu, 31 May 2001, Carl W. Brown wrote: [snip]
RE: Some Char. to Glyph Statistics, Pan/Single Font
Thursday, May 31, 2001

My goal was never to give a specific number of glyphs needed to display a particular Indian or other script. As others have pointed out, this depends, among other things, on the particular display device and its font-processing software, possibly including the operating system. My goals were to point out that Arabic and South and Southeast Asian scripts require: 1. many more glyphs than character codes, and 2. just as important, software to render character codes legibly from the available glyphs. Discussions of a single Unicode font that do not mention such software seem pointless, or worse: managers might believe them.

I wonder if we could usefully define levels of legibility for displaying a language or writing system, or is that too subjective? Is forming a lam-alef ligature whenever alef follows lam the minimal level for any language using the Arabic script? For languages using the Devanagari script, is transposing the short i matra (U+093F) to precede the consonant(s) it logically follows the minimum?

Regards, Jim Agenbroad (disclaimer and address at bottom)

On Thu, 31 May 2001, Marco Cimarosti wrote:

Mike Meir wrote: The problem with your glyph statistics is that they are based on mould counts employed by the Monotype hot-metal typesetters.

I agree: no one will ever come up with *the* correct count. Such general evaluations simply depend on too many things to be useful. E.g.: which language(s) are targeted, what degree of typographic excellence is required, and (as Mike explained very well) the kind of technology involved and its limitations. The simple fact that software fonts can overlay glyphs can greatly reduce the count compared to lead type. Similarly, the fact that a software font technology can kern glyphs vertically can dramatically reduce the inventory of glyphs needed for certain scripts. Moreover, different technologies may have totally different meanings for the word "glyph".
E.g., I have heard of Arabic fonts that analyze the Arabic script well below the level of a grapheme: segments of lines and individual dots were stored separately and assembled at display time. Comparing the number of glyphs in such a font with the inventory of a more traditional font is what we call adding up apples and pears.

Turning to Devanagari, our research indicates that the total number of script units in use (in Unicode terms, combinations of consonants, halants, vowel signs and other signs, excluding the Unicode characters in the range 0951 to 0954) is around the 5550 mark. It is actually greater than this, since there are a number of characters relating to Sanskrit sandhi for which we do not have any conjunct-vowel statistics.

As an opposite example for Devanagari, I did a little research of my own on a minimal rendering scheme for Unicode Indic scripts. The scenario behind this evaluation was low-resolution displays or printers and simple bitmapped fonts. For Devanagari's 77 characters (non-decomposable L and M characters) my set of glyphs was just 82 pieces. Of course, such a ratio (about 1:1.06) requires dropping any typographical gracefulness: of all the complexity of Devanagari, just a handful of half-consonants and ligatures was preserved.

Neither your 5550 nor my 82 is of much use to anyone who has even slightly different requirements. However, the contrast between these two figures perhaps says something about the difficulty of such a count.

Marco

Regards, Jim Agenbroad ( [EMAIL PROTECTED] ) The above are purely personal opinions, not necessarily the official views of any government or any agency of any government. Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.
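Agenbroad's two candidate minimum-legibility rules (the lam-alef ligature and the Devanagari short i matra reordering) can be sketched in a few lines. This is only an illustrative toy, not any real rendering engine; the function name, and the choice of the isolated presentation-form code point for the ligature, are my own assumptions:

```python
# Toy sketch of two "minimum legibility" shaping operations:
#  - Arabic: lam + alef must fuse into a lam-alef ligature, and
#  - Devanagari: the short i matra (U+093F) is stored after its
#    consonant in logical order but rendered before it.

LAM, ALEF = "\u0644", "\u0627"
LAM_ALEF = "\uFEFB"        # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
I_MATRA = "\u093F"         # DEVANAGARI VOWEL SIGN I

def shape_minimal(text: str) -> str:
    """Return a display-order glyph string from logical-order characters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        # Replace lam + alef with the presentation-form ligature.
        if ch == LAM and i + 1 < len(text) and text[i + 1] == ALEF:
            out.append(LAM_ALEF)
            i += 2
            continue
        # Move the short i matra in front of the preceding glyph.
        if ch == I_MATRA and out:
            out.insert(len(out) - 1, I_MATRA)
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)
```

A real shaper would of course also handle contextual joining forms, conjuncts, and so on; the point is only that even this "minimal level" requires logical-to-visual processing by software, not just a bigger glyph table.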
RE: RECOMMENDATIONs( Term Asian is not used properly on Computersand NET)
James,

One of the reasons for grouping CJK together is that they have similar implementation strategies. If we are grouping for that reason, then maybe Aramaic languages should fall into the same category. In that case Asian is a very poor term to use. However, Han/Hanzi does not work either.

Implementation is very important. For example, Korean, except for occasional Han characters, is functionally much closer to the Indic scripts. If it were not for the crude font handling of the older systems, we probably would not implement Korean as a fully formed character set.

Carl

-Original Message-
From: James E. Agenbroad [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 31, 2001 12:30 PM
To: Carl W. Brown
Cc: [EMAIL PROTECTED]
Subject: RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)

Thursday, May 31, 2001

We seem to have strayed from searching for a clearer term than Asian. I think part of the problem is that many language names are also national adjectives, e.g., Chinese, Japanese and Korean. Likewise, names of scripts (or writing systems) are often also names of languages, e.g., Arabic. I would hope that input methods (for Chinese or Amharic characters) remain a separate issue: so long as they result in a Unicode encoding that can be unambiguously shared, it should not matter what keystrokes were used. (An analogy might be QWERTY vs. Dvorak input not affecting ASCII.) Input methods are still an important issue, but a separate one.

On Thu, 31 May 2001, Carl W. Brown wrote:

Liwal,

Such classifications are not easy. For example, Azeri can be written in both the Latin and Cyrillic scripts. The Latin script is much like Turkish, which has the dotted and dotless i. This is not necessarily a big issue for fonts, but it requires special case-shifting logic. What do you do about scripts that are not tied to a locale? The Orthodox Church uses a special Cyrillic font that is different from standard Cyrillic.
The classifications vary not only by script but by how they affect your specific field of interest and the implementation. For example, Unicode implements Ethiopic as fully formed syllabic characters. Some implementations use decomposed syllables; this allows 256-character code pages but requires glyph composition, which would make it similar to SE Asian and Indic processing. But with fully composed glyphs you would classify the language differently, probably as a large-character-set language with an input method editor, like the CJK languages.

Carl
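The "special case-shifting logic" Carl mentions for Turkish and Latin-script Azeri comes from the four-way i: capital I (U+0049) pairs with dotless ı (U+0131), and dotted İ (U+0130) pairs with lowercase i (U+0069), so the default one-to-one Unicode case mappings are wrong for these languages. A minimal sketch (the function names are mine; real libraries do this through locale-aware casing):

```python
# Turkic case shifting: handle the dotted/dotless i pairs explicitly,
# then fall back to the default Unicode mappings for everything else.

def turkic_lower(s: str) -> str:
    return "".join(
        "\u0131" if c == "I" else       # I -> dotless ı
        "i" if c == "\u0130" else       # dotted İ -> i
        c.lower()
        for c in s
    )

def turkic_upper(s: str) -> str:
    return "".join(
        "\u0130" if c == "i" else       # i -> dotted İ
        c.upper()                       # dotless ı -> I via default mapping
        for c in s
    )
```

Note that the default mapping lowercases "ISTANBUL" to "istanbul", which in Turkish or Azeri is simply a different word from the correct "ıstanbul"; this is why casing, unlike most font issues, is language-sensitive and not just script-sensitive.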
RE: Some Char. to Glyph Statistics, Pan/Single Font
At 5:35 PM +0200 5/31/01, Marco Cimarosti wrote:

Jungshik Shin wrote: I think I know how you counted (initial consonants: two for syllables with and without final consonants, three for three kinds of vowel position/shape; vowels: two for syllables with/without final consonants) and think you got it right.

You caught me with my hands in the jam: that was exactly my way of thinking. While I see that this is clearly too naive to be right, I would not be able to improve it any further myself. I welcome any refinement. In particular, I was curious about the other ratios (DOS 1:8, 1:4, 1:4; X11win 1:10, 1:3, 1:4; TrueType 1:~30) that you mentioned in your previous message.

Marco

A quick look at the Hangul syllable table starting on page 744 of TOS3 shows a much greater variation. If you look at the pages slightly cross-eyed so that each glyph aligns with a neighbor, and wink each eye alternately, you can get the effect of a blink comparator of the type used in astronomy before computer image processing became practical. If you can't keep the alignment while winking, just look for the fuzzy letters where the glyphs don't match up. Or we could ask a typographer. :-)

-- Edward Cherlin, Generalist
A knot! exclaimed Alice. Oh, do let me help to undo it. -- Alice in Wonderland
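For reference, the per-jamo arithmetic behind these glyph-ratio estimates is fully algorithmic: the precomposed Hangul syllables (U+AC00..U+D7A3) factor into 19 leading consonants × 21 vowels × 28 trailing slots (27 trailing consonants plus "none") = 11,172 syllables. A sketch of the standard Unicode decomposition (the function name is mine):

```python
# Decompose a precomposed Hangul syllable into its leading-consonant,
# vowel and trailing-consonant jamo indices (Unicode Hangul algorithm).

S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # leading, vowel, trailing (incl. none)

def decompose_hangul(syllable: str) -> tuple:
    """Return (leading, vowel, trailing) jamo indices for one syllable."""
    index = ord(syllable) - S_BASE
    assert 0 <= index < L_COUNT * V_COUNT * T_COUNT  # 11,172 syllables
    l = index // (V_COUNT * T_COUNT)
    v = (index % (V_COUNT * T_COUNT)) // T_COUNT
    t = index % T_COUNT
    return (l, v, t)

# Example: 한 (U+D55C) = leading ㅎ (18), vowel ㅏ (0), trailing ㄴ (4)
```

The glyph-counting question in the thread then reduces to deciding how many contextual shape variants each of the 19 + 21 + 27 jamo needs (two here, ten there, ~30 in a TrueType font), which is exactly where the estimates diverge.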
RE: Some Char. to Glyph Statistics, Pan/Single Font
At 5:12 PM +0200 5/31/01, Marco Cimarosti wrote:

Hi. Well, it can be said to be above the minimum :-) depending on how you look at things. If you're a developer of embedded devices with really stringent memory-footprint requirements (for fonts and other things), you may just go with 1:1 ratios for all three groups of jamo (consonants and vowels), as found in old (mechanical) Hangul typewriters. However, as you can guess, the result is not pleasing to most eyes.

The manual Hangul typewriter I learned on had multiple forms for initial consonants, supplied by means of an extra shift level. (Yes! A mechanical buckybit!! %-[ ) The really minimal level was *linear* Hangul, produced by the telegraph system.

[snip]

The minimal model that I have in mind is slightly less minimal: the least quality that won't sacrifice the normal orthographic rules of a language.

Which rules are the normal ones? Every publisher I've had anything to do with has used a different set of rules, over quite a wide range. We can't even agree whether ligatures are required in English, or whether an ASCII-sorted index is sufficiently human-readable.

Ciao. Marco

-- Edward Cherlin, Generalist
A knot! exclaimed Alice. Oh, do let me help to undo it. -- Alice in Wonderland