RE: Re[2]: Errata in language/script list
> I had assumed (yeah, I know...) that (1) was "Languages and their Scripts", (2) would be found in the Ethnologue, and (3) was the Roadmap; however I discovered today that the Ethnologue does not appear to list scripts. Is this so, or was I not looking in the right place?

It does in some cases, but not consistently. This is one area in which I'd also like to see the Ethnologue improve. It's mainly a matter of research and resources, though, and so there's no way for me to say how quickly that will happen. It didn't use to list scripts at all, though, and it has started to include that, so that's a step forward.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
RE: Re[2]: Errata in language/script list
> From: Thomas Chan [mailto:[EMAIL PROTECTED]]

> e.g., If someone asked 1-2 (pre-Unicode 3.1) years ago the question, "Can I write Cantonese with Unicode?", the answer would have been "no" or "not really". If it were asked today, the answer would be "yes". But try that question today with other minority Chinese languages substituted in it, and the answer is still pretty much a "no" or "not really".

Hmmm - on the other hand, given the effectively boundless syllabary, wouldn't the answer be "mostly" for all dialects? True, some dialects fit better than others, but still, I doubt that any dialect is unusable. In any case, I do not think that this list is a good place for tracking the encoding status of various languages, which is effectively what you are proposing.

> BTW, what do you consider to be a "darn small userbase", numberwise? Would the UCAS or Cherokee userbases be too "small" by your standards to include a mention of them?

Probably. I think I had best make one thing clear here: I am talking about what I feel belongs in the "Languages and Scripts" list, NOT what belongs in Unicode! As I see it, if a language isn't in active use on the internet (above some arbitrary but small threshold), it shouldn't be listed, on the assumption that there isn't any content.

> I'm sure there are more than twelve people who use it for writing and/or research, though. Start counting with the number of people who write in it, and add to that figure the researchers and their assistants (i.e., their students) who are doing the surveys...

I deliberately do not include the researchers, since they should already know the name of the script... ;-)

> > > And this is without going into historical alternative ways of writing Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the 1900s-1920s.
> >
> > ...which we don't really need to do, I think, since we're trying to stick to the useful stuff.
>
> What do you consider "useful"? What one person considers "useless" is "useful" to someone else. Without specific requirements like userbase size, economic power, cultural significance, extant writings, etc, I don't think we can start making any claims about usefulness.

You listed an END date for that script, which I translate as "dead script". Since dead scripts are of interest only to academics and hobbyists, who already know their names, I consider it useless to put it on the list. I mean no judgement of the script, its language(s), or any people who have used it.

> Anyway, I don't see usefulness as one of the requisites for inclusion on the list in question.

This is where we disagree, perhaps. I see three useful lists:

1.) A brief list for nonexperts who need to communicate in an unfamiliar tongue. This list would be limited to languages which have shown some small presence on the internet.
2.) A comprehensive listing of languages and the scripts they use.
3.) A list of the encoding status of scripts by language (to deal with, among other things, the Chinese dialect problem).

I had assumed (yeah, I know...) that (1) was "Languages and their Scripts", (2) would be found in the Ethnologue, and (3) was the Roadmap; however I discovered today that the Ethnologue does not appear to list scripts. Is this so, or was I not looking in the right place?

> It's clear to me that you have a very low opinion of minority languages, scripts, and characters.

No, although I admit that it may seem that way from this thread. I hope that my explanations above have cleared this up.

> Whether or not transliteration is beyond the scope of the list in question is one issue, and I agree that it would open up the possibility of listing almost every language with almost every script (or at least, Latin). But what's your rationale for claims like "bastard children"? (And what is that supposed to mean, anyway?)

It means that transliteration scripts are not native to the language. Perhaps I am hypersensitive to this, having had pinyin forced down my throat by an overzealous teacher (I am among those who find it to be as much hindrance as help), but I think that there is a certain mentality extant that it is acceptable to use transliteration instead of a language's native script. I consider this mentality to be simply wrong, and feel that listing transliteration scripts for languages would give undue respectability to using them, especially given that Unicode removes the need for most transliteration scripts.

/|/|ike
RE: Re[2]: Errata in language/script list
On Mon, 13 Aug 2001, Ayers, Mike wrote:

> > From: Thomas Chan [mailto:[EMAIL PROTECTED]]
> > No, they do. While the dominant way that Chinese languages are written today, which is based on Mandarin Chinese, has been well supported since pre-Unicode 3.0 days, other Chinese languages have faced the problem of many unencoded (or yet-to-be-encoded) characters. I've written on this matter on this list before, principally about Yue Chinese (=~ Cantonese), but it is also applicable to other Chinese languages.
>
> Since those all will get coded into the Chinese alphabet (if they get coded), what's the point?

It's pretty simple. Just because enough of a script is encoded for the needs of one language doesn't mean that is necessarily true for other languages that use that script. In time, those omissions are patched up in newer versions of Unicode. Latin, Cyrillic, Arabic, and other scripts have all had new characters added to them in successive versions of Unicode.

e.g., If someone asked 1-2 (pre-Unicode 3.1) years ago the question, "Can I write Cantonese with Unicode?", the answer would have been "no" or "not really". If it were asked today, the answer would be "yes". But try that question today with other minority Chinese languages substituted in it, and the answer is still pretty much a "no" or "not really".

> > Some also require different scripts, such as the Dungan living in the former Soviet Union, who write in Cyrillic (I've been told all the characters they need are encoded), or some Min Chinese, who write in whole or part using the characters in the Bopomofo Extended block (Unicode 3.0) and/or Latin (using certain letters and diacritics that weren't always
>
> If you get genuine exceptions, then list them (i.e. list "Min Chinese"). I get the feeling that you're talking about a darn small userbase here, though.

According to the SIL Ethnologue 14th ed.[1], Dungan (SIL "DNG"):

  38,000 in Kyrgyzstan (1993 Johnstone). Mother tongue speakers were 95% out of an ethnic population of 52,000 in the former USSR (1979 census). Population total all countries 49,400 out of an ethnic population of 100,000.

[1] http://www.ethnologue.com/show_language.asp?code=DNG

I don't have figures for the size of the userbase of Min Chinese written in Latin script offhand, but see for instance "Proposal to add Latin characters required by Latinized Taiwanese languages to ISO/IEC 10646"[2] (1997.6.26) under the "user community" questions.

[2] http://www.egt.ie/standards/la/taioan.html

(Did this ever become a WG2 document? I recall seeing discussions of this once, but can't find them offhand at the moment.)

BTW, what do you consider to be a "darn small userbase", numberwise? Would the UCAS or Cherokee userbases be too "small" by your standards to include a mention of them?

> > encoded). There's also the Hunan women who write in the unencoded Nushu script that was discussed on this list rather recently.
>
> Discussed well enough for me to know that we're talking about a userbase of approximately twelve and counting down. This is not a very pressing case.

No, it's probably not pressing at the moment. I'm sure there are more than twelve people who use it for writing and/or research, though. Start counting with the number of people who write in it, and add to that figure the researchers and their assistants (i.e., their students) who are doing the surveys...

> > And this is without going into historical alternative ways of writing Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the 1900s-1920s.
>
> ...which we don't really need to do, I think, since we're trying to stick to the useful stuff.

What do you consider "useful"? What one person considers "useless" is "useful" to someone else. Without specific requirements like userbase size, economic power, cultural significance, extant writings, etc, I don't think we can start making any claims about usefulness.

The Bible (or portions of it) has been published using Guanhua Zimu[3]. Is that not "useful" to someone?

[3] From Eugene A. Nida, ed., _Book of a Thousand Tongues_, 2nd ed. (London: United Bible Societies, 1972): http://deall.ohio-state.edu/grads/chan.200/misc/guanhua_zimu.jpg

If you think historical scripts are not useful, then perhaps the four Philippine scripts, Ogham, Runic, etc should not be mentioned on the list.

Anyway, I don't see usefulness as one of the requisites for inclusion on the list in question.

> > And then there are various transliteration schemes, which, although they are not anyone's primary script, are widely employed, such as Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets don't have them, or only include ugly full-width versions) for Mandarin, or Yale for Cantonese (e.g., people ask if a precomposed "m
RE: Re[2]: Errata in language/script list
> From: Thomas Chan [mailto:[EMAIL PROTECTED]]

> > > "dialects", e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information.
> >
> > They don't. The joy of unification!
>
> No, they do. While the dominant way that Chinese languages are written today, which is based on Mandarin Chinese, has been well supported since pre-Unicode 3.0 days, other Chinese languages have faced the problem of many unencoded (or yet-to-be-encoded) characters. I've written on this matter on this list before, principally about Yue Chinese (=~ Cantonese), but it is also applicable to other Chinese languages.

Since those all will get coded into the Chinese alphabet (if they get coded), what's the point?

> Some also require different scripts, such as the Dungan living in the former Soviet Union, who write in Cyrillic (I've been told all the characters they need are encoded), or some Min Chinese, who write in whole or part using the characters in the Bopomofo Extended block (Unicode 3.0) and/or Latin (using certain letters and diacritics that weren't always

If you get genuine exceptions, then list them (i.e. list "Min Chinese"). I get the feeling that you're talking about a darn small userbase here, though.

> encoded). There's also the Hunan women who write in the unencoded Nushu script that was discussed on this list rather recently.

Discussed well enough for me to know that we're talking about a userbase of approximately twelve and counting down. This is not a very pressing case.

> And this is without going into historical alternative ways of writing Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the 1900s-1920s.

...which we don't really need to do, I think, since we're trying to stick to the useful stuff.

> There is also the blind, for which Braille schemes exist for at least Mandarin and Cantonese, although I'll concede that Braille could be listed for almost any language.

Yep. That was discussed some time ago.

> And then there are various transliteration schemes, which, although they are not anyone's primary script, are widely employed, such as Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets don't have them, or only include ugly full-width versions) for Mandarin, or Yale for Cantonese (e.g., people ask if a precomposed "m" with a grave accent is encoded, as that is needed to transcribe the negative).

Transliteration scripts should be treated like the bastard children that they are and accorded no status. Listing them would only cause unnecessary confusion.

/|/|ike
RE: Re[2]: Errata in language/script list
On Wed, 1 Aug 2001, Ayers, Mike wrote:

> > From: Marco Cimarosti [mailto:[EMAIL PROTECTED]]
> > BTW, I notice that a single "Chinese" entry is listed. This should probably be split in several entries for the various Chinese languages (or "dialects", e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information.
>
> They don't. The joy of unification!

No, they do. While the dominant way that Chinese languages are written today, which is based on Mandarin Chinese, has been well supported since pre-Unicode 3.0 days, other Chinese languages have faced the problem of many unencoded (or yet-to-be-encoded) characters. I've written on this matter on this list before, principally about Yue Chinese (=~ Cantonese), but it is also applicable to other Chinese languages.

Some also require different scripts, such as the Dungan living in the former Soviet Union, who write in Cyrillic (I've been told all the characters they need are encoded), or some Min Chinese, who write in whole or part using the characters in the Bopomofo Extended block (Unicode 3.0) and/or Latin (using certain letters and diacritics that weren't always encoded). There's also the Hunan women who write in the unencoded Nushu script that was discussed on this list rather recently.

And this is without going into historical alternative ways of writing Chinese, such as the prolific Guanhua Zimu alphabet/syllabary used in the 1900s-1920s.

There is also the blind, for which Braille schemes exist for at least Mandarin and Cantonese, although I'll concede that Braille could be listed for almost any language.

And then there are various transliteration schemes, which, although they are not anyone's primary script, are widely employed, such as Hanyu Pinyin (people do ask, as legacy GB2312 and Big5 character sets don't have them, or only include ugly full-width versions) for Mandarin, or Yale for Cantonese (e.g., people ask if a precomposed "m" with a grave accent is encoded, as that is needed to transcribe the negative).

Thomas Chan
[EMAIL PROTECTED]
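For anyone wanting to check the precomposed-"m" question against current Unicode data, here is a minimal sketch. It assumes Python 3 and whatever Unicode version its standard unicodedata module bundles; has_precomposed is a hypothetical helper name used only for illustration.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT


def has_precomposed(base, combining):
    """Return True if NFC merges base + combining mark into one code point.

    Assumption: "precomposed form exists" is taken to mean that canonical
    composition (NFC) yields a single code point for the pair.
    """
    composed = unicodedata.normalize("NFC", base + combining)
    return len(composed) == 1


# Compare a few base letters; "m" is the case asked about for Yale Cantonese.
for letter in ("a", "n", "m"):
    if has_precomposed(letter, GRAVE):
        status = "precomposed code point exists"
    else:
        status = "combining sequence only"
    print(letter + " + grave: " + status)
```

NFC only merges a base letter and a mark when a precomposed code point exists, so a two-code-point result for "m" means the letter has to be written as "m" followed by U+0300.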
RE: Re[2]: Errata in language/script list
> From: Marco Cimarosti [mailto:[EMAIL PROTECTED]]

> This is not correct: I have found the term "Han" or "hanzi" in any kind of literature, not only in Unicode documentation.

"Hanzi" is a loan word which I have also often seen (usually written in italics as it should be), but I never said I hadn't - only "Han", which I have seen used to describe the Chinese people, but never their writing.

> I am not sure, however, that the two terms are 100% the same, in Western languages. "Hanzi" is less ethnically marked than "Chinese characters", regardless that they mean exactly the same thing.

Only in some eyes. In any case, what has "ethnic marking" to do with it?

> (The choice between synonyms is rarely neutral, when politics are involved.

If this is the case in Chinese, it is a well kept secret. I have challenged this list before to show me evidence of political issues concerning the Chinese/Han thing - no evidence surfaced. The research I have done indicates that the main reason behind the two ways to say "Chinese" (zhongguo/han) is linguistic, not political. In any case, none of this talk changes the fact that telling people that they need a "Han" font to view Chinese is rather unhelpful.

> BTW, I notice that a single "Chinese" entry is listed. This should probably be split in several entries for the various Chinese languages (or "dialects", e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information.

They don't. The joy of unification!

/|/|ike
RE: Re[2]: Errata in language/script list
On Tue, 31 Jul 2001, Marco Cimarosti wrote:

> BTW, I notice that a single "Chinese" entry is listed. This should probably be split in several entries for the various Chinese languages (or "dialects", e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information.

In the absence of additional qualifying information, I think "Chinese" would be interpreted as the most salient variety, the modern standard written Chinese (based on Mandarin Chinese; SIL "CHN") in dominant use today by speakers of all Chinese languages.

However, some people might have questions asking for details like "Does Unicode have traditional characters?" and/or "Does Unicode have simplified characters?"--it might even be worth pointing out that both can be used concurrently, which is not what people accustomed to the likes of GB2312, Big5, etc would expect.

Still others might ask, "Does Unicode have Cantonese/Hong Kong characters?" (the terms are not exactly synonymous, but often interchanged). Prior to Unicode 3.1's introduction of the Han characters in Plane 2, I'd say that support for Yue Chinese (SIL "YUH"; ~= Cantonese) was not really usable. With a logosyllabic script, it'll never be possible to exhaustively check that all its characters are included, but it looks very usable now--I've had high success rates finding them in Plane 2, partially due to sourcing from the HKSCS character set (H source) from Hong Kong, and partially due to sourcing from large dictionaries such as the _Hanyu Da Zidian_ (G-HZ source) where characters (and the words they transcribe) have died out in Mandarin, but are preserved in Yue and other Chinese languages.

However, I'm not so sure what the situation is for other Chinese languages, other than a vague impression that they are not well supported--probably the stage that Yue Chinese was at with Unicode 2.1.
e.g., U+20547 is used only in Min Chinese (MNP, CFR), meaning 'hard, durable', with a pseudo-Mandarin reading of dian4. It's in Unicode only because it happened to be in HKSCS, and to my knowledge that is the only character set it appears in, perhaps for the use of Chaozhou speakers ("Chiuchow", "Teochew"), a linguistic minority in Hong Kong (Chaozhou is a dialect of Minnan Chinese, CFR). U+20547 is also documented only in very few dictionaries, none of which were apparently a source for Unicode. I think any support for Min Chinese at this point is probably accidental.

(FYI, U+20547 looks like U+6709 with the two center strokes removed and replaced by U+4E36.)

Thomas Chan
[EMAIL PROTECTED]
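Two points from this message can be checked mechanically: that U+20547 sits in Plane 2 (and so needs a surrogate pair in UTF-16), and that traditional and simplified characters can share one Unicode text where GB2312 or Big5 cannot hold both. The following is a rough sketch, assuming Python 3 and its bundled "gb2312" and "big5" codecs; the helper names plane, utf16_units and can_encode are made up for this illustration.

```python
def plane(cp):
    """Return the Unicode plane number (0 = BMP) of a code point."""
    return cp >> 16


def utf16_units(cp):
    """Return the UTF-16 code units needed to store one code point."""
    data = chr(cp).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The Min Chinese character discussed above, plus the two BMP characters
# used to describe its shape.
for cp in (0x20547, 0x6709, 0x4E36):
    units = " ".join(format(u, "04X") for u in utf16_units(cp))
    print("U+%04X: plane %d, UTF-16 units: %s" % (cp, plane(cp), units))

# Traditional U+6F22 and simplified U+6C49 side by side in one string.
mixed = "\u6f22\u6c49"
for codec in ("utf-8", "gb2312", "big5"):
    print(codec, "can hold both forms:", can_encode(mixed, codec))
```

The codec check says nothing about fonts or input methods; it only shows whether a character set has a slot for each character, which is the narrower question being asked here.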
RE: Re[2]: Errata in language/script list
Mike Ayers wrote:

> > From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]
> >
> > > Also, I see that the script for Chinese is listed as "Han", not "Chinese". Must we insist on confusing people?
> >
> > The script in question is designated "Han" in the Unicode Standard, and has always been so, in part because it is also used for Japanese, Korean, and Vietnamese, and not just Chinese.
>
> That might be an argument, if "Han" *didn't* mean "Chinese". The script is called "Chinese characters" in the local language in every language I've come across, including English. Only amidst the Unicode crowd have I encountered any other name.

This is not correct: I have found the term "Han" or "hanzi" in any kind of literature, not only in Unicode documentation.

It is true that "Chinese" or "Chinese characters" is also common and, personally, I tend to prefer it.

I am not sure, however, that the two terms are 100% the same, in Western languages. "Hanzi" is less ethnically marked than "Chinese characters", regardless that they mean exactly the same thing.

(The choice between synonyms is rarely neutral, when politics are involved. In Italian, e.g., both "tedesco" and "germanico" simply mean "German". However, while "tedesco" is the normal word to be used, "germanico" took a political connotation during the fascist regime, and people who use it now probably want to show their far-right affiliation.)

> No, Japanese and Korean should be listed as using Chinese script as well, since they do.

They do, using the term "Han".

BTW, I notice that a single "Chinese" entry is listed. This should probably be split in several entries for the various Chinese languages (or "dialects", e.g. Mandarin, Cantonese, Hakka, etc.). This split may be handy because the different languages could need different information.

Ciao.
Marco
RE: Re[2]: Errata in language/script list
> From: Kenneth Whistler [mailto:[EMAIL PROTECTED]]

> > Also, I see that the script for Chinese is listed as "Han", not "Chinese". Must we insist on confusing people?
>
> The script in question is designated "Han" in the Unicode Standard, and has always been so, in part because it is also used for Japanese, Korean, and Vietnamese, and not just Chinese.

That might be an argument, if "Han" *didn't* mean "Chinese". The script is called "Chinese characters" in the local language in every language I've come across, including English. Only amidst the Unicode crowd have I encountered any other name.

> So I think it would be equally or more confusing to list the *script* for Chinese as Chinese, implying that it is somehow different from the Han script otherwise so designated in many places on the Unicode website and in the Unicode Standard.

No, Japanese and Korean should be listed as using Chinese script as well, since they do.

Look, if the UTC wants to invent a new name for the script in order to tap dance their way around political sensitivities that simply do not exist, they can dance to their heart's content for all I care. However, the web page in question is being posted as a service to those who know little to nothing about the languages and scripts in question, and the term "Han font", when typed into most search engines, does not tend to lead one towards a font suitable for viewing Chinese, Japanese, or Korean text. "Chinese font", on the other hand, tends to put useful links on the first screenful. "CJK font" is also productive, but not quite as productive as "Chinese font". So perhaps you should consider listing the script as "Han(Chinese)" or "Han(CJK)", so that the list can serve its purpose.

/|/|ike
RE: Re[2]: Errata in language/script list
Mike Ayers responded to Philipp Reichmuth:

> > Chinese - Han, Latin[4]
> > Arabic - Arabic, Latin[4], Cyrillic[4]
>
> This at least would be more accurate, but I do not think that it would be worth the effort for a few reasons:

[[ Good reasons omitted ]]

> Also, I see that the script for Chinese is listed as "Han", not "Chinese". Must we insist on confusing people?

The script in question is designated "Han" in the Unicode Standard, and has always been so, in part because it is also used for Japanese, Korean, and Vietnamese, and not just Chinese. So I think it would be equally or more confusing to list the *script* for Chinese as Chinese, implying that it is somehow different from the Han script otherwise so designated in many places on the Unicode website and in the Unicode Standard.

--Ken
RE: Re[2]: Errata in language/script list
> From: Philipp Reichmuth [mailto:[EMAIL PROTECTED]]

> On a side note of course it would by now probably make sense to add "Latin" as alphabet to Chinese as well since hanyu pinyin has been adopted as some sort of official latinization system by the Chinese government, but that's an entirely different matter.

As I understand it, the chief reason that the PRC declared a standard for romanization was so that they could at least look at two transcriptions of Chinese words and know if they were the same. However, I have learned that there have at least been experiments in the PRC with teaching the pinyin to children. The young man with whom I spoke told me that they had taught him pinyin in the first grade[1], but began teaching hanzi in the second grade without using pinyin (pinyin was never used again). In any event, it is incorrect to use pinyin to write Chinese if hanzi are available, so I do not think that it would be correct to list Latin as an alphabet for Chinese, just as it would be incorrect to list any alphabet for any language when that alphabet is only used by the illiterate (in that language) or those unable to write properly.

> Maybe one could introduce another abbreviation such as
> "[4] = In use only for transliteration of the language"
> and thus add the transliteration scripts for the languages? For languages such as Arabic or Chinese where the Latin and Cyrillic (only for Arabic) transliterations enjoy rather extensive use in the linguist community, it would allow them to see if they can write their transcribed academic work using Unicode as well.
>
> We'd then have, for example:
> Chinese - Han, Latin[4]
> Arabic - Arabic, Latin[4], Cyrillic[4]

This at least would be more accurate, but I do not think that it would be worth the effort for a few reasons:

1) This list is mainly so that people who are not familiar with a language can determine if they have the fonts to view a language. Throwing in the extra scripts is more likely to confuse people than enlighten them.
2) The scripts for transliterating languages tend to be sensitive to who's doing the transliteration.
3) It would take a good deal of effort to keep up on which scripts are being used to discuss which language.
4) The folks doing the transliteration are likely to already know what scripts to use. If they don't, I suspect that they'd check with whomever they intend to exchange, not the Unicode website.

Also, I see that the script for Chinese is listed as "Han", not "Chinese". Must we insist on confusing people?

/|/|ike

[1] - I never pinned down his age at the time of these events, so the references to "grades" are simply approximate comparisons to the American educational system.
Re: Re[2]: Errata in language/script list
On 07/27/2001 08:35:18 PM Philipp Reichmuth wrote:

>>> - Ge'ez: Ge'ez is not used anymore except for liturgical purposes, so I'd consider it a bit problematic to specify a country where it's spoken. I'd probably remove the "Eritrea, Ethiopia" country specification.
>
>PD> Ge'ez is also used in comparative Semitic linguistics (primarily by biblical and ANE scholars). See Thomas O. Lambdin "Introduction to Classical Ethiopic (Ge'ez)", Harvard Semitic Studies, vol. 24, Scholars Press, 1978.
>
>This is right, of course, in so far as Ge'ez is an important language within comparative Semitic studies.

Is it not also a liturgical language that is currently used as such?

>However, as far as I understand the language list under discussion here, it encompasses languages as they are spoken. If this distinction is not made, then the concept of the entire list will have to be altered a bit; for example, for practically every single language in the list one would have to add the script "Latin" because it is most probably being used in some Latin transcription or the other within linguistic studies of the respective language.

But note that Ge'ez is listed as written with Ethiopic script, not Latin. I think the list is fine as is. If someone is interested in a language like Ge'ez, whether for current liturgical purposes or for purposes of historical or literary research, they may want to work with digital text in that language, and therefore be interested in knowing whether or not it is supported by Unicode. This list tells them that it is. I agree that Romanisations of various scripts are quite another matter, and would involve a number of changes in this list, but I don't think that's an issue for the Ge'ez entry.

>In addition, then it makes even less sense to specify countries for dead/liturgical/etc. (i.e. not used in everyday conversation) languages since most scientific activity in comparative linguistics (at least in Semitics) takes place outside of the area of origin of the respective language.

Note that the entry for Latin doesn't list a country. On the other hand, it may make sense to list a country if liturgical usage is predominantly in certain countries only.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>