Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
On 07/30/2002 06:32:07 AM John Cowan wrote:

>> The industry needs to wake up to the fact
>> that the requirement that a language have an ISO-639 2-letter code before a
>> locale can be created is a dead end.
>
>These words deserve to be written up in letters of gold.

I like that. (Where are those chromatic fonts when you need them? :-)

>> Well, I'm pretty sure Hawaiian isn't going to get it,
>
>It seems to me that if the request were backed by the State of Hawaii
>(or one of its agencies) it would meet the ISO 639-1 criteria, save
>perhaps for the number of speakers. Hawai'ian is an official language
>of Hawaii, after all.

But that's not the sole criterion a language has to meet. The requirements are explained at http://linux.infoterm.org/infoterm-e/i-infoterm.htm?raiso639-1_start.htm~Mitte. Quoting:

The following criteria for defining new languages in ISO 639-1 has been established by the ISO 639 Joint Advisory Committee.

Relation to ISO 639-2.
    Since ISO 639-1 is to remain a subset of ISO 639-2, it must first satisfy the requirements for ISO 639-2 and also satisfy the following.

Documentation.
    a significant body of existing documents (specialized texts, such as college or university textbooks, technical documentation manuals, specialized journals, subject-field related books, etc.) written in specialized languages
    a number of existing terminologies in various subject fields (e.g. technical dictionaries, specialized glossaries, vocabularies, etc. in printed or electronic form)

Recommendation.
    A recommendation and support of a specialized authority (such as a standards organization, governmental body, linguistic institution, or cultural organization)

Other considerations
    the number of speakers of the language community
    the recognized status of the language in one or more countries
    the support of the request by one or more official bodies

Hawaiian meets some of these requirements, but it has not (AFAIK) been developed to the point of having special terminologies in various subject fields.

- Peter

---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
[EMAIL PROTECTED] scripsit:

> The industry needs to wake up to the fact
> that the requirement that a language have an ISO-639 2-letter code before a
> locale can be created is a dead end.

These words deserve to be written up in letters of gold.

> Well, I'm pretty sure Hawaiian isn't going to get it,

It seems to me that if the request were backed by the State of Hawaii (or one of its agencies) it would meet the ISO 639-1 criteria, save perhaps for the number of speakers. Hawai'ian is an official language of Hawaii, after all.

> Instead of asking
> for a 2-letter code, the engineers should have been looking at what it
> would take to make the software support a 3-letter code (which already
> exists in ISO 639-2).

Indeed, an arbitrary-length string should be supported. GNOME at least seems to have no trouble with this: the art-lojban localization has not run into any problems with its 10-character language code.

--
John Cowan <[EMAIL PROTECTED]>
http://www.reutershealth.com http://www.ccil.org/~cowan
Yakka foob mog. Grug pubbawup zink wattoom gazork. Chumble spuzz.
    -- Calvin, giving Newton's First Law "in his own words"
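To make the "arbitrary-length string" point concrete, here is a minimal Python sketch of a locale-name parser that places no length limit on the language part. It assumes the common POSIX-style layout language[_territory][.codeset][@modifier]; the names LocaleName and parse_locale_name are made up for this illustration and do not come from GNOME or any other real library.

```python
import re
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    """Pieces of a POSIX-style locale name (illustrative type, not a real API)."""
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


# The language part may be any length and may contain hyphens, so "haw",
# "japanese" and "art-lojban" are accepted just like a two-letter code.
_LOCALE_RE = re.compile(
    r"^(?P<language>[A-Za-z][A-Za-z0-9-]*)"
    r"(?:_(?P<territory>[A-Za-z0-9]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?"
    r"(?:@(?P<modifier>[A-Za-z0-9]+))?$"
)


def parse_locale_name(name: str) -> LocaleName:
    """Split a locale name into its parts without assuming a 2-letter language."""
    match = _LOCALE_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognizable locale name: {name!r}")
    # Groups that did not take part in the match come back as None.
    return LocaleName(**match.groupdict())


if __name__ == "__main__":
    for sample in ("en_US", "haw_US.UTF-8", "japanese", "art-lojban"):
        print(sample, "->", parse_locale_name(sample))
```

Nothing in the parser depends on the length of the language part, which is the whole point: the two-letter limit is a convention of particular implementations, not something the name format itself forces.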
RE: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
> > One that occurs to me might be the Khoisan languages of Africa,
> > which I believe commonly use "!" (U+0021) for a click sound.
> > This is almost exactly the same problem you are describing for Tongva.
>
> U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was
> encoded precisely for this. It is to be *distinguished* from
> U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems
> which would attend having a punctuation mark as part of your letter
> orthography. A Khoisan orthography keyboard should distinguish the
> two characters (if, indeed, it makes any use at all of the exclamation
> mark per se), so that users can tell them apart and enter them
> correctly.
>

Amazing! It is there (and has been "forever", since it has a Unicode 1.0 name) and doesn't even normalize to ol' U+0021.

Nonetheless, I suspect that the exclamation mark's origin was in the use of ASCII for the otherwise unrepresented sounds and that the "should" in your note remains at least somewhat unrealized. A brief Googling of Khoisan produces pages that use !, #, //, and ' for the clicks encoded at U+01C0..U+01C3, including the Rosetta Project page which is encoded as UTF-8 (!!), but uses the ASCII characters, not the specially encoded variants cited.

Of course, none of the sites I searched was actually IN one of these languages. Every one that I saw was in English (one had a link to an Afrikaans page). Perhaps the various Khoisan peoples who have web pages are using the Unicode characters in question. But the likely prevalence of English (or at least Western European) keyboards and systems probably has encouraged the widespread non-adoption of the correct characters (hence, this may be the example that proves the rule, although I can't think of anything else that looks more like a click than a bang or an octothorpe ;-).

Best Regards,

Addison
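The two observations in this message (the click letters do not normalize to ASCII, and pages in the wild tend to use ASCII anyway) can be checked with Python's standard unicodedata module. This is only a sketch: the helper names are invented here, and the sample strings are illustrative rather than taken from any of the pages mentioned.

```python
import unicodedata

# The four dedicated click letters, U+01C0 through U+01C3.
CLICK_LETTERS = [chr(cp) for cp in range(0x01C0, 0x01C4)]


def is_normalization_stable(ch: str) -> bool:
    """True if none of the four normalization forms changes the character."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(form, ch) == ch for form in forms)


def count_click_letters(text: str) -> int:
    """Count occurrences of the dedicated click letters in a string."""
    return sum(text.count(ch) for ch in CLICK_LETTERS)


if __name__ == "__main__":
    for ch in CLICK_LETTERS:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_normalization_stable(ch))
    # An ASCII "!" spelling contains none of the dedicated letters.
    print(count_click_letters("!Kung"), count_click_letters("\u01C3Kung"))
```

A count of zero on a page that is clearly writing click consonants is a rough signal that it uses the ASCII stand-ins; it cannot say which ASCII character stands for which click.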
Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
> One that occurs to me might be the Khoisan languages of Africa,
> which I believe commonly use "!" (U+0021) for a click sound.
> This is almost exactly the same problem you are describing for Tongva.

U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was encoded precisely for this. It is to be *distinguished* from U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems which would attend having a punctuation mark as part of your letter orthography. A Khoisan orthography keyboard should distinguish the two characters (if, indeed, it makes any use at all of the exclamation mark per se), so that users can tell them apart and enter them correctly.

--Ken
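One way to see the "processing problems" Ken mentions is to compare how generic, property-driven code treats the two characters. The Python sketch below uses only the standard library; describe and word_tokens are names chosen for this example, and the regular expression stands in for any word-splitting code that relies on character properties rather than on a locale.

```python
import re
import unicodedata

EXCLAMATION_MARK = "\u0021"   # '!' punctuation
RETROFLEX_CLICK = "\u01C3"    # looks similar, but is a letter


def describe(ch: str):
    """Return the character's Unicode name and General Category."""
    return unicodedata.name(ch), unicodedata.category(ch)


def word_tokens(text: str):
    """Split text into words using the regex engine's notion of a word character."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    print(describe(EXCLAMATION_MARK))              # category Po (punctuation)
    print(describe(RETROFLEX_CLICK))               # category Lo (letter)
    print(word_tokens(EXCLAMATION_MARK + "kung"))  # the '!' is dropped from the word
    print(word_tokens(RETROFLEX_CLICK + "kung"))   # the click letter stays in the word
```

The same split shows up in search indexing, double-click selection, and identifier parsing: with U+0021 the first sound of the word is silently discarded, while with U+01C3 it survives without any locale-specific code.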
RE: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
I know, hence the jocular tone with wink-and-smile.

You are much more likely to get people's attention if you have a by-god-two-letter code than if you don't. (Today) you just can't ignore the perception that two-letter codes are somehow "legit" and three-letter codes somehow aren't... and that too many locale structures are based explicitly on the two-letter flavor.

On the other hand, I suspect that the two-letter dogma is more past-history than actual technical requirement. For example, there are real Solaris locales with names like "japanese". Java allows you to ask for/construct a locale with any pair/trio of strings (said locale doesn't have any meaning, since you can't populate the data files). And so on. Just because no one makes locales using 3-letter codes doesn't mean it is technically impossible. (But it doesn't mean that there is no restriction either.)

Of course, I understand why a company might make a business decision not to make and support a locale for a language that doesn't qualify for a two-letter code. Lack of compelling business reasons to build, change, or test support for minority languages is probably more of a limiter here than active engineering work preventing it.

Addison

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of [EMAIL PROTECTED]
> Sent: Monday, July 29, 2002 3:02 PM
> To: [EMAIL PROTECTED]
> Subject: Re: (long) Making orthographies computer-ready (was *not*
> Telephoning Tamil)
>
>
> On 07/29/2002 03:56:36 PM "Addison Phillips [wM]" wrote:
>
> >Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will
> >note that almost without exception the entries are locale driven. The first
> >stop in creating a new orthography (or computerizing an existing one, perhaps
> >from the days of the typewriter), for my money would probably be to get ISO-639
> >to issue the language a 2-letter code so you can have locale (and Unicode
> >character database) data tagged with it ;-).
>
> OK, now you've hit a hot button: The industry needs to wake up to the fact
> that the requirement that a language have an ISO-639 2-letter code before a
> locale can be created is a dead end. There just aren't enough 2-letter
> codes to go around, and ISO 639-1 has restrictive requirements for doling
> out 2-letter codes -- it wasn't created for the benefit of locale
> implementers, but for the benefit of terminologists. Luiseño and Tongva
> simply are not candidates. This very issue was raised with the relevant ISO
> committee in relation to Hawaiian: a 2-letter code was requested
> specifically because someone was trying to get a Unix implementation
> developed and was told by the engineers that it couldn't be done without an
> ISO 2-letter code. Well, I'm pretty sure Hawaiian isn't going to get it,
> because it doesn't meet the requirements for ISO 639-1. Instead of asking
> for a 2-letter code, the engineers should have been looking at what it
> would take to make the software support a 3-letter code (which already
> exists in ISO 639-2).
>
>
> - Peter
>
>
> ---
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
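As a rough illustration of the claim that the two-letter habit is convention rather than necessity, the Python sketch below keys locale data on plain strings and resolves a request by trimming components until something matches. Everything in it is hypothetical: LOCALE_DATA, fallback_chain, and lookup are invented names, the table contents are placeholders, and no real platform's lookup rules are being reproduced.

```python
from typing import Dict, List

# Hypothetical locale data keyed by plain strings. Nothing here requires the
# key to be two letters long; "" is the root entry used as a last resort.
LOCALE_DATA: Dict[str, Dict[str, str]] = {
    "": {"greeting": "Hello"},
    "haw": {"greeting": "Aloha"},
    "japanese": {"greeting": "Konnichiwa"},
}


def fallback_chain(name: str) -> List[str]:
    """List lookup candidates from most specific to the root entry."""
    chain = []
    current = name
    while current:
        chain.append(current)
        # Drop the last "_"-separated component: "haw_US" becomes "haw".
        current = current.rpartition("_")[0]
    chain.append("")
    return chain


def lookup(name: str, key: str) -> str:
    """Return the first value found for key along the fallback chain."""
    for candidate in fallback_chain(name):
        data = LOCALE_DATA.get(candidate)
        if data is not None and key in data:
            return data[key]
    raise KeyError(key)


if __name__ == "__main__":
    print(fallback_chain("haw_US"))
    print(lookup("haw_US", "greeting"))
    print(lookup("japanese", "greeting"))
    print(lookup("xx_YY", "greeting"))  # unknown name falls back to the root
```

The hard part, as the message says, is populating and testing the data behind each key, which is a matter of effort and business priority rather than of identifier length.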
Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)
On 07/29/2002 03:56:36 PM "Addison Phillips [wM]" wrote:

>Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will
>note that almost without exception the entries are locale driven. The first
>stop in creating a new orthography (or computerizing an existing one, perhaps
>from the days of the typewriter), for my money would probably be to get ISO-639
>to issue the language a 2-letter code so you can have locale (and Unicode
>character database) data tagged with it ;-).

OK, now you've hit a hot button: The industry needs to wake up to the fact that the requirement that a language have an ISO-639 2-letter code before a locale can be created is a dead end. There just aren't enough 2-letter codes to go around, and ISO 639-1 has restrictive requirements for doling out 2-letter codes -- it wasn't created for the benefit of locale implementers, but for the benefit of terminologists. Luiseño and Tongva simply are not candidates.

This very issue was raised with the relevant ISO committee in relation to Hawaiian: a 2-letter code was requested specifically because someone was trying to get a Unix implementation developed and was told by the engineers that it couldn't be done without an ISO 2-letter code. Well, I'm pretty sure Hawaiian isn't going to get it, because it doesn't meet the requirements for ISO 639-1. Instead of asking for a 2-letter code, the engineers should have been looking at what it would take to make the software support a 3-letter code (which already exists in ISO 639-2).

- Peter

---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
(long) Making orthographies computer-ready (was *not* Telephoning Tamil)
There are always consequences...

... but I am saying that you could build a locale that would work. Generally speaking, most programming environments do not look at the Unicode character database for the operations in question, or at least, don't look directly at those tables. They use custom generated tables or code.

From what I know of Java's internal structure, for instance, it would be relatively easy to construct the necessary classes. You can create a rule string for RuleBasedCollator that does collation of @, since the collator doesn't look at the character properties when performing sorting (normalization is another matter, though). A BreakIterator can be fashioned that doesn't break on the @ character. Localized strings (as in DateFormat's list of month names, for example) are just strings. And so on.

The consequences would generally come into play when you encounter code that DOES look at Unicode properties (or looks at a table that is not locale-driven). You'll get transient failures in that case.

IOW: the Unicode properties are not just guides. Building "complete Unicode support" means taking all the special cases and special pleading into account. Creating a new orthography for a minority language should probably take this into account, since what one is doing in a small, insular community may be ignored or resisted by Unicode implementers, especially if the result cannot be easily fit into existing support mechanisms.

The best course of action, if you have the freedom to pursue it, is to choose characters that have properties similar to those of the orthographic unit you are mapping. "@" has lots of problems: it isn't legal as a "word-part" in a URL, for example; it is identified as punctuation (so code that doesn't know about your locale may word- or line-break on it); and it has no case mapping (so you're at the mercy of SpecialCasing, etc.). It is likely that any special cases that you create for ASCII characters will be more of an annoyance for Unicode implementers and thus tend not to be supported. Avoiding the creation of special cases is a Good Idea.

There are, of course, several orthographies, some with quite large speaker populations, that have this potential issue. One that occurs to me might be the Khoisan languages of Africa, which I believe commonly use "!" (U+0021) for a click sound. This is almost exactly the same problem you are describing for Tongva.

Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will note that almost without exception the entries are locale driven. The first stop in creating a new orthography (or computerizing an existing one, perhaps from the days of the typewriter), for my money would probably be to get ISO-639 to issue the language a 2-letter code so you can have locale (and Unicode character database) data tagged with it ;-).

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
---
Internationalization is an architecture. It is not a feature.

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Curtis Clark
> Sent: Friday, July 26, 2002 11:23 PM
> To: [EMAIL PROTECTED]
> Subject: Re: REALLY *not* Tamil - changing scripts (long)
>
>
> Addison Phillips [wM] wrote:
> > Obviously I'm not an expert in these linguistic areas (and hence
> > rarely comment on them), but it seems to me that the lack of other
> > mechanisms makes Unicode an attractive target for criticism in this
> > area.
>
> Certainly no Unicode-bashing was intended (I'm more of a Unicode
> evangelist). I guess I'm confused about the use of Unicode character
> properties. Are you saying that, even though Unicode defines U+0027 as
> punctuation, other, I could use it as a glottal stop and create a locale
> that would treat it as a letter (and still be "Unicode compliant",
> whatever that is?). And if that's the case, are the Unicode properties
> just guides? Could I develop an orthography where YÃÑبձâ would be a
> word, and there would be no consequences?
>
> --
> Curtis Clark http://www.csupomona.edu/~jcclark/
> Mockingbird Font Works http://www.mockfont.com/
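The distinction drawn above, between locale-aware code that can be taught about "@" and generic code that only sees its Unicode properties, can be sketched in a few lines of Python. This is a stand-in, not Java's RuleBasedCollator or BreakIterator: the alphabet ordering (with "@" placed after "z") is invented purely for illustration, the sample strings are made up and are not real words, and sort_key and locale_words are hypothetical helper names.

```python
import re
import unicodedata

# Hypothetical alphabet for illustration only: "@" is a letter sorting after "z".
ALPHABET = "abcdefghijklmnopqrstuvwxyz@"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}


def sort_key(word: str):
    """Locale-style sort key: alphabet rank first, unknown characters last."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word.lower()]


# A locale-aware word pattern that treats "@" as part of a word.
LOCALE_WORD_RE = re.compile(r"[\w@]+")


def locale_words(text: str):
    """Split text into words, keeping "@" inside them."""
    return LOCALE_WORD_RE.findall(text)


if __name__ == "__main__":
    samples = ["t@", "ta", "tz"]
    print(sorted(samples))                 # code-point order puts "t@" first
    print(sorted(samples, key=sort_key))   # tailored order puts "t@" last
    print(locale_words("p@xa ne"))         # "@" stays inside the word
    print(re.findall(r"\w+", "p@xa ne"))   # generic code splits the word at "@"
    print(unicodedata.category("@"))       # Po: what property-driven code sees
```

The tailored functions behave as the orthography wants, but any code path that falls back to the plain properties (the last two lines) still treats "@" as punctuation, which is where the failures described in the message come from.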