Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-30 Thread Peter_Constable


On 07/30/2002 06:32:07 AM John Cowan wrote:

>> The industry needs to wake up to the fact
>> that the requirement that a language have an ISO-639 2-letter code
>> before a locale can be created is a dead end.
>
>These words deserve to be written up in letters of gold.

I like that. (Where are those chromatic fonts when you need them? :-)


>> Well, I'm pretty sure Hawaiian isn't going to get it,
>
>It seems to me that if the request were backed by the State of Hawaii
>(or one of its agencies) it would meet the ISO 639-1 criteria, save
>perhaps for the number of speakers.  Hawai'ian is an official language
>of Hawaii, after all.

But that's not the sole criterion a language has to meet. The requirements
are explained at
http://linux.infoterm.org/infoterm-e/i-infoterm.htm?raiso639-1_start.htm~Mitte.

Quoting:

   The following criteria for defining new languages in ISO 639-1 has been
   established by the ISO 639 Joint Advisory Committee.

   Relation to ISO 639-2. Since ISO 639-1 is to remain a subset of ISO
   639-2, it must first satisfy the requirements for ISO 639-2 and also
   satisfy the following.

   Documentation:
   - a significant body of existing documents (specialized texts, such as
     college or university textbooks, technical documentation manuals,
     specialized journals, subject-field related books, etc.) written in
     specialized languages
   - a number of existing terminologies in various subject fields (e.g.
     technical dictionaries, specialized glossaries, vocabularies, etc.
     in printed or electronic form)

   Recommendation. A recommendation and support of a specialized authority
   (such as a standards organization, governmental body, linguistic
   institution, or cultural organization)

   Other considerations:
   - the number of speakers of the language community
   - the recognized status of the language in one or more countries
   - the support of the request by one or more official bodies




Hawaiian meets some of these requirements, but it has not (AFAIK) been
developed to the point of having special terminologies in various subject
fields.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-30 Thread John Cowan

[EMAIL PROTECTED] scripsit:

> The industry needs to wake up to the fact
> that the requirement that a language have an ISO-639 2-letter code before a
> locale can be created is a dead end. 

These words deserve to be written up in letters of gold.

> Well, I'm pretty sure Hawaiian isn't going to get it,

It seems to me that if the request were backed by the State of Hawaii
(or one of its agencies) it would meet the ISO 639-1 criteria, save
perhaps for the number of speakers.  Hawai'ian is an official language
of Hawaii, after all.

> Instead of asking
> for a 2-letter code, the engineers should have been looking at what it
> would take to make the software support a 3-letter code (which already
> exists in ISO 639-2).

Indeed, an arbitrary-length string should be supported.  GNOME at least
seems to have no trouble with this: the art-lojban localization has not
run into any problems with its 10-character language code.
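
For what it's worth, a locale library that validates tags against the RFC 3066
grammar rather than a hard-coded two-letter pattern handles all of these
naturally. A minimal Java sketch (illustrative only; the class name and test
tags are just examples):

    import java.util.regex.Pattern;

    public class LangTagCheck {
        // RFC 3066 shape: a primary subtag of 1-8 letters, then zero or more
        // subtags of 1-8 letters or digits, separated by hyphens.
        private static final Pattern TAG =
            Pattern.compile("[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*");

        public static void main(String[] args) {
            // Two-letter, three-letter and ten-character tags all pass.
            for (String t : new String[] { "en", "haw", "art-lojban" }) {
                System.out.println(t + " -> " + TAG.matcher(t).matches());
            }
        }
    }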

-- 
John Cowan<[EMAIL PROTECTED]> 
http://www.reutershealth.com  http://www.ccil.org/~cowan
Yakka foob mog.  Grug pubbawup zink wattoom gazork.  Chumble spuzz.
-- Calvin, giving Newton's First Law "in his own words"




RE: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Addison Phillips [wM]

>
> > One that occurs to me might be the Khoisan languages of Africa,
> > which I believe commonly use "!" (U+0021) for a click sound.
> > This is almost exactly the same problem you are describing for Tongva.
>
> U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was
> encoded precisely for this. It is to be *distinguished* from
> U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems
> which would attend having a punctuation mark as part of your letter
> orthography. A Khoisan orthography keyboard should distinguish the
> two characters (if, indeed, it makes any use at all of the exclamation
> mark per se), so that users can tell them apart and enter them
> correctly.
>
Amazing! It is there (and has been "forever", since it has a Unicode 1.0
name) and doesn't even normalize to ol' U+0021. Nonetheless, I suspect that
the exclamation mark's origin was in the use of ASCII for the otherwise
unrepresented sounds, and that the "should" in your note remains at least
somewhat unrealized. A brief Googling of Khoisan produces pages that use !,
#, //, and ' for the clicks encoded by U+01C0..U+01C3, including the Rosetta
Project page, which is encoded in UTF-8 (!!) but uses the ASCII characters,
not the specially encoded variants cited.

Of course, none of the sites I searched was actually IN one of these
languages. Every one that I saw was in English (one had a link to an
Afrikaans page). Perhaps the various Khoisan peoples who have web pages are
using the Unicode characters in question. But the prevalence of English (or
at least Western European) keyboards and systems has probably encouraged the
widespread non-adoption of the correct characters (hence, this may be the
example that proves the rule, although I can't think of anything else that
looks more like a click than a bang or an octothorpe ;-).

Best Regards,

Addison






Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Kenneth Whistler


> One that occurs to me might be the Khoisan languages of Africa, 
> which I believe commonly use "!" (U+0021) for a click sound. 
> This is almost exactly the same problem you are describing for Tongva.

U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was
encoded precisely for this. It is to be *distinguished* from
U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems
which would attend having a punctuation mark as part of your letter
orthography. A Khoisan orthography keyboard should distinguish the
two characters (if, indeed, it makes any use at all of the exclamation
mark per se), so that users can tell them apart and enter them
correctly.
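
To make the difference concrete, here is a small Java sketch (illustrative
only; exact segment counts can vary a little across JDK versions, but the
click letter should stay inside the word while the bang splits it):

    import java.text.BreakIterator;
    import java.util.Locale;

    public class ClickVsBang {
        public static void main(String[] args) {
            // U+01C3 is a letter, so the word iterator keeps "a\u01C3ua" in
            // one piece; U+0021 is punctuation, so "a!ua" is split around it.
            for (String s : new String[] { "a\u01C3ua", "a!ua" }) {
                BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
                bi.setText(s);
                int segments = 0;
                for (int end = bi.next(); end != BreakIterator.DONE; end = bi.next()) {
                    segments++;
                }
                System.out.println(s + " -> " + segments + " segment(s)");
            }
        }
    }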

--Ken





RE: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Addison Phillips [wM]

I know, hence the jocular tone with wink-and-smile. You are much more likely
to get people's attention if you have a by-god two-letter code than if you
don't. (Today) you just can't ignore the perception that two-letter codes
are somehow "legit" and three-letter codes somehow aren't... nor the fact
that too many locale structures are built explicitly around the two-letter
flavor.

On the other hand, I suspect that the two-letter dogma is more past history
than actual technical requirement. For example, there are real Solaris
locales with names like "japanese". Java lets you ask for or construct a
locale from any pair/trio of strings (such a locale doesn't have any meaning,
since you can't populate the data files). And so on. Just because no one
makes locales using 3-letter codes doesn't mean it's technically impossible.
(But it doesn't mean there are no restrictions, either.)
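
To make that concrete, a minimal Java sketch (illustrative only; "haw" is the
ISO 639-2 code for Hawaiian, and whether the platform has any data behind the
identifier is a separate question):

    import java.util.Locale;

    public class ThreeLetterLocale {
        public static void main(String[] args) {
            // The Locale constructor accepts arbitrary strings, so nothing in
            // the API itself blocks a locale keyed on a 3-letter 639-2 code;
            // what's missing is the data behind it, not the identifier.
            Locale haw = new Locale("haw", "US");
            System.out.println(haw);                      // haw_US
            System.out.println(haw.getDisplayLanguage()); // depends on the JDK's locale data
        }
    }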

Of course, I understand why a company might make a business decision not to
make and support a locale for a language that doesn't qualify for a
two-letter code. The lack of compelling business reasons to build, change, or
test support for minority languages is probably more of a limiting factor
here than active engineering work preventing it.

Addison

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of [EMAIL PROTECTED]
> Sent: Monday, July 29, 2002 3:02 PM
> To: [EMAIL PROTECTED]
> Subject: Re: (long) Making orthographies computer-ready (was *not*
> Telephoning Tamil)
>
>
>
> On 07/29/2002 03:56:36 PM "Addison Phillips [wM]" wrote:
>
> >Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you
> >will note that almost without exception the entries are locale driven.
> >The first stop in creating a new orthography (or computerizing an
> >existing one, perhaps from the days of the typewriter), for my money
> >would probably be to get ISO-639 to issue the language a 2-letter code
> >so you can have locale (and Unicode character database) data tagged
> >with it ;-).
>
> OK, now you've hit a hot button: The industry needs to wake up to the fact
> that the requirement that a language have an ISO-639 2-letter
> code before a
> locale can be created is a dead end. There just aren't enough 2-letter
> codes to go around, and ISO 639-2 has restrictive requirements for doling
> out 2-letter codes -- it wasn't created for the benefit of locale
> implementers, but for the benefit of terminologists. Luiseño and Tongva
> simply are not candidates. This very issue was raised with the
> relevant ISO
> committee in relation to Hawaiian: a 2-letter code was requested
> specifically because someone was trying to get a Unix implementation
> developed and was told by the engineers that it couldn't be done
> without an
> ISO 2-letter code. Well, I'm pretty sure Hawaiian isn't going to get it,
> because it doesn't meet the requirements for ISO 639-1. Instead of asking
> for a 2-letter code, the engineers should have been looking at what it
> would take to make the software support a 3-letter code (which already
> exists in ISO 639-2).
>
>
>
> - Peter
>
>
> --
> -
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
>
>
>
>
>





Re: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Peter_Constable


On 07/29/2002 03:56:36 PM "Addison Phillips [wM]" wrote:

>Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you
>will note that almost without exception the entries are locale driven.
>The first stop in creating a new orthography (or computerizing an
>existing one, perhaps from the days of the typewriter), for my money
>would probably be to get ISO-639 to issue the language a 2-letter code
>so you can have locale (and Unicode character database) data tagged
>with it ;-).

OK, now you've hit a hot button: The industry needs to wake up to the fact
that the requirement that a language have an ISO-639 2-letter code before a
locale can be created is a dead end. There just aren't enough 2-letter
codes to go around, and ISO 639-2 has restrictive requirements for doling
out 2-letter codes -- it wasn't created for the benefit of locale
implementers, but for the benefit of terminologists. Luiseño and Tongva
simply are not candidates. This very issue was raised with the relevant ISO
committee in relation to Hawaiian: a 2-letter code was requested
specifically because someone was trying to get a Unix implementation
developed and was told by the engineers that it couldn't be done without an
ISO 2-letter code. Well, I'm pretty sure Hawaiian isn't going to get it,
because it doesn't meet the requirements for ISO 639-1. Instead of asking
for a 2-letter code, the engineers should have been looking at what it
would take to make the software support a 3-letter code (which already
exists in ISO 639-2).



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>







(long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Addison Phillips [wM]

There are always consequences...

... but I am saying that you could build a locale that would work. Generally speaking, 
most programming environments do not look at the Unicode character database for the 
operations in question, or at least don't look directly at those tables. They use 
custom-generated tables or code. For example, from what I know of Java's internal 
structure, it would be relatively easy to construct the necessary classes.

For example, you can create a rule string for RuleBasedCollator that does collation of 
@, since the collator doesn't look at the character properties when performing sorting 
(normalization is another matter, though). A BreakIterator can be fashioned that 
doesn't break on the @ character. Localized strings (as in DateFormat's list of month 
names, for example) are just strings. And so on.
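
The collation piece, as a minimal Java sketch (illustrative only; it assumes
the default Collator is a RuleBasedCollator, as it is in the standard JDK,
and simply appends a tailoring rule for '@' to its rules):

    import java.text.Collator;
    import java.text.ParseException;
    import java.text.RuleBasedCollator;
    import java.util.Arrays;

    public class AtAsLetter {
        public static void main(String[] args) throws ParseException {
            // Take the default rules and add a rule placing '@' right after
            // 'a' as a base letter. '@' is rule syntax, so it must be quoted.
            RuleBasedCollator base = (RuleBasedCollator) Collator.getInstance();
            RuleBasedCollator collator =
                new RuleBasedCollator(base.getRules() + " & a < '@'");

            String[] words = { "ab", "a@a", "aa" };
            Arrays.sort(words, collator);
            System.out.println(Arrays.asList(words)); // [aa, a@a, ab]
        }
    }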

The consequences would generally come into play when you encounter code that DOES look 
at Unicode properties (or looks at a table that is not locale-driven). You'll get 
transient failures in that case.

IOW, the Unicode properties are not just guides. Building "complete Unicode support" 
means taking all the special cases and special pleading into account. Creating a new 
orthography for a minority language should probably take this into account, since what 
one is doing in a small, insular community may be ignored or resisted by Unicode 
implementers, especially if the result cannot be easily fit into existing support 
mechanisms.

The best course of action, if you have the freedom to pursue it, is to choose 
characters that have properties similar to those of the orthographic unit you are 
mapping. "@" has lots of problems: it isn't legal as a "word-part" in a URL, for 
example; it is identified as punctuation (so code that doesn't know about your locale 
may word- or line-break on it); and it has no case mapping (so you're at the mercy of 
SpecialCasing, etc.). It is likely that any special cases that you create for ASCII 
characters will be more of an annoyance for Unicode implementers and thus tend not to 
be supported. Avoiding the creation of special cases is a Good Idea.
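
A tiny property-check sketch in Java (purely illustrative; the candidate
characters are just examples) that makes the contrast with a real click
letter visible:

    public class PropertyCheck {
        public static void main(String[] args) {
            // '@' and '!' are general category Po (punctuation); U+01C3
            // LATIN LETTER RETROFLEX CLICK is Lo (a real letter), so
            // locale-unaware code treats it far more gracefully.
            int[] candidates = { '@', '!', 0x01C3 };
            for (int cp : candidates) {
                System.out.printf("U+%04X letter=%b type=%d%n",
                        cp, Character.isLetter(cp), Character.getType(cp));
            }
        }
    }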

There are, of course, several orthographies, some with quite large speaker 
populations, that have this potential issue. One that occurs to me might be the 
Khoisan languages of Africa, which I believe commonly use "!" (U+0021) for a click 
sound. This is almost exactly the same problem you are describing for Tongva.

Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will note that 
almost without exception the entries are locale driven. The first stop in creating a 
new orthography (or computerizing an existing one, perhaps from the days of the 
typewriter), for my money would probably be to get ISO-639 to issue the language a 
2-letter code so you can have locale (and Unicode character database) data tagged with 
it ;-).

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)  
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature. 



> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Curtis Clark
> Sent: Friday, July 26, 2002 11:23 PM
> To: [EMAIL PROTECTED]
> Subject: Re: REALLY *not* Tamil - changing scripts (long)
> 
> 
> Addison Phillips [wM] wrote:
>  > Obviously I'm not an expert in these linguistic areas (and hence
>  > rarely comment on them), but it seems to me that the lack of other
>  > mechanisms makes Unicode an attractive target for criticism in this
>  > area.
> 
> Certainly no Unicode-bashing was intended (I'm more of a Unicode 
> evangelist). I guess I'm confused about the use of Unicode character 
> properties. Are you saying that, even though Unicode defines U+0027 as 
> Punctuation, Other, I could use it as a glottal stop and create a locale 
> that would treat it as a letter (and still be "Unicode compliant", 
> whatever that is)? And if that's the case, are the Unicode properties 
> just guides? Could I develop an orthography where Yßяبձ⁋ would be a 
> word, and there would be no consequences?
> 
> -- 
> Curtis Clark  http://www.csupomona.edu/~jcclark/
> Mockingbird Font Works  http://www.mockfont.com/
> 
> 
> 
>