Re: Common Locale Data Repository Project

2004-04-24 Thread Peter Kirk
On 23/04/2004 17:15, Philippe Verdy wrote:

...

Think more recently about the new codification for Serbo-Croatian, and the split
of sh, with no definition except that it is country based (Serbian, Croatian,
Bosnian, Montenegrin), assuming that one country uses only one language when
in fact there are several in the same one, that are shared by multiple
countries, and differ mostly by their script...
 

These are languages which were probably originally somewhat artificially
unified, to be the main language of the old Yugoslavia, and which since 
the old Yugoslavia fell apart have rapidly diverged.

When it comes down to it, whether the speech varieties used in two 
different areas are counted as one language or as separate ones is down 
to the choice and self-perception of the speakers. For now, many 
Belgians prefer to say that they speak French, although their spoken 
dialect is no doubt quite different from Parisian French and their 
written form is not identical. A time may come when they decide they 
want their own language, Walloon. At that time they will no doubt ask 
for appropriate ISO etc codes. That would be the choice of the people of 
Belgium, and it would not be the business of standards committees (or the
French) to tell them what to call their language.

A language has been defined as a dialect with an army.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Common Locale Data Repository Project

2004-04-24 Thread Peter Constable
 From: Mark Davis [mailto:[EMAIL PROTECTED]


 You can reiterate it all you want; in practice, 3066 tags are used as locale
 identifiers. And for a narrow sense of locales, that is perfectly reasonable.
 For a broad sense of locale, including timezone, user's currency, religious
 preference, etc., it clearly would not be reasonable, and I would agree with
 you for that.

But there are a lot of people that don't know enough to recognize that
difference. So, even though a language identifier may be sufficient in
many cases to name a locale, it is IMO very unhelpful to refer to RFC
3066 tags as locale identifiers as it perpetuates and leads people into
wrong assumptions. Please help improve common understanding by not
referring to them as locale IDs.



 ISO 639 is not unstable. It is an open code set that is being added to
 over time, but I don't think that should be referred to as unstable --
 that term suggests other things.
 
 ISO 3066 has *demonstrated* instability,

I take it you mean ISO 3166? I did not make any claim in that regard.


 However, there is no policy documented *anywhere* that says they won't.

I'm working on it. The ISO 639/RA-JAC has acknowledged the need for
stability. Getting into the normative text of the standards takes a
little time.


Peter Constable



RE: Common Locale Data Repository Project

2004-04-24 Thread Peter Constable
 A time may come when they decide they
 want their own language, Walloon. At that time they will no doubt ask
 for appropriate ISO etc codes.

There's nothing futuristic about that: wln
(http://www.loc.gov/standards/iso639-2/englangn.html#uvwxyz)



Peter Constable



RE: Common Locale Data Repository Project

2004-04-24 Thread Peter Constable
 From: Philippe Verdy [mailto:[EMAIL PROTECTED]


 What is already unstable in ISO639 is the deprecation of iw and the addition
 of he, same thing for in and id or for yi and ji. Don't you call that
 instability?

I think there is a misunderstanding here. As I understand it, ISO 639-1
actually never included iw, in or ji. But somehow, something got
published listing those (I don't know those exact details). So there was
mixed info out there indicating both iw and he, etc. To resolve the
apparent ambiguity, the ISO 639/RA-JAC had to state that the IDs iw,
in and ji were deprecated.


 Think more recently about the new codification for Serbo-Croatian, and the
 split of sh, with no definition except that it is country based (Serbian,
 Croatian, Bosnian, Montenegrin), assuming that one country uses only one
 language when in fact there are several in the same one, that are shared by
 multiple countries, and differ mostly by their script...

I don't disagree that there are some difficult areas, such as this.
The differences intended by sr, bs and hr do *not* have to do with
script -- i.e. one cannot assume that any of these imply any particular
script. They also don't imply a particular region (Serbian could be
spoken outside Serbia), though clearly one country is most likely. They
*do* imply linguistic differences. Here's the difficulty: in those
countries, claims are made that there are linguistic differences, so
much so that it is problematic to sell products there that claim support
for Serbo-Croatian. On the other hand, given a document in one of
these, it's difficult to say that it's specifically one of them and not
the other two. ISO 639-3 will provide a macro-language identifier for
Serbo-Croatian, so it will be possible to tag a document without making
that distinction.


 
 Also if ISO3166 is unstable 

I made no claim regarding stability of ISO 3166. 


 Serbia-Montenegro?), then it introduces instability too within ISO 3066 or
 its proposed replacement

1. It is an IETF specification, not an ISO standard; the designation is
**RFC** 3066.

2. The draft successor to RFC 3066 addresses this very issue.

3. (a bit on the nit-picking side, IMO, but there have been three
comments on this) RFC 3066 will be *superseded*, not replaced.




 For now, the only workable solution to solve these issues is found in
 supplementary libraries in ICU which support locale aliases. (Yes I use the
 term Locale because this is the term that Java gives to this
 identification,

NO. That is the term Java (and other things) give to a *different*
identification. There are languages, there are cultures/locales. The two
are not the same.



Peter Constable



Re: Common Locale Data Repository Project

2004-04-24 Thread Philippe Verdy
From: Peter Constable [EMAIL PROTECTED]
  For now, the only workable solution to solve these issues is found in
  supplementary libraries in ICU which support locale aliases. (Yes I
  use the term Locale because this is the term that Java gives to this
  identification,

 NO. That is the term Java (and other things) give to a *different*
 identification. There are languages, there are cultures/locales. The two
 are not the same.

Then there will remain a problem in Java locales, unless the Java community
accepts that the language part of a locale will contain the language
subtags of RFC 3066 or its successor, so that the API can implement a language
resolver for that part only, ignoring the second and third parameter that will
be used only to specify other (non-language) elements of a Locale.

For now it's well known that if you create a Java application with resource
bundles for Hebrew, you have to use the iw language parameter to name your
bundle; if you use he, then the same properties file or class part of a bundle
will not be found on an OS that the Java runtime determines as supporting the
iw locale, and the application will then display only the default locale (most
often English). Note that Hebrew is part of the set of fully supported languages
in Java. I doubt that the JRE will be changed to use the he code by default as
long as the locale resolver in Java is not updated to use a cleverer algorithm
than simple equality of language codes.
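A minimal sketch of the alias-aware lookup this paragraph asks for, written in
Python for brevity rather than Java. The alias pairs are the ones named in this
thread; the function names and the `Messages_<lang>` naming scheme are
hypothetical illustrations, not the Java ResourceBundle API.

```python
# Deprecated -> current ISO 639 code pairs mentioned in this thread.
ALIASES = {"iw": "he", "in": "id", "ji": "yi"}
# Reverse direction, so a lookup works whichever form a bundle was named with.
REVERSE = {new: old for old, new in ALIASES.items()}


def equivalent_codes(lang):
    """Return the requested code followed by any alias of it."""
    lang = lang.lower()
    codes = [lang]
    for table in (ALIASES, REVERSE):
        if lang in table:
            codes.append(table[lang])
    return codes


def find_bundle(base, lang, available, default=None):
    """Pick a bundle name for lang, trying its aliases before the default."""
    for code in equivalent_codes(lang):
        name = f"{base}_{code}"
        if name in available:
            return name
    return default
```

With a lookup of this shape, a bundle named with he is found for a runtime that
reports iw, and the reverse, instead of falling through to the default bundle.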

Same problem for the Traditional Chinese language: Java supports it natively only
with the TW country code separately from the zh language code. If things
must change later, the Java runtime should learn to work with a zh-Hant
language identifier to be used in every country where the language is used.
Using zh_TW (i.e. a separate zh language code and the separate TW country
code) has the bad effect of also applying other locale standards appropriate only
for Taiwan, but not for Macau, Hong Kong, Singapore, Reunion and other
Indian Ocean, South Asian and South African countries or territories where this
language is used with other national locale conventions (currency, time and
numeric formats, phone numbers...)

In fact I would like to see that Traditional and Simplified Chinese are
distinct languages in the same family. And an application would better use zht
and zhs language codes to make the distinction, so that zh would become an
identifier for a family of Han-written languages, rather than a language
identifier, and so a legacy code. This means also changes in the Locale resolver
so that an OS and user locale which indicates zhs or zht will first look for
resources marked with their respective language code, and later will attempt to
use a zh resource if not found.

A Locale resolver should be able to determine, from each properties or class of
a bundle, which codes it may support, and a degree/priority of matching face to
other localized resources. But I have not seen anything that suggests that an
application may be able to provide such Locale resolver; for now each
application has to write its own resolver to map a user locale to a matching
application-defined supported locale. The automatic resolver in Java (other
systems such as POSIX have the same caveats) seems quite weak, as does the
slightly more general resolution order currently suggested in RFC 3066, which is
exactly what was implemented in Java...
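For reference, the truncation-style resolution order criticised here can be
sketched as follows. This is an illustration in Python under stated
assumptions, not Java's actual resolver: the function names and the "en"
default are hypothetical.

```python
def fallback_chain(tag):
    """List candidates from most to least specific by dropping trailing subtags."""
    subtags = tag.replace("_", "-").split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def resolve(tag, available, default="en"):
    """Return the first available identifier that matches a candidate."""
    # Compare case-insensitively but return the identifier as the caller wrote it.
    by_lower = {name.lower(): name for name in available}
    for candidate in fallback_chain(tag):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return default
```

A request for zh-Hant-HK resolves to zh-Hant and then zh, but a request for
he-IL never reaches a resource tagged iw, because prefix truncation alone knows
nothing about aliases.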




RE: Common Locale Data Repository Project

2004-04-24 Thread Peter Constable
 From: Philippe Verdy [mailto:[EMAIL PROTECTED]

 In fact I would like to see that Traditional and Simplified Chinese are
 distinct languages in the same family. And an application would better use
 zht and zhs language codes to make the distinction, so that zh would become
 an identifier for a family of Han-written languages, rather than a language
 identifier, and so a legacy code.

In ISO 639-3, zh will be considered a macro-language identifier. But zhs
and zht would not be good ideas, and will not be considered for ISO 639
or for RFC 3066.


Peter Constable



Re: Common Locale Data Repository Project

2004-04-24 Thread Mark Davis
comments below.

Mark
__
http://www.macchiato.com
  

- Original Message - 
From: Peter Constable [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Sat, 2004 Apr 24 06:12
Subject: RE: Common Locale Data Repository Project


  From: Mark Davis [mailto:[EMAIL PROTECTED]


  You can reiterate it all you want; in practice, 3066 tags are used as locale
  identifiers. And for a narrow sense of locales, that is perfectly reasonable.
  For a broad sense of locale, including timezone, user's currency, religious
  preference, etc., it clearly would not be reasonable, and I would agree with
  you for that.

 But there are a lot of people that don't know enough to recognize that
 difference. So, even though a language identifier may be sufficient in
 many cases to name a locale, it is IMO very unhelpful to refer to RFC
 3066 tags as locale identifiers as it perpetuates and leads people into
 wrong assumptions. Please help improve common understanding by not
 referring to them as locale IDs.

I disagree. There is, as I have said, a perfectly reasonable, narrow sense of
locale which is essentially identical to what is captured by RFC 3066. And in
practice, RFC 3066 is often used with that meaning. I don't see any need to deny
reality (at least not in this area ;-)

As I said before, for a broader sense of locale, RFC 3066 is not sufficient to
capture everything that anyone has meant by that term.




  ISO 639 is not unstable. It is an open code set that is being added to
  over time, but I don't think that should be referred to as unstable --
  that term suggests other things.
 
  ISO 3066 has *demonstrated* instability,

 I take it you mean ISO 3166? I did not make any claim in that regard.

My typo: I meant ISO 3166.



  However, there is no policy documented *anywhere* that says they won't.

 I'm working on it. The ISO 639/RA-JAC has acknowledged the need for
 stability. Getting into the normative text of the standards takes a
 little time.

That's great -- any way we can help with that?



 Peter Constable






Re: Common Locale Data Repository Project

2004-04-23 Thread Antoine Leca
On Friday, April 23, 2004 7:02 AM
Peter Constable [EMAIL PROTECTED] wrote:

 due to the strong perception of OpenI18N.org as
 opensource/Linux advocates, even though CLDR project is not
 specifically bound to Linux.

 It is hard to look at OpenI18N.org's spec and not get the impression
 that all of that group's projects are not bound to some flavour of
 Unix.

While CLDR certainly originates _from_ the Linux community, it is not
_bound_ to it. That is, as far as I understand, it is the same data that ICU
uses, and to my knowledge ICU also runs on Windows, which is in no way bound
to [that] flavour of Unix.

Or are you saying that, inasmuch as some advocate that everything from
Microsoft is so evil that one should not even touch it, everything that
originates from Linux is not pure enough to be run on other systems?  :-)


 The Scope clause for several sections are specifically
 expressed in terms of Unix-related implementations (e.g. having the
 scope for rendering requirements expressed as what is needed for X
 Window).

Where are these clauses?
By the way, X Window, while Unix-related, is not bound to it. For example, for
years I ran an X client on a Windows desktop OS, with the server running on
another non-Unix machine. In fact, we did that because the equivalent
technology from Microsoft was at the time, ahem, not very mature...


 And even if a section isn't scoped specifically in terms of a
 Unix-derived platform, it may specify requirements that are explicitly
 related to Unix implementations (e.g. that base libraries must support
 POSIX i18n environment variables).

Again, where is it said that CLDR requires any form of base libraries, much
less one that supports POSIX variables?


Antoine




Re: Common Locale Data Repository Project

2004-04-23 Thread Philippe Verdy
From: Antoine Leca [EMAIL PROTECTED]
  And even if a section isn't scoped specifically in terms of a
  Unix-derived platform, it may specify requirements that are explicitly
  related to Unix implementations (e.g. that base libraries must support
  POSIX i18n environment variables).

 Again, where is it said that CLDR require any form of base libraries, much
 less one that support POSIX variables?

POSIX variables are normally part of most implementations of languages supported
on Windows and MacOS too.
It's true that Windows and MacOS have deprecated the use of environment variables
for system-wide configuration or user settings, but this does not mean that this
environment cannot be emulated within a program by a support library. This is
already happening in Java when it is started on Windows.

What is needed in fact is the support of an API in POSIX, but not a particular
system feature. The Java Locale class for example is a minimum implementation
API to support POSIX locales. But it could become richer later.

In fact if ISO 3066 is later standardized, the designation and use of locales
could become its own API supporting standard identifiers. In fact the exact
syntax of compound locale identifiers appears to me just as a parsable
serialization of a more complete LocaleID object. On Windows and MacOS these
identifiers can be translated to/from native system identifiers. With the CLDR
data, this mapping of locale ids could become more documented and more stable.

I think that the CLDR database is extremely important for software
implementations, because it avoids some caveats that come from other unstable
standards such as ISO 3166 and ISO 639.

But as this CLDR data will still need to adapt itself to new changes in ISO 3166
(countries and territories will probably continue to change their status, may
merge or split...) and ISO 639 (some new languages may become standardized),
what is needed is another level of abstraction to allow access to locale data
through older identifiers, using some standardized locale resolution algorithm.
Java has such a basic algorithm, which is a bit richer in ICU; if this algorithm
should be tunable by user-settings or by a program, these tunings that control a
locale resolution should be documented as well (notably when mapping from a
locale identifier supported on one system onto another locale identifier on
another system, when the localization resources are not completely identical
between those systems).

What can ease the interchange of locale-sensitive data and methods is the
standardization of a common data encoding (Unicode), common values (CLDR locale
identifiers). So I approve the migration from OpenI18n.org to Unicode.org which
will ease the interoperability of systems and interchange of internationalized
data.




Re: Common Locale Data Repository Project

2004-04-23 Thread Hideki Hiura - [EMAIL PROTECTED]
 From: Peter Constable [EMAIL PROTECTED]
  due to the strong perception of OpenI18N.org as
  opensource/Linux advocates, even though CLDR project is not
  specifically bound to Linux.
 It is hard to look at OpenI18N.org's spec and not get the impression
 that all of that group's projects are not bound to some flavour of Unix.

We understand what you mean. Sometimes perception is very important,
and that's why we thought it was a good idea to transfer CLDR.

We started as the Linux Internationalization Initiative (li18nux.org) and
later changed our name and charter to OpenI18N.org to accommodate a wider range
of platforms and platform-neutral I18N technology development, so projects
at OpenI18N.org are not limited to Linux/Unix.

 CLDR doesn't have to be tied to any particular platform -- after all,
 it's just a collection of data.

Yup! So hopefully this move would help more parties to join the
projects.
That would definitely help global interoperability for all platforms
and help everybody.

 But I don't think you can honestly say that OpenI18N isn't tied to a
 particular family of platforms

Most of our current projects are mainly for some flavour of Unix,
since most of the participants' expertise and interests are for those
platforms but we are not limited nor have to be bound to them.

The only requirement for the projects in OpenI18N.org is to be open to
everyone, to be developed in an open process and to be open source.

For example, one of the projects I run, the platform-neutral
multilingual distributed Unicode input method framework, IIIMF, runs
on Windows as well, and I honestly hope Microsoft will adopt IIIMF in
a future release of Windows, so that we can unify the Unicode input
method framework regardless of platform.

Best Regards,
--
[EMAIL PROTECTED],OpenI18N.org,li18nux.org,unicode.org,sun.com} 
Chair, OpenI18N.org/The Free Standards Group  http://www.OpenI18N.org
Architect/Sr. Staff Engineer, Sun Microsystems, Inc, USA   eFAX: 509-693-8356



Re: Common Locale Data Repository Project

2004-04-23 Thread Mark Davis
You are talking about Locale IDs. There is currently work underway on an RFC to
replace 3066 (this is referenced by UTS #35), and one of the features is
stability -- even where the ISO standards are not.

See:

...
http://www.ietf.org/internet-drafts/draft-phillips-langtags-02.txt
http://www.ietf.org/internet-drafts/draft-phillips-langtags-02.pdf

It is also available in HTML format on my private website here:

http://www.inter-locale.com/ID/draft-phillips-langtags-02.html

I will also be posting our issues list with resolutions and a link to the recent
presentation by Mark and myself at the Unicode conference on that site.

This version contains a few changes based on discussion on this list, notably it
more closely defines the rules for using UN M49 identifiers to resolve
ambiguity. It also contains semi-substantial wordsmithing in section 2 which is
not substantive, but which does make the rules (we think) clearer and easier to
understand.

Best Regards,

Addison

Mark
__
http://www.macchiato.com
  

- Original Message - 
From: Philippe Verdy [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Fri, 2004 Apr 23 02:58
Subject: Re: Common Locale Data Repository Project


 From: Antoine Leca [EMAIL PROTECTED]
   And even if a section isn't scoped specifically in terms of a
   Unix-derived platform, it may specify requirements that are explicitly
   related to Unix implementations (e.g. that base libraries must support
   POSIX i18n environment variables).
 
  Again, where is it said that CLDR require any form of base libraries, much
  less one that support POSIX variables?

 POSIX variables are normally part of most implementations of languages
 supported on Windows and MacOS too.
 It's true that Windows and MacOS has deprecated the use of environment
 variables for system-wide configuration or user settings, but this does not
 mean that this environment cannot be emulated within a program by a support
 library. This is already happening in Java when it is started on Windows.

 What is needed in fact is the support of an API in POSIX, but not a particular
 system feature. The Java Locale class for example is a minimum implementation
 API to support POSIX locales. But it could become more rich later.

 In fact if ISO 3066 is later standardized, the designation and use of locales
 could become its own API supporting standard identifiers. In fact the exact
 syntax of compound locale identifiers appears to me just as a parsable
 serialization of a more complete LocaleID object. On Windows and MacOS these
 identifiers can be translated to/from native system identifiers. With the CLDR
 data, this mapping of locale ids could become more documented and more stable.

 I think that the CLDR database is extremely important for software
 implementations, because it avoids some caveats that come from other unstable
 standards such as ISO 3166 and ISO 639.

 But as this CLDR data will still need to adapt itself to new changes in ISO
 3166 (countries and territories will probably continue to change their status,
 may merge or split...) and ISO 639 (some new languages may become
 standardized), what is needed is another level of abstraction to allow
 accessing to locale data using older identifiers using some standardized
 locale resolution algorithm. Java has such a basic algorithm, which is a bit
 richer in ICU; if this algorithm should be tunable by user-settings or by a
 program, these tunings that control a locale resolution should be documented
 as well (notably when mapping from a locale identifier supported on one system
 onto another locale identifier on another system, when the localization
 resources are not completely identical between those systems).

 What can ease the interchange of locale-sensitive data and methods is the
 standardization of a common data encoding (Unicode), common values (CLDR
 locale identifiers). So I approve the migration from OpenI18n.org to
 Unicode.org which will ease the interoperability of systems and interchange of
 internationalized data.







RE: Common Locale Data Repository Project

2004-04-23 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mark Davis


 You are talking about Locale IDs. There is currently work underway on an RFC
 to replace 3066

But let me reiterate from my correction to Philippe: even the
replacement of RFC 3066 is a specification for *language*
identification, not *locale* identification.


Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division




RE: Common Locale Data Repository Project

2004-04-23 Thread Peter Constable
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Philippe Verdy

 In fact if ISO 3066 is later standardized, the designation and use of
 locales could become its own API supporting standard identifiers.

I really don't want to get into this discussion but can't let this point
go by: RFC 3066 (not ISO) is not a specification for *locale*
identification. It is a specification for *language* identification.
There are many possible cases in which this distinction is very
important.


 I think that the CLDR database is extremely important for software
 implementations, because it avoids some caveats that come from other
 unstable standards such as ISO 3166 and ISO 639.

ISO 639 is not unstable. It is an open code set that is being added to
over time, but I don't think that should be referred to as unstable --
that term suggests other things.


Peter
 
Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division



RE: Common Locale Data Repository Project

2004-04-23 Thread Michael Everson
At 16:18 -0700 2004-04-23, Peter Constable wrote:

 But let me reiterate from my correction to Philippe: even the
 replacement of RFC 3066 is a specification for *language*
 identification, not *locale* identification.

And it is to supersede RFC 3066, with a new edition. That's different
from replacing.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



RE: Common Locale Data Repository Project

2004-04-23 Thread Mike Ayers
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of Michael Everson
 Sent: Friday, April 23, 2004 4:31 PM


 At 16:18 -0700 2004-04-23, Peter Constable wrote:
 
 But let me reiterate from my correction to Philippe: even the
 replacement of RFC 3066 is a specification for *language*
 identification, not *locale* identification.
 
 And it is to supercede RFC 3066, with a new edition. That's different 
 from replacing.


Furthermore, in the IETF document architecture, the only way to amend an RFC is by a superseding RFC. RFCs are superseded all the time.


/|/|ike





Re: Common Locale Data Repository Project

2004-04-23 Thread John Cowan
Mike Ayers scripsit:

   Furthermore, in the IETF document architecture, the only way to
 amend an RFC is by a superceding RFC.  RFCs are superceded all the time.

Almost.  It's also possible for an RFC to update older RFCs without
superseding them completely.  For example, RFC 2396 (URI syntax)
updates RFC 1738 (URL syntax), leaving the parts specific to
certain URL schemes still in effect.

-- 
John Cowan   www.reutershealth.com   www.ccil.org/~cowan   [EMAIL PROTECTED]
Lope de Vega: It wonders me I can speak at all.  Some caitiff rogue did
rudely yerk me on the knob, wherefrom my wits still wander.
An Englishman: Ay, a filchman to the nab betimes 'll leave a man  
crank for a spell. --Harry Turtledove, Ruled Britannia



Re: Common Locale Data Repository Project

2004-04-23 Thread Philippe Verdy
From: Peter Constable [EMAIL PROTECTED]
  I think that the CLDR database is extremely important for software
  implementations, because it avoids some caveats that come from other
 unstable
  standards such as ISO 3166 and ISO 639.

 ISO 639 is not unstable. It is an open code set that is being added to
 over time, but I don't think that should be referred to as unstable --
 that term suggests other things.

By unstable I mean in fact ambiguous, even for the correct designation of
languages with a code that can be recognized. Even the proposal to supercede ISO
3066 with new tags has its caveats: which code must an application use when it
already defines multiple ones (is this number bound?) to refer to the same
language.

The problem comes within software when a user specifies a preferred language
in his locale with a code that will not be understood by an application that
only understands another one. This becomes worse when one piece of software
requires one code in the user's locale to support a language and another
requires a different code in the user's locale to support the same language.

Look for example at the case of Norwegian: is it no, nn or nb or no-nynorsk or
no-bokmal ?
Even with the algorithm based on common prefixes, you won't be able to match
them all. So there's a need to specify an algorithm that allows aliases to be
resolved. With multi-tag language identifiers the resolution order becomes
unpredictable if one supports aliases for one subtag and not the other.

What is already unstable in ISO639 is the deprecation of iw and the addition
of he, same thing for in and id or for yi and ji. Don't you call that
instability? OK these codes are deprecated, not reassigned. But they still cause
problems.

Think more recently about the new codification for Serbo-Croatian, and the split
of sh, with no definition except that it is country based (Serbian, Croatian,
Bosnian, Montenegrin), assuming that one country uses only one language when
in fact there are several in the same one, that are shared by multiple
countries, and differ mostly by their script...

Also if ISO3166 is unstable (CS: is that the former Czechoslovakia or the newer
Serbia-Montenegro?), then it introduces instability too within ISO 3066 or its
proposed replacement... for the identification of languages.

For now, the only workable solution to solve these issues is found in
supplementary libraries in ICU which support locale aliases. (Yes I use the
term Locale because this is the term that Java gives to this identification,
based on a language code consisting of a single subtag, a country/territory
code and a variant code with possibly multiple subtags, and no reference to the
needed script code; I wonder how the newer RFC 3066 model will fit here).




Re: Common Locale Data Repository Project

2004-04-23 Thread John Cowan
Philippe Verdy scripsit:

 By unstable I mean in fact ambiguous, even for the correct designation
 of languages with a code that can be recognized. Even the proposal to
 supercede ISO 3066 with new tags has its caveats: which code must an
 application use when it already defines multiple ones (is this number
 bound?) to refer to the same language.

RFC 3066 always requires that the 2-letter code be used in place of either
3-letter code if it exists.  In all other cases, there is only one 3-letter
code, and it is used.
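As a rough illustration of that rule, here is a Python sketch that normalises
an ISO 639-2 code to the form RFC 3066 expects. The table is a tiny
hand-picked sample, not the full code list, and the function name is
hypothetical.

```python
# Sample of ISO 639-2 codes that have an ISO 639-1 two-letter equivalent.
# Both the bibliographic and terminology forms map to the same short code.
THREE_TO_TWO = {
    "fre": "fr", "fra": "fr",
    "ger": "de", "deu": "de",
    "heb": "he",
    "nor": "no",
    "wln": "wa",
}


def primary_subtag(code):
    """Prefer the 2-letter code when one exists, else keep the 3-letter code."""
    code = code.lower()
    return THREE_TO_TWO.get(code, code)
```

A code such as haw, which has no two-letter form, passes through unchanged, so
each language ends up with exactly one primary subtag.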

Some codes are vague, in the sense that they do not fully specify which
language is in use.  For that reason, ISO 639-3 is being defined as an
upward compatible extension of ISO 639-2.

 Look for example the case of Norwegian: is it no, nn or nb or no-nynorsk or
 no-bokmal ?

There are two issues here.  The first is that no-nynorsk and no-bokmal are now
deprecated codes: that is, no application should require them, every
application that accepts nn or nb should accept them, and no application
should produce them.  Older versions will be less forgiving and should be
upgraded.

The second is that no is unique, or nearly so: it designates nn and nb
jointly.  Now everyone who can read one can read the other, so Norwegian
applications should accept any of no, nb, nn in data.  But no is meaningless
to a spell-checker, which should require either nb or nn.
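The two behaviours described here can be sketched in Python as below. This is
a minimal illustration under stated assumptions: the function names are
hypothetical, and only the Norwegian case from this message is modelled.

```python
# Codes that Norwegian readers can be expected to accept interchangeably.
NORWEGIAN = {"no", "nb", "nn"}


def primary(tag):
    """Return the lowercased primary language subtag of a tag."""
    return tag.replace("_", "-").split("-")[0].lower()


def accepts(app_tag, data_tag):
    """True if an application for app_tag should accept data tagged data_tag."""
    app, data = primary(app_tag), primary(data_tag)
    if app in NORWEGIAN and data in NORWEGIAN:
        return True
    return app == data


def spellcheck_language(tag):
    """Return the specific language a spell-checker needs, rejecting vague 'no'."""
    lang = primary(tag)
    if lang == "no":
        raise ValueError("'no' is too vague for spell-checking; use nb or nn")
    return lang
```

The same tag is therefore good enough for display and matching but not for a
task that depends on the orthography.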

 What is already unstable in ISO639 is the deprecation of iw and
 the addition of he, same thing for in and id or for yi and
 ji. Don't you call that instability? OK these codes are deprecated,
 not reassigned. But they still cause problems.

Not really.  Again, all applications should generate he and accept
both iw and he.

 Also if ISO3166 is unstable (CS: is that the former Czechoslovakia
 or the newer Serbia-Montenegro?), then it introduces instability too
 within ISO 3066 or its proposed replacement... for the identification
 of languages.

RFC 3066bis specifies that CS will always mean Czechoslovakia, and the
highly stable 3-digit code will be used for Serbia-Montenegro.


 For now, the only workable solution to solve these issues is found in
 supplementary libraries in ICU which support locale aliases. (Yes I
 use the term Locale because this is the term that Java gives to this
 identification, based on a language code consisting of a single
 subtag, a country/territory code and a variant code with possibly
 multiple subtags, and no reference to the needed script code; I wonder
 how the newer RFC 3066 model will fit here).

Language specifiers are conceptually different from locale specifiers.
One might specify a locale of da_us to mean Danish language, U.S.
measurement systems, but the language da-us would be the U.S. dialect
of Danish, a very different thing.
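
One way to keep the two notions apart in code is to give them distinct types, so that a locale can never be passed where a language tag is expected. A minimal Python sketch, in which every class and function name is hypothetical and parsing is reduced to a single split:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LanguageTag:
    language: str
    region: Optional[str]      # narrows the language to a regional variety

@dataclass(frozen=True)
class Locale:
    language: str
    territory: Optional[str]   # selects conventions such as measurement units

def parse_language_tag(tag: str) -> LanguageTag:
    """Parse a hyphen-separated language tag such as 'da-us'."""
    lang, _, region = tag.partition("-")
    return LanguageTag(lang.lower(), region.upper() or None)

def parse_locale(name: str) -> Locale:
    """Parse an underscore-separated locale name such as 'da_us'."""
    lang, _, territory = name.partition("_")
    return Locale(lang.lower(), territory.upper() or None)
```

With this split, parse_language_tag("da-us") and parse_locale("da_us") carry the same two strings but compare unequal, which mirrors the conceptual difference described above.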

-- 
John Cowan  www.ccil.org/~cowan  www.reutershealth.com  [EMAIL PROTECTED]
In might the Feanorians / that swore the unforgotten oath
brought war into Arvernien / with burning and with broken troth.
and Elwing from her fastness dim / then cast her in the waters wide,
but like a mew was swiftly borne, / uplifted o'er the roaring tide.
--the Earendillinwe



Re: Common Locale Data Repository Project

2004-04-23 Thread Mark Davis
You can reiterate it all you want; in practice, 3066 tags are used as locale
identifiers. And for a narrow sense of locales, that is perfectly reasonable.
For a broad sense of locale, including timezone, user's currency, religious
preference, etc., it clearly would not be reasonable, and I would agree with you
for that.

ISO 639 is not unstable. It is an open code set that is being added to
over time, but I don't think that should be referred to as unstable --
that term suggests other things.

ISO 3066 has *demonstrated* instability, because they remove codes, then reuse
those codes for different entities. It'd be like our removing a character, then
later putting a different character in that spot*.

ISO 639 has not yet *demonstrated* instability. They have removed codes, but
since they haven't reused them, one can handle that with an alias table, keeping
all the old codes usable. However, there is no policy documented *anywhere* that
says they won't. As long as they don't have that, and given the demonstrated
instability in ISO 3066, the standard simply cannot be trusted to be stable in
the future.

* Yes, I know we did that for Korean, when we were first getting started. But we
learned from that, and put into place firm policies against that ever happening
in the future. We have no such assurances from ISO, for some pretty key
components: language codes, country codes, currency codes, or script codes.

Mark
__
http://www.macchiato.com
  

- Original Message - 
From: Peter Constable [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; Philippe Verdy [EMAIL PROTECTED];
Unicode List [EMAIL PROTECTED]
Sent: Fri, 2004 Apr 23 16:18
Subject: RE: Common Locale Data Repository Project


 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Mark Davis


 You are talking about Locale IDs. There is currently work underway on an RFC to
 replace 3066

But let me reiterate from my correction to Philippe: even the
replacement of RFC 3066 is a specification for *language*
identification, not *locale* identification.


Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division





Re: Common Locale Data Repository Project

2004-04-23 Thread John Cowan
Mark Davis scripsit:

 ISO 3066 has *demonstrated* instability, because they remove codes,
 then reuse those codes for different entities. It'd be like our removing
 a character, then later putting a different character in that spot*.

That's ISO 3166, of course, not RFC 3066.

-- 
Eric Raymond is the Margaret Mead               John Cowan
of the Open Source movement.                    [EMAIL PROTECTED]
        --Bruce Perens,                         http://www.ccil.org/~cowan
          some years ago                        http://www.reutershealth.com



Re: Common Locale Data Repository Project

2004-04-22 Thread Philippe Verdy
From: Rick McGowan [EMAIL PROTECTED]
 The Unicode Consortium announced today that it will be hosting the Common
 Locale Data Repository project, providing key building blocks for software
 to support the world's languages.

 For more information and links to the project pages, please see:

 http://www.unicode.org/press/press_release-cldr.html

Is that a contribution of the Unicode Consortium to the OpenI18n.org project
(former li18nux.org, maintained with most help from the FSF), or a decision to
make the OpenI18n.org project be more open by pushing it to a more visible
standard?

In that case, I'm surprised to see that the preliminary pages on
Unicode.org's CLDR project define it as a UTS (Standard) when it is a revision
of the previously published release 1.0 of LDML, plus the repository, which is
still hosted in IBM's ICU project repository...

Some confusion will occur for now if the CLDR pages reference a UTS (standard)
rather than a UTR, which it should still be now, until there's a final approval
as a standard (don't forget the Microsoft vote here, as it is campaigning a lot
against Linux, which was the base platform from which the Openi18n.org project
was born). Also, the only certified platform for Openi18n.org is RedHat, a Linux
platform...

Will Microsoft endorse this addition into the domain of Unicode.org? I hope so,
if this can help improve interoperability of platforms in this domain. I also
hope that IBM will continue its wonderful support for the CLDR collection of data
for the repository, and that Microsoft and others will contribute too, to make
this important repository a key element for the convergence of platforms.

Maybe this collaborative and richer standard will lead to the final approval
of the unfinished ISO 3066 standard, which developers and users have wanted for
so long...

What will happen to the discussion lists on openi18n.org? Will it be as easy to
contribute locale data or to submit bug reports as it was in the past? I'm sure
that the Unicode subcommittee that will take charge of the CLDR will need a new
policy to accept new members who also use their own technical solutions.

At least I see a good point here if Openi18n.org merges with Unicode's goals:
Unicode now has a concrete application of its standard (for example, the CLDR
will contain what has always been missing in Unicode: a clear definition of its
usage with concrete languages and locales, so Unicode will not ignore the
specific issues that come with some languages).




Re: Common Locale Data Repository Project

2004-04-22 Thread Ernest Cline


 From: Philippe Verdy [EMAIL PROTECTED]

 From: Rick McGowan [EMAIL PROTECTED]
  The Unicode® Consortium announced today that it will be hosting the
  Common Locale Data Repository project, providing key building blocks
  for software to support the world's languages.

 Is that a contribution of the Unicode Consortium to the OpenI18n.org project
 (former li18nux.org, maintained with most help from the FSF), or a decision to
 make the OpenI18n.org project be more open by pushing it to a more visible
 standard?

 In that case, I'm surprised to see that the preliminary pages on the
 Unicode.org's CLDR project defines it as a UTS (Standard) when it is a revision
 of a previously published released 1.0 of LDML, plus the repository which is
 still hosted in the IBM's ICU project repository...

Given its pre-Unicode history, I'd say that it clearly fits within the realm of
a UTS.  As such, Microsoft or any other vendor is free to ignore or support
it as much as they wish as its impact upon Unicode per se is none.  For me,
the interesting thing to see will be how it affects ECMAScript.  For a long time,
several of its functions have reserved, but not made use of, a locale argument.
If this standard takes off, ECMAScript may finally have something to use in its
next version, whatever that ends up being.

However, a bigger question emerges with the release of the draft version
of UTS 35.  What happened to TR 33 and TR 34?  Indeed, what are they?
Something must be at least tentatively planned for those numbers, but
there isn't anything available publicly at least.





Re: Common Locale Data Repository Project

2004-04-22 Thread Hideki Hiura - [EMAIL PROTECTED]
 From: Philippe Verdy [EMAIL PROTECTED]
 Is that a contribution of the Unicode Consortium to the OpenI18n.org
 project (former li18nux.org, maintained with most help from the
 FSF), or a decision to make the OpenI18n.org project be more open by
 pushing it to a more visible standard?

More on the latter, but slightly different. We believe it would be
good for both the open source community and the commercial IT industry that we
transfer (at least a part of) the project to the Unicode Consortium, after
hearing concerns about the difficulty some commercial companies have in
joining the project due to the strong perception of OpenI18N.org as
opensource/Linux advocates, even though CLDR project is not
specifically bound to Linux.

We hope this transfer will attract further participation from a wider
audience.

Regarding the confusion, I have to say it is to be expected, since the project
is still in transition (for example, the OpenI18N.org side has not
finished the necessary procedures to finalize this, so OpenI18N.org does
not have a press release statement ready yet - this announcement is a
little too early). I guess it will all be sorted out as time goes by.

--
[EMAIL PROTECTED],OpenI18N.org,li18nux.org,unicode.org,sun.com} 
Chair, OpenI18N.org/The Free Standards Group  http://www.OpenI18N.org
Architect/Sr. Staff Engineer, Sun Microsystems, Inc, USA   eFAX: 509-693-8356




Re: Common Locale Data Repository Project

2004-04-22 Thread Kenneth Whistler

 However, a bigger question emerges with the release of the draft version
 of UTS 35.  What happened to TR 33 and TR 34?  Indeed, what are they?
 Something must be at least tentatively planned for those numbers, but
 there isn't anything available publicly at least.

Working drafts of some material that may (and should) end up as
UTR's eventually.

UTR numbers are assigned sequentially, and not all documents
progress with equal speed. When UTR's 32, 33, and 34 progress to
the point where there is consensus that they are in good
enough states to open them for general public comment as
public drafts, they will, in due time, get posted along
with the other drafts.

As the CLDR documentation mentions, UTS #35 is being moved along
particularly quickly, since it is effectively an inherited
specification from another project. It is already quite mature.

--Ken




RE: Common Locale Data Repository Project

2004-04-22 Thread Peter Constable
 due to the strong perception of OpenI18N.org as
 opensource/Linux advocates, even though CLDR project is not
 specifically bound to Linux.

It is hard to look at OpenI18N.org's spec and not get the impression
that all of that group's projects are bound to some flavour of Unix.
The Scope clauses for several sections are specifically expressed in
terms of Unix-related implementations (e.g. having the scope for
rendering requirements expressed as what is needed for X Window).

And even if a section isn't scoped specifically in terms of a
Unix-derived platform, it may specify requirements that are explicitly
related to Unix implementations (e.g. that base libraries must support
POSIX i18n environment variables).

CLDR doesn't have to be tied to any particular platform -- after all,
it's just a collection of data. But I don't think you can honestly say
that OpenI18N isn't tied to a particular family of platforms. Or, at
least, I can say that when I last looked at the OpenI18N site, it sure
looked like it was tied to a particular family of platforms.



Peter Constable