On Wed, 2010-04-21 at 12:11 +0200, Stephan Bergmann wrote:
> On 04/21/10 10:10, Caolán McNamara wrote:
> > So..., how about we adopt a BCP-47 based approach. i.e.
> > 
> > a) Where we are currently describing locales as a string in "iso-format"
> > we use BCP-47. Currently valid locale strings get to remain valid.
> 
> "get to remain valid":  Is it the case that all currently valid locale 
> strings happen to adhere to the BCP-47 restrictions, and so would 
> automatically be valid BCP-47 strings?

Well, there is one problem. aImplIsoNoneStdLangEntries in
i18npool/source/isolang/isolang.cxx has...

{ LANGUAGE_SERBIAN_LATIN,               "sr", "latin"    },
{ LANGUAGE_SERBIAN_CYRILLIC,            "sr", "cyrillic" },
{ LANGUAGE_AZERI_LATIN,                 "az", "latin"    },
{ LANGUAGE_AZERI_CYRILLIC,              "az", "cyrillic" },

Considering the way the various tables work, that means there is one
combination, language "az" with country "cyrillic", which has escaped
out into the file format as fo:language="az" fo:country="cyrillic".
So az-cyrillic would have to be accepted in addition, though it's not
valid BCP-47.

Given that, it makes sense to continue to accept as input the other
entries in the above table, plus aImplIsoNoneStdLangEntries2 and
aImplOtherEntries, as acceptable input for an "iso-string" (though they
were never generated as output). So: BCP-47 + some extra grandfathered
tags.
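
As a rough sketch of what "BCP-47 + grandfathered" acceptance could look
like for a language/country pair: this uses plain std::string rather than
the real rtl types, all function names are made up for illustration, the
table holds only the four entries quoted above, and the BCP-47 check is
reduced to a 2-3 letter language plus an optional 2-letter or 3-digit
region.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical table type; only the four pairs quoted from isolang.cxx are listed.
struct GrandfatheredPair { const char* language; const char* country; };

const GrandfatheredPair kGrandfathered[] = {
    {"sr", "latin"}, {"sr", "cyrillic"}, {"az", "latin"}, {"az", "cyrillic"}
};

bool allAlpha(const std::string& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(),
        [](unsigned char c) { return std::isalpha(c) != 0; });
}

bool allDigit(const std::string& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Simplified BCP-47 shape check (an assumption of this sketch, not a full
// validator): language is 2-3 letters, region is empty, 2 letters or 3 digits.
bool isSimpleBcp47Pair(const std::string& lang, const std::string& country) {
    const bool langOk = (lang.size() == 2 || lang.size() == 3) && allAlpha(lang);
    const bool regionOk = country.empty()
        || (country.size() == 2 && allAlpha(country))
        || (country.size() == 3 && allDigit(country));
    return langOk && regionOk;
}

// Exact, case-sensitive match against the legacy table (a simplification).
bool isGrandfatheredPair(const std::string& lang, const std::string& country) {
    for (const GrandfatheredPair& entry : kGrandfathered) {
        if (lang == entry.language && country == entry.country)
            return true;
    }
    return false;
}

bool isAcceptedIsoPair(const std::string& lang, const std::string& country) {
    return isSimpleBcp47Pair(lang, country) || isGrandfatheredPair(lang, country);
}

// Usage: the legacy pair is accepted, an arbitrary long "country" is not.
void checkAcceptedPairs() {
    assert(isAcceptedIsoPair("az", "cyrillic"));
    assert(!isAcceptedIsoPair("az", "klingon"));
}
```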

> Is the requirement "that the first tag entry *must* be a Script Code to 
> ensure forward and backward conversion to an unambiguous BCP-47 string" 
> really necessary?  A <langtag> w/o <language> and <region> parts would be
> 
>    [script] *("-" variant) *("-" extension) ["-" privateuse]
> 
> where the syntactic forms allowed for <script> are disjoint of those 
> allowed for <variant>, <extension>, and <privateuse>.

The need for conversion from a Unix locale string in rtl to a rtl_Locale
and back again is what bothers me. Following the above protects against
converting an unknown existing or future Unix locale string into a
rtl_Locale which, if used anywhere following this convention, gives
incorrect results. E.g. there are some glibc locales like zh_TW.euctw, so
LANG=zh_TW.Euctw is acceptable,

which currently will give
rtl_Locale of...
Language = zh
Country = TW
Variant = Euctw

If a future ISO 15924 revision adds Euctw as a script code, then there's
a problem, because that Variant would be misread as carrying a script.

The other consideration is that if you enforce a script code as the
first tag in a Variant, it becomes trivial to pull out the script tag
from a Variant string with a two-liner, without any other processing,
e.g.

sal_Int32 nIndex = 0;
rtl::OUString aScriptSubtag = rVariant.getToken(0, '-', nIndex);
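
For anyone who wants to try this outside the tree, roughly the same thing
with plain std::string (function names are hypothetical; the only check
applied is the four-letter shape of an ISO 15924 code, not a lookup in the
registry). An empty first token is taken to mean "no script":

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Everything before the first '-' (or the whole string if there is none).
std::string firstSubtag(const std::string& variant) {
    return variant.substr(0, variant.find('-'));
}

// Shape check only: exactly four ASCII letters.
bool looksLikeScriptCode(const std::string& subtag) {
    return subtag.size() == 4 && std::all_of(subtag.begin(), subtag.end(),
        [](unsigned char c) { return std::isalpha(c) != 0; });
}

// Usage: a leading script subtag is found; a leading '-' yields none.
void checkFirstSubtag() {
    assert(firstSubtag("Latn-valencia") == "Latn");
    assert(firstSubtag("-.euctw").empty());
}
```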

> > Parsers that want to convert a Unix Locale into the above structure can
> > take, e.g.
> > aa_er.ut...@saaho
> > 
> > and make it into
> > 
> > Language = aa
> > Country = ER
> > Variant = -.ut...@saaho
> > 
> > to give a reversible scheme where the original Unix Locale string can be
> > reconstructed
> 
> Is reversibility necessary here?  I ask because this makes the Variant 
> contain data that does not adhere to the above BCP-47 <langtag> w/o 
> <language> and <region> parts.

I feel it is, because if we look into sal/osl/unx/nlsupport.c, e.g. at
osl_getTextEncodingFromLocale, we use _compose_locale there to regenerate
from a rtl_Locale a string to pass to setlocale(LC_CTYPE, ...), and there
are some other similar examples in there. So it looks to me that a
rtl_Locale that originates from _parse_locale on a given string has to be
convertible back to that string in order to be useful with setlocale.
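
To make the round trip concrete, a minimal sketch under the scheme
proposed above, where everything after the country is kept verbatim behind
a leading '-' in the Variant. SketchLocale, parseUnixLocale and
composeUnixLocale are hypothetical stand-ins, not the real rtl_Locale,
_parse_locale or _compose_locale:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for rtl_Locale, using std::string for simplicity.
struct SketchLocale {
    std::string language;
    std::string country;
    std::string variant;
};

// Split "ll_CC<rest>"; <rest> (codeset and/or modifier, starting at the
// first '.' or '@') is stored unchanged after a leading '-', i.e. with an
// empty script slot.
SketchLocale parseUnixLocale(const std::string& unixLocale) {
    SketchLocale loc;
    const std::string::size_type restPos = unixLocale.find_first_of(".@");
    const std::string head = unixLocale.substr(0, restPos);
    const std::string::size_type underscore = head.find('_');
    loc.language = head.substr(0, underscore);
    if (underscore != std::string::npos)
        loc.country = head.substr(underscore + 1);
    if (restPos != std::string::npos)
        loc.variant = "-" + unixLocale.substr(restPos);
    return loc;
}

// Rebuild the original string by dropping the leading '-' again.
std::string composeUnixLocale(const SketchLocale& loc) {
    std::string result = loc.language;
    if (!loc.country.empty())
        result += "_" + loc.country;
    if (!loc.variant.empty() && loc.variant[0] == '-')
        result += loc.variant.substr(1);
    return result;
}

// Usage: the string handed to setlocale is recovered unchanged.
void checkRoundTrip() {
    const SketchLocale loc = parseUnixLocale("zh_TW.euctw");
    assert(loc.variant == "-.euctw");
    assert(composeUnixLocale(loc) == "zh_TW.euctw");
}
```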

C.

