On Wed, 2010-04-21 at 12:11 +0200, Stephan Bergmann wrote:
> On 04/21/10 10:10, Caolán McNamara wrote:
> > So..., how about we adopt a BCP-47 based approach. i.e.
> >
> > a) Where we are currently describing locales as a string in "iso-format"
> > we use BCP-47. Currently valid locale strings get to remain valid.
>
> "get to remain valid": Is it the case that all currently valid locale
> strings happen to adhere to the BCP-47 restrictions, so would
> automatically be valid BCP-47 strings.

Well, there is one problem. aImplIsoNoneStdLangEntries in
i18npool/source/isolang/isolang.cxx has...

    { LANGUAGE_SERBIAN_LATIN,    "sr", "latin"    },
    { LANGUAGE_SERBIAN_CYRILLIC, "sr", "cyrillic" },
    { LANGUAGE_AZERI_LATIN,      "az", "latin"    },
    { LANGUAGE_AZERI_CYRILLIC,   "az", "cyrillic" },

Considering the way the various tables work, that means there is one
combination, language "az" with country "cyrillic", which has escaped
out into the file format as fo:language="az" fo:country="cyrillic", so
az-cyrillic would have to be accepted in addition, though it's not valid
BCP-47.

Given that, it makes sense to continue to accept as input the other
entries in the above table, plus aImplIsoNoneStdLangEntries2 and
aImplOtherEntries, as acceptable input for an "iso-string" (though they
were never generated as output). So BCP-47 + some extra grandfathered
tags.

> Is the requirement "that the first tag entry *must* be a Script Code to
> ensure forward and backward conversion to an unambiguous BCP-47 string"
> really necessary? A <langtag> w/o <language> and <region> parts would be
>
>    [script] *("-" variant) *("-" extension) ["-" privateuse]
>
> where the syntactic forms allowed for <script> are disjoint of those
> allowed for <variant>, <extension>, and <privateuse>.

The need for conversion from a Unix locale string in rtl to a rtl_Locale
and back again is what bothers me. Following the above protects against
converting an unknown existing or future Unix locale string into a
rtl_Locale which, if used anywhere following this convention, gives
incorrect results.

e.g., there are some glibc locales like zh_TW.euctw, so LANG=zh_TW.Euctw
is acceptable, which currently will give a rtl_Locale of...

    Language = zh
    Country = TW
    Variant = Euctw

If a future iso-15924 adds Euctw as a script code, then there's a
problem.

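To make the round trip concrete, here is a minimal sketch of splitting a Unix locale string into three fields and joining them back together. It is not the code in sal: SimpleLocale, parseUnixLocale and composeUnixLocale are hypothetical names, std::string stands in for the rtl string types, and, unlike the Variant = Euctw shown above, the sketch assumes the variant keeps its leading '.' or '@' so that nothing is lost on the way back.

```cpp
#include <string>

// Hypothetical stand-in for rtl_Locale; not the real sal structure.
struct SimpleLocale {
    std::string language;
    std::string country;
    std::string variant; // assumption: keeps its leading '.' or '@'
};

// Split "ll", "ll_CC", "ll_CC.codeset" or "ll_CC.codeset@modifier".
SimpleLocale parseUnixLocale(const std::string& s) {
    SimpleLocale loc;
    // Everything from the first '.' or '@' onwards is treated as the variant.
    const std::string::size_type v = s.find_first_of(".@");
    const std::string head = s.substr(0, v);
    if (v != std::string::npos)
        loc.variant = s.substr(v);
    // The remaining head is either "language" or "language_COUNTRY".
    const std::string::size_type u = head.find('_');
    loc.language = head.substr(0, u);
    if (u != std::string::npos)
        loc.country = head.substr(u + 1);
    return loc;
}

// Rebuild the string that parseUnixLocale was given.
std::string composeUnixLocale(const SimpleLocale& loc) {
    std::string s = loc.language;
    if (!loc.country.empty())
        s += "_" + loc.country;
    s += loc.variant; // separator is already part of the variant
    return s;
}
```

With this sketch, zh_TW.Euctw parses to language "zh", country "TW" and variant ".Euctw", and composing that again yields the original string. Nothing in the variant field marks it as a codeset rather than a script code, which is the ambiguity described above.
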
The other consideration is that if you enforce a script code as the
first tag in a Variant, it becomes trivial to pull out the script tag
from a Variant string with a two-liner without any other processing,
e.g.

    sal_Int32 nIndex = 0;
    rtl::OUString aScriptSubtag = rVariant.getToken(0, '-', nIndex);

> > Parsers that want to convert a Unix Locale into the above structure can
> > take, e.g.
> > aa_er.ut...@saaho
> >
> > and make it into
> >
> > Language = aa
> > Country = ER
> > Variant = -.ut...@saaho
> >
> > to give a reversible scheme where the original Unix Locale string can be
> > reconstructed
>
> Is reversibility necessary here? I ask because this makes the Variant
> contain data that does not adhere to the above BCP-47 <langtag> w/o
> <language> and <region> parts.

I feel it is, because if we look into sal/osl/unx/nlsupport.c, e.g. at
osl_getTextEncodingFromLocale, we use _compose_locale there to
regenerate from a rtl_Locale a string to pass to setlocale(LC_CTYPE, ...),
and there are some other similar examples in there. So it looks to me
that a rtl_Locale that originates from _parse_locale on a given string
has to be convertible back to that string in order to be useful with
setlocale.

C.