Re: [dev] BCP-47 based proposal for "IsoStrings", Locale Variants and describing languages ?

Caolán McNamara Thu, 22 Apr 2010 09:53:00 -0700

On Thu, 2010-04-22 at 17:03 +0200, Eike Rathke wrote:

> Actually the aImplIsoNoneStdLangEntries can never be the result of
> a conversion as valid ISO code combinations exist for all LangIDs in
> aImplIsoLangEntries. The corresponding code in both
> MsLangId::convertLanguageToIsoNames() methods is moot. I don't recall if
> it was ever used that way since we write XML, but I doubt it.


> Accepting makes of course sense, but a conversion will always result
> in the corresponding ISO codes.

Set some text to Azeri (Cyrillic) in writer with 3.2 and save as .odt,
the result is <style:text-properties fo:language="az"
fo:country="cyrillic"/>

> > The need for conversion from a Unix locale string in rtl to a
> > rtl_Locale and back again is what bothers me, e.g. de_BE.Euctw
> > which currently will give
> > rtl_Locale of...
> > Language = de
> > Country = BE
> > Variant = Euctw
> > 
> > If a future iso-15924 adds Euctw as a script code, then there's a
> > problem.
> 
> They should not, ISO 15924 alpha is defined to be a 4 letter code.
> Anyway, a script code in the BCP47 context would have to be registered
> with IANA, and they certainly (hopefully..) would reject a non-4-letter
> code.

Whoops, right, I used an invalid 5-letter example. Looking instead for
4-letter encoding names that could plausibly show up in a Unix locale,
take LANG=ja_JP.Sjis as a better example.
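
To make the concern concrete, here is a minimal sketch in plain C++. It uses std::string rather than rtl::OUString, and the names SimpleLocale, parseUnixLocale and looksLikeScriptCode are hypothetical, not existing rtl code. It only assumes the split described above (codeset after '.' lands in Variant) and drops any "@modifier" suffix for brevity.

```cpp
#include <string>

// Hypothetical stand-in for rtl_Locale, for illustration only.
struct SimpleLocale {
    std::string language;
    std::string country;
    std::string variant;
};

// Split "lang_COUNTRY.codeset": the codeset after '.' ends up in Variant.
// Simplification: an "@modifier" suffix is discarded.
SimpleLocale parseUnixLocale(const std::string& s) {
    SimpleLocale loc;
    std::string rest = s.substr(0, s.find('@'));
    std::string::size_type dot = rest.find('.');
    if (dot != std::string::npos) {
        loc.variant = rest.substr(dot + 1);
        rest.erase(dot);
    }
    std::string::size_type us = rest.find('_');
    loc.language = rest.substr(0, us);
    if (us != std::string::npos)
        loc.country = rest.substr(us + 1);
    return loc;
}

// True if s has the shape of an ISO 15924 alpha code: exactly four ASCII letters.
bool looksLikeScriptCode(const std::string& s) {
    if (s.size() != 4)
        return false;
    for (char c : s)
        if (!((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')))
            return false;
    return true;
}
```

With this, "Euctw" from de_BE.Euctw fails the shape check, while "Sjis" from ja_JP.Sjis passes it, so a Variant holding a bare codeset cannot be told apart from a script code by shape alone.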

> > The other consideration is that if you enforce a script code as the
> > first tag in a Variant, it becomes trivial to pull out the script tag
> > from a Variant string with a two liner without any other processing,
> > e.g.
> > 
> > sal_Int32 nIndex = 0;
> > rtl::OUString aScriptSubtag = rVariant.getToken(0, '-', nIndex);
> 
> That's indeed neat. But again, see my previous mail, not all BCP47 tags
> would fulfill this requirement if they contained extlang subtags.

I had sort of imagined something like zh-cmn-Latn-CN would appear as
Language = zh-cmn
Country = CN 
Variant = Latn
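
A rough sketch of that mapping, in plain C++ with std::string. MappedLocale, allAlpha and mapBcp47 are hypothetical names, and the subtag recognition is deliberately simplified: it goes by length only, ignores 3-digit region codes, and ignores variant, extension and private-use subtags.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type, for illustration only.
struct MappedLocale {
    std::string language;  // primary language plus any extlang subtags
    std::string country;   // region subtag
    std::string variant;   // script subtag only, in this sketch
};

// True if s is non-empty and consists of letters only.
static bool allAlpha(const std::string& s) {
    for (unsigned char c : s)
        if (!std::isalpha(c))
            return false;
    return !s.empty();
}

MappedLocale mapBcp47(const std::string& tag) {
    // Split the tag on '-'.
    std::vector<std::string> subtags;
    std::istringstream in(tag);
    for (std::string t; std::getline(in, t, '-'); )
        subtags.push_back(t);

    MappedLocale loc;
    std::size_t i = 0;
    if (i < subtags.size())
        loc.language = subtags[i++];
    // extlang: up to three 3-letter subtags directly after the primary language
    for (int n = 0; n < 3 && i < subtags.size()
                    && subtags[i].size() == 3 && allAlpha(subtags[i]); ++n)
        loc.language += "-" + subtags[i++];
    // script: one 4-letter subtag
    if (i < subtags.size() && subtags[i].size() == 4 && allAlpha(subtags[i]))
        loc.variant = subtags[i++];
    // region: one 2-letter subtag (3-digit codes are not handled here)
    if (i < subtags.size() && subtags[i].size() == 2 && allAlpha(subtags[i]))
        loc.country = subtags[i++];
    return loc;
}
```

Under these assumptions, mapBcp47("zh-cmn-Latn-CN") yields Language "zh-cmn", Country "CN" and Variant "Latn", and a plain "de-BE" leaves Variant empty.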

> As a quick solution I'd come up with:
> 
> * Divide Variant into three subfields, separated by ':' colon.
> * First subfield is either a 4 letter script code followed by '-', or
>   only '-' to indicate absence of script.
>   * This enables the extraction with rVariant.getToken(0, '-', nIndex).
> * Second subfield is a full BCP47 string in case Language is "x-bcp47"
>   or a BCP47 variant is involved, otherwise empty.
> * Third subfield is the _full_ Unix locale string, or empty.
>   * _compose_locale() could extract this with
>     rVariant.getToken(2, ':', nIndex)
> * Variant can be empty.
>   * Extraction of script code still delivers a null string.
>   * _compose_locale() in this case will have to concatenate
>     Language-Country as it currently does.

Sounds good.
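
For reference, a minimal sketch of how that three-subfield Variant could be composed and taken apart. This is plain C++ with std::string; getToken here is a hypothetical free-function stand-in for the rtl::OUString::getToken() calls quoted above, and composeVariant is likewise an assumed helper, not existing code. The second subfield is filled unconditionally in this sketch, whereas the proposal only fills it when Language is "x-bcp47" or a BCP47 variant is involved.

```cpp
#include <string>

// Hypothetical stand-in for OUString::getToken(): returns the n-th field
// (0-based) of s split on sep, or an empty string if there is no such field.
std::string getToken(const std::string& s, int n, char sep) {
    std::string::size_type start = 0;
    for (int i = 0; i < n; ++i) {
        std::string::size_type pos = s.find(sep, start);
        if (pos == std::string::npos)
            return std::string();
        start = pos + 1;
    }
    std::string::size_type end = s.find(sep, start);
    return s.substr(start,
                    end == std::string::npos ? std::string::npos : end - start);
}

// Compose a Variant as "<script>-:<bcp47>:<unix locale>".
// Any part may be empty; if all are empty, the Variant itself is empty.
std::string composeVariant(const std::string& script,
                           const std::string& bcp47,
                           const std::string& unixLocale) {
    if (script.empty() && bcp47.empty() && unixLocale.empty())
        return std::string();
    return script + "-:" + bcp47 + ":" + unixLocale;
}
```

For example, composeVariant("", "", "ja_JP.Sjis") gives "-::ja_JP.Sjis": getToken(v, 0, '-') is then the empty script, and getToken(v, 2, ':') returns the full Unix locale string. An empty Variant also yields an empty script, as the list above requires.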

> * If only a BCP47 variant is involved, with or without script, we could
>   add the variant to the first subfield, having
>   '-' [script] '-' [variant]
>   for easier extraction with rVariant.getToken(1, '-', nIndex).

Sounds like gilding the lily. Do we really need easy extraction of that?
And anyway, can't there be multiple BCP47 variant tags, as opposed to
only one script tag?

> And, maybe, using such a Locale with Java might lead to unpredictable
> results, I don't know.

It would definitely help if anyone knew what on earth the Java Variant
field ever gets used for.

C.


