Hi Caolan,

On Wednesday, 2010-04-21 09:10:51 +0100, Caolan McNamara wrote:

> I note here that the rtl Locale parser converts a Unix locale string
> into a rtl::Locale along the lines of
>
> Language = language, e.g. sr
> Country = Country, e.g. RS
> Variant = all_the_rest_of_the_string, e.g. .ut...@latin
>
> and depends on being able to reverse this conversion back to the
> original Unix locale string, i.e. it needs to rebuild sr_rs.ut...@latin
> from its rtl::Locale structure

And there it already fails because most code currently silently drops
the Variant field and cares only about Language and Country. With the
additional hurdle that most core code uses MS-LangIDs to attribute
a locale and thus information like .UTF-8 would be lost anyway.
For transport in an rtl Locale the approach of course works.

> In our xml format afaics, where the com::sun::star::lang::Locale is
> basically the structure that backs it, we have just "language" and
> "country" tags.

ODF 1.2 introduces the attributes fo:script, number:script and
table:script.

> So..., how about we adopt a BCP-47 based approach. i.e.

See also http://www.openoffice.org/issues/show_bug.cgi?id=109846

> a) Where we are currently describing locales as a string in "iso-format"
> we use BCP-47. Currently valid locale strings get to remain valid.
>
> b) Where we use a Locale structure, Language and Country stay the same,
> but we specify a format for the remaining Variant field where it is
> BCP-based sequence of tags separated by '-'. The Variant field becomes
> the equivalent BCP-47 locale string for the totality, minus the language
> and region tags, plus that the first tag entry *must* be a Script Code
> to ensure forward and backward conversion to an unambiguous BCP-47
> string. In this scheme the script tag at the start of the Variant can
> (and must) be empty to denote the default script.
>
> c) Where we use "language" and "country" codes in our xml format we add
> a "language-tags" attribute which maps directly to that Variant field.

In addition to the *:script attributes, ODF 1.2 already introduces
number:rfc-language-tag, table:rfc-language-tag and
style:rfc-language-tag to store BCP47 language tags if a locale can't be
described as a combination of *:language, *:country and *:script.


> i.e. sr-Latn-RS becomes
>
> Language = sr
> Country = RS
> Variant = Latn
>
> i.e. sr-Latn-RS-whatever-foo becomes
>
> Language = sr
> Country = RS
> Variant = Latn-whatever-foo
>
> a BCP-47 string of de-DE-1901 becomes
>
> Language = de
> Country = DE
> Variant = -1901

With the leading '-' indicating the default script?
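
To make sure I read the proposed scheme correctly, here is a minimal
sketch in Python of the split and the reverse join, assuming only simple
tags of the form language[-script][-region][-rest]. The function names
split_tag and join_tag are made up for illustration and are not existing
OOo API.

```python
def split_tag(tag):
    """Split a simple BCP-47 tag into (Language, Country, Variant).

    Assumption: the tag has the shape language[-script][-region][-rest];
    extlang subtags and private-use tags are not handled.
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    rest = subtags[1:]

    # A 4-letter alphabetic subtag right after the language is the script.
    script = ""
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        script = rest.pop(0).title()

    # A 2-letter or 3-digit subtag that follows is the region.
    country = ""
    if rest and ((len(rest[0]) == 2 and rest[0].isalpha())
                 or (len(rest[0]) == 3 and rest[0].isdigit())):
        country = rest.pop(0).upper()

    # Script comes first in Variant; it is empty for the default script,
    # which yields the leading '-' when further subtags follow.
    variant = "-".join([script] + rest) if (script or rest) else ""
    return language, country, variant


def join_tag(language, country, variant):
    """Rebuild the BCP-47 tag from the three fields."""
    script, _, rest = variant.partition("-")
    parts = [language, script, country] + (rest.split("-") if rest else [])
    return "-".join(p for p in parts if p)
```

With this reading, de-DE-1901 indeed yields Variant "-1901", and joining
the three fields again gives back de-DE-1901.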

> de-DE remains
>
> Language = de
> Country = DE
> Variant =
>
> Parsers that want to convert a Unix Locale into the above structure can
> take, e.g.
> aa_er.ut...@saaho
>
> and make it into
>
> Language = aa
> Country = ER
> Variant = -.ut...@saaho

Though this would not form a valid BCP47 tag when reconstructing.

> to give a reversible scheme where the original Unix Locale string can be
> reconstructed, and for Unix Locale strings which hint at the script in
> use, we can parse sr_rs.ut...@latin into
>
> Language = sr
> Country = RS
> Variant = latn-.ut...@latin
>
> and remain reversible into the original Unix Locale string,

Why does it need to be reversible? Without that requirement we could
drop information after Language-Country starting with '.', leaving

Language = sr
Country = RS
Variant = Latn
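
A rough sketch in Python of such a non-reversible parse, to show what
I mean. The modifier-to-script table, the function name and the example
input spelling are assumptions for illustration only, not existing code.

```python
import re

# Hypothetical mapping from Unix locale modifiers to ISO 15924 script
# codes; a real table would need to be defined and maintained.
MODIFIER_TO_SCRIPT = {"latin": "Latn", "cyrillic": "Cyrl"}

# language[_COUNTRY][.codeset][@modifier]
_UNIX_LOCALE = re.compile(
    r"^(?P<lang>[A-Za-z]{2,3})"
    r"(?:_(?P<country>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[^@]*))?"
    r"(?:@(?P<modifier>.+))?$"
)


def parse_unix_locale(locale):
    """Return (Language, Country, Variant), dropping the codeset.

    The conversion is deliberately lossy: the codeset and any modifier
    that does not map to a script are discarded.
    """
    m = _UNIX_LOCALE.match(locale)
    if m is None:
        raise ValueError("not a recognised Unix locale string: %r" % locale)
    language = m.group("lang").lower()
    country = (m.group("country") or "").upper()
    modifier = (m.group("modifier") or "").lower()
    variant = MODIFIER_TO_SCRIPT.get(modifier, "")
    return language, country, variant


# Illustrative input; the exact spelling is an assumption.
print(parse_unix_locale("sr_RS.UTF-8@latin"))
```

A modifier without a script mapping, such as @saaho, simply ends up with
an empty Variant in this sketch.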

We should also prepare for transport of full BCP47 tags (see further
down). Having this mix of script and Unix locale in the Variant field
somewhat makes me shudder. I'd rather use the Variant here such that if
the content starts with a capital ASCII letter and is 4 characters long,
it is an ISO 15924 script code; else it is something different, to be
defined.
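
The check for a script code in the Variant could be as simple as the
following Python sketch. The function name is hypothetical, and the
sketch implements only the rule as stated; it does not validate the
value against the ISO 15924 registry.

```python
import string


def variant_is_script(variant):
    """True if Variant looks like a script code under the stated rule:
    exactly 4 characters, the first a capital ASCII letter.

    Assumption: no lookup against the actual ISO 15924 code list.
    """
    return len(variant) == 4 and variant[0] in string.ascii_uppercase


for value in ("Latn", "Cans", "-1901", "latn", ""):
    kind = "script" if variant_is_script(value) else "something else"
    print(repr(value), "->", kind)
```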

> and also
> provide a non-null script tag which allows continued conversion from the
> rtl::Locale class to the com::sun::star::lang::Locale one without losing
> script tag information.
>
> The xml format for a style that sets the Language of a paragraph to
> Inuktitut Syllabics Canada could then use an additional language-tags
> attribute, e.g.
>
> <style:text-properties fo:language="iu" fo:country="CA"
> fo:language-tags="Cans"/>

In ODF 1.2 this could be written as

<style:text-properties fo:language="iu" fo:country="CA" fo:script="Cans"/>

> while the "Locales" string of the spellchecker Locales string can use
> BCP-47 format, e.g. support "iu-Cans-CA"
>
> 2. Presumably it would be best to prefer *generating* sh-RS for
> backwards compatibility, even though accepting sr-Latn-RS

Yes. We would need two conversions from MS-LangID then, one for document
storage that generates sh-RS and one for all other run time cases that
generates sr-Latn-RS. I'd prefer to switch to storing sr-Latn-RS in
a later release though, as most non-OOo applications probably don't
identify sh-RS in the sense we're using it.

> 3. comphelper::Locale is very little used, it looks like a good idea to
> move uses of it over to com::sun::star::lang::Locale and convert it to
> some calls that operate on that instead and/or merge the unused bits
> over to e.g. MSLangId.

Yes, we have too many places dealing with locales.


Future perspective: the syntax of RFC 5646 allows more complicated
language tags, and not all of them can be fitted into Language/Country
fields using ISO 639-2/3 and ISO 3166-1 codes. For these we'd have to use
some notation to indicate that the full BCP47 tag is to be used; having
Language=x-bcp47 and Variant=full_bcp47_string might do. Of course this
would affect all places that simply take the Language/Country fields as
ISO codes.
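
As a sketch of how such an x-bcp47 fallback might look, in Python and
with made-up function names: anything that is not a plain
language[-COUNTRY] tag is carried whole in the Variant. Script handling
is left out here on purpose to keep the fallback itself visible.

```python
import re

# Assumption: only plain "ll[l]" or "ll[l]-CC" tags count as simple.
_SIMPLE_TAG = re.compile(r"^([a-z]{2,3})(?:-([A-Z]{2}))?$")

FULL_TAG_MARKER = "x-bcp47"


def tag_to_locale(tag):
    """Map a BCP47 tag to Language/Country/Variant fields."""
    m = _SIMPLE_TAG.match(tag)
    if m:
        return {"Language": m.group(1),
                "Country": m.group(2) or "",
                "Variant": ""}
    # Anything more complicated travels as the full tag in Variant.
    return {"Language": FULL_TAG_MARKER, "Country": "", "Variant": tag}


def locale_to_tag(locale):
    """Inverse of tag_to_locale."""
    if locale["Language"] == FULL_TAG_MARKER:
        return locale["Variant"]
    return "-".join(p for p in (locale["Language"], locale["Country"]) if p)


print(tag_to_locale("de-DE"))
print(tag_to_locale("zh-cmn-Hans-CN"))
```

Code that takes Language as an ISO code without checking for the marker
would see "x-bcp47" there, which is exactly the impact mentioned above.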

If an extended language subtag (extlang) came into play, the approach of
concatenating Language-Country-Variant wouldn't work anymore if we said
Variant had to start with the 4 letter script code or '-'.

As a near to mid term goal OOo should at least support language
(possibly without extlang subtags), region (country), script and variant
(in the BCP47 context).

Even supporting just BCP47 variants would require the use of
*:rfc-language-tag in ODF, even if Language/Country are valid ISO codes
that are to be written as well.

Btw, anyone interested in BCP47/RFC5646 might want to take a look at the
links provided at http://www.erack.de/bookmarks/D.html#Language_Tags

  Eike

-- 
 OOo/SO Calc core developer. Number formatter stricken i18n transpositionizer.
 SunSign   0x87F8D412 : 2F58 5236 DB02 F335 8304  7D6C 65C9 F9B5 87F8 D412
 OpenOffice.org Engineering at Sun: http://blogs.sun.com/GullFOSS
 Please don't send personal mail to the [email protected] account, which I use for
 mailing lists only and don't read from outside Sun. Use [email protected] Thanks.
