[Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Apache Wiki Wed, 06 Feb 2008 12:04:13 -0800

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

New page:
[[Anchor(Definitions)]]
= Definitions =

canonical language code: The <language> field is two lowercase characters that 
represent the language as defined by [#References ISO-639].

canonical country code: The <COUNTRY> field is two uppercase letters that 
represent the country as defined by [#References ISO-3166].

canonical codeset code: The <CODESET> field is a string describing the encoding 
character set. For our purposes, the codeset is the preferred MIME name of the 
codeset as defined by [#References IANA].

canonical locale name: A complete locale name in the format 
<language>_<COUNTRY>.<CODESET>. Each field uses the canonical representation 
described above. [ex. en_US.ISO-8859-1]

native locale name: The locale name used by the local operating system. [ex. 
English_United States.1252, en]

local locale name: See native locale name.

[[Anchor(Plan)]]
= Plan =

This page relates to the issue described at 
http://issues.apache.org/jira/browse/STDCXX-608. There has been some discussion 
both on and off the dev@ list about how to proceed. This page is here to 
document what has been discussed.

The idea behind the issue is to create some mechanism for querying the list of 
installed locales, selecting those that match given criteria.

[[Anchor(2008_01_28)]]
= Discussion 2008/01/28 =

The idea is to take a regular-expression-like query string, apply brace 
expansion to get several simpler expressions, and then search the list of 
installed locales for matches.

Given a query string 

  {en,fr,*}_{CA,US,FR,CN}.*

we would apply brace expansion to get the following list of expressions

  en_CA.*
  en_US.*
  en_FR.*
  en_CN.*
  fr_CA.*
  fr_US.*
  fr_FR.*
  fr_CN.*
   *_CA.*
   *_US.*
   *_FR.*
   *_CN.*
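
The expansion step above can be sketched in a few lines of C++. This is only 
an illustration, not the implementation discussed on the list: the function 
name expand_braces is made up for this page, and the sketch assumes braces are 
balanced, not nested, and not escaped.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: expand the first {a,b,c} group in `pattern`,
// then recurse so that any later groups are expanded as well.
// Assumes balanced, non-nested, unescaped braces.
std::vector<std::string> expand_braces(const std::string& pattern)
{
    const std::size_t npos = std::string::npos;

    const std::size_t open = pattern.find('{');
    if (open == npos)
        return std::vector<std::string>(1, pattern);  // nothing to expand

    const std::size_t close = pattern.find('}', open);
    assert(close != npos);                            // unbalanced brace

    const std::string prefix = pattern.substr(0, open);
    const std::string body   = pattern.substr(open + 1, close - open - 1);
    const std::string suffix = pattern.substr(close + 1);

    std::vector<std::string> result;
    std::size_t begin = 0;
    for (;;) {
        // take the next comma-separated alternative from the group
        const std::size_t comma = body.find(',', begin);
        const std::string alt =
            body.substr(begin, comma == npos ? npos : comma - begin);

        // expand whatever groups remain in the suffix
        const std::vector<std::string> tails =
            expand_braces(prefix + alt + suffix);
        result.insert(result.end(), tails.begin(), tails.end());

        if (comma == npos)
            break;
        begin = comma + 1;
    }
    return result;
}
```

Calling expand_braces("{en,fr,*}_{CA,US,FR,CN}.*") yields the twelve 
expressions listed above, in the same order.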

Once we have this list of expressions, we would enumerate all of the installed 
locales, and then search through them looking for locale names that match one 
of those regular expressions. The actual matching would be done using 
rw_fnmatch().
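
As a rough sketch of the search step, the code below filters a list of locale 
names against a list of expanded expressions. It uses the POSIX fnmatch() 
function as a stand-in for rw_fnmatch(), whose exact signature is not given on 
this page; the function name match_locales is hypothetical.

```cpp
#include <fnmatch.h>   // POSIX fnmatch(), used here in place of rw_fnmatch()
#include <string>
#include <vector>

// Hypothetical helper: return every name in `installed` that matches
// at least one of the shell-style `patterns`.
std::vector<std::string>
match_locales(const std::vector<std::string>& patterns,
              const std::vector<std::string>& installed)
{
    std::vector<std::string> matches;
    for (const std::string& name : installed) {
        for (const std::string& pat : patterns) {
            // fnmatch() returns 0 when the string matches the pattern
            if (fnmatch(pat.c_str(), name.c_str(), 0) == 0) {
                matches.push_back(name);
                break;   // report each locale only once
            }
        }
    }
    return matches;
}
```

Note that an expression such as en_US.* only matches names that contain a 
codeset field, so a bare name like en_US would not be selected by it.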

Every platform has a unique list of locales available. For example, Windows 
systems use 'English' as a language name, but most *nix systems use the 
canonical 'en' or in some cases 'EN'. This problem exists for the language, 
country and 
codeset fields of the locale name. To deal with this, we need to provide a 
mapping between the native names and the canonical names that we plan to use in 
the query string. It has been suggested that the mapping give a list of all 
known native locale names for each canonical locale name. The current 
suggestion is to provide one table with a list of all native locale names and 
the canonical names for all platforms. For efficiency, it was decided that this 
table include other information that may be useful such as MB_CUR_MAX for each 
of those locales.

When we enumerate the list of installed locales we would use this data to map 
the locally installed locale name to the canonical locale name. For lookup 
purposes we use the canonical name, and once we've found a match, we provide 
the native locale name back to the user.
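
One way to picture this lookup is sketched below. The LocaleEntry layout and 
the find_native name are assumptions made for illustration; the actual table 
format has not been decided. Matching is again done with POSIX fnmatch() as a 
stand-in for rw_fnmatch().

```cpp
#include <fnmatch.h>   // POSIX fnmatch(), stand-in for rw_fnmatch()
#include <string>
#include <vector>

// Hypothetical table row for one installed locale.
struct LocaleEntry {
    std::string native;      // name understood by the local operating system
    std::string canonical;   // <language>_<COUNTRY>.<CODESET>
    int         mb_cur_max;  // longest multibyte character, in bytes
};

// Match `pattern` against the canonical names, but hand back the
// native names, since those are what setlocale() needs.
std::vector<std::string>
find_native(const std::vector<LocaleEntry>& table, const std::string& pattern)
{
    std::vector<std::string> result;
    for (const LocaleEntry& entry : table) {
        if (fnmatch(pattern.c_str(), entry.canonical.c_str(), 0) == 0)
            result.push_back(entry.native);
    }
    return result;
}
```

The caller only ever writes canonical names in the query and never has to know 
how the local system spells them.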

[[Anchor(Issues)]]
= Issues =

Now that I'm collecting the list of installed locales to build up this table, 
I've noticed a few issues with the name mapping. One issue is that a single 
native locale name may map to a different canonical locale name on different 
platforms. For example, `es_BO' maps to `es_BO.ISO-8859-15' on AIX, but it maps 
to `es_BO.ISO-8859-1' on Linux and SunOS. Another issue is that the data 
associated with each of the canonical locales, like MB_CUR_MAX, is different on 
each platform. The ar_DZ.UTF-8 locale uses a 6-byte codeset on Linux, but a 
4-byte codeset on other platforms.

Options...

I can provide one database per platform that includes all of the locale 
information for that platform. I could write a utility to create this file for 
each platform. I could even opt to use this file as the list of installed 
locales instead of checking the output of `locale -a'. The disadvantage is that 
the data would have to be verified or completed manually to handle mapping 
native locale names like 'czech' to a canonical name. Maybe we could skip 
these? If so, then maybe we could generate this file on the fly before running 
any tests.
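
A utility like that would need to record per-locale data on the platform it 
runs on. A minimal sketch of one such query is below, using the standard 
setlocale() function and the standard C macro MB_CUR_MAX (the maximum number 
of bytes per multibyte character in the current locale). The function name 
query_mb_cur_max is hypothetical.

```cpp
#include <clocale>
#include <cstdlib>
#include <string>

// Hypothetical helper: return MB_CUR_MAX for the named locale,
// or 0 if the locale cannot be selected on this system.
int query_mb_cur_max(const char* name)
{
    // remember the current LC_CTYPE setting so it can be restored
    const char* current = std::setlocale(LC_CTYPE, 0);
    const std::string saved = current ? current : "C";

    int result = 0;
    if (std::setlocale(LC_CTYPE, name))
        result = static_cast<int>(MB_CUR_MAX);  // depends on LC_CTYPE

    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

Running this over every name reported by the system would give the 
per-platform values, which is exactly where the differences described above 
show up.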

Another option would be to have a separate mapping for each of the locale name 
components. That makes it possible to map from 'English' to 'en' or from 
'iso88591' to 'ISO-8859-1' so I can build up the complete canonical locale name 
from the canonical form of each component. The disadvantage with this is that I 
may have trouble mapping from locale names like 'czech' to a single canonical 
name. Maybe I should skip these?
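
A sketch of the per-component approach follows. It assumes native names have 
the shape language_country.codeset and that three separate lookup tables 
exist; the canonicalize name and the NameMap type are made up for 
illustration. Names that lack a field, or whose fields are not in the tables, 
produce an empty string so the caller can skip them.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical lookup table: native spelling -> canonical spelling.
typedef std::map<std::string, std::string> NameMap;

// Build <language>_<COUNTRY>.<CODESET> from a native locale name.
// Returns an empty string when the name cannot be mapped, for example
// a single-word name like 'czech' that has no country or codeset field.
std::string canonicalize(const std::string& native,
                         const NameMap& languages,
                         const NameMap& countries,
                         const NameMap& codesets)
{
    const std::size_t npos = std::string::npos;

    const std::size_t underscore = native.find('_');
    if (underscore == npos)
        return std::string();
    const std::size_t dot = native.find('.', underscore);
    if (dot == npos)
        return std::string();

    // split the native name into its three fields
    const std::string lang    = native.substr(0, underscore);
    const std::string country = native.substr(underscore + 1,
                                              dot - underscore - 1);
    const std::string codeset = native.substr(dot + 1);

    // map each field independently
    const NameMap::const_iterator l = languages.find(lang);
    const NameMap::const_iterator c = countries.find(country);
    const NameMap::const_iterator s = codesets.find(codeset);
    if (l == languages.end() || c == countries.end() || s == codesets.end())
        return std::string();

    return l->second + '_' + c->second + '.' + s->second;
}
```

With this design the tables stay small, since each spelling of a language, 
country, or codeset is listed once instead of once per combination.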

[[Anchor(References)]]
= References =

discussion 
[http://www.nabble.com/low-hanging-fruit-while-cleaning-up-test-failures-to13634803.html#a15137334]

std-country (ISO-3166) 
[http://www.iso.org/iso/english_country_names_and_code_elements]

std-lang (ISO-639) [http://www.loc.gov/standards/iso639-2/php/English_list.php]

std-codeset [http://www.iana.org/assignments/character-sets]
