[Stdcxx Wiki] Update of "LocaleLookup" by TravisVitek

Apache Wiki Mon, 10 Mar 2008 18:06:49 -0700

The following page has been changed by TravisVitek:
http://wiki.apache.org/stdcxx/LocaleLookup

------------------------------------------------------------------------------
  
  Once we have this list of expressions, we would enumerate all of the 
installed locales, and then search through them looking for locale names that 
match one of those regular expressions. The actual matching would be done using 
rw_fnmatch().
  
+ [[Anchor(Part1)]]
+ = Part 1 (STDCXX-714) =
+ 
+ The first thing we needed was a function for doing name matching, added to 
the test suite. Martin has already added an implementation of 
[http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/fnmatch.cpp rw_fnmatch](), 
so that is done.
+ 
+ The second thing that we needed was a function to do brace expansion. After 
much discussion, it was decided that the csh brace expansion rules made the 
most sense. Travis provided an implementation of two functions for doing brace 
expansion. The function 
[http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp 
rw_brace_expand]() does a simple brace expansion on the input string. There is 
no special treatment for whitespace, but escapes are properly handled. The 
function [http://svn.apache.org/viewvc/stdcxx/trunk/tests/src/braceexp.cpp 
rw_shell_expand]() does whitespace tokenization and collapse, and then does 
brace expansion on each token, much like the behavior you would see from the 
csh shell.
+ 
+ Just for illustration, consider the following string.
+ 
+ {{{
+    a {1,2} b
+ }}}
+ 
+ If you passed this to rw_brace_expand, the result would be
+ 
+ {{{
+    a 1 b a 2 b
+ }}}
+ 
+ If you passed this to rw_shell_expand, the result would be
+ 
+ {{{
+    a 1 2 b
+ }}}
+ 
+ In most cases you would want to use rw_shell_expand(). '''Perhaps 
''rw_brace_expand'' should become an implementation function and the 
header/source/test should be renamed to 
shellexp.h/shellexp.cpp/0.shellexp.cpp''' 
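
The difference between the two functions can be sketched as follows. This is a simplified illustration, not the code in braceexp.cpp: the names {{{brace_expand}}}, {{{shell_expand}}} and {{{join}}} and their vector-based signatures are hypothetical, and escape handling is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> StringList;

// Hypothetical stand-in for rw_brace_expand(): expands the first brace
// group in s, then recurses on each result. Nested groups are supported;
// escapes are NOT handled in this sketch.
StringList brace_expand(const std::string& s)
{
    const std::size_t open = s.find('{');
    if (open == std::string::npos)
        return StringList(1, s);

    // cuts holds the positions of '{', each top-level ',', and '}'.
    std::vector<std::size_t> cuts(1, open);
    std::size_t close = std::string::npos;
    int depth = 0;
    for (std::size_t i = open; i < s.size(); ++i) {
        if (s[i] == '{') {
            ++depth;
        } else if (s[i] == '}') {
            if (--depth == 0) { close = i; break; }
        } else if (s[i] == ',' && depth == 1) {
            cuts.push_back(i);
        }
    }
    if (close == std::string::npos)
        return StringList(1, s);        // unbalanced: leave untouched
    cuts.push_back(close);
    assert(cuts.size() >= 2);

    const std::string prefix = s.substr(0, open);
    const std::string suffix = s.substr(close + 1);

    StringList out;
    for (std::size_t k = 0; k + 1 < cuts.size(); ++k) {
        const std::string alt =
            s.substr(cuts[k] + 1, cuts[k + 1] - cuts[k] - 1);
        const StringList sub = brace_expand(prefix + alt + suffix);
        out.insert(out.end(), sub.begin(), sub.end());
    }
    return out;
}

// Hypothetical stand-in for rw_shell_expand(): splits on whitespace
// first, then brace-expands each token independently.
StringList shell_expand(const std::string& s)
{
    StringList out;
    std::istringstream in(s);
    std::string token;
    while (in >> token) {
        const StringList sub = brace_expand(token);
        out.insert(out.end(), sub.begin(), sub.end());
    }
    return out;
}

// Joins the results with single spaces for display.
std::string join(const StringList& v)
{
    std::string r;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (i) r += ' ';
        r += v[i];
    }
    return r;
}
```

With these definitions, {{{join(brace_expand("a {1,2} b"))}}} treats the whole string as one word, while {{{join(shell_expand("a {1,2} b"))}}} expands only the middle token.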
+ 
+ [[Anchor(Part2)]]
+ = Part 2 (STDCXX-715) =
+ 
+ Every platform has its own list of available locales, and the naming 
conventions differ. For example, Windows systems use {{{English}}} as a 
language name, but most *nix systems use the canonical {{{en}}} or, in some 
cases, {{{EN}}}. This problem exists for all fields of the locale name.
+ 
+ To deal with this, we need to provide a mapping between the native names and 
the canonical names that we plan to use in the query string. The plan is to 
provide one file that lists, for all platforms, every native locale name and 
the canonical name it maps to. For efficiency, it would be nice if this table 
also included other useful information, such as {{{MB_CUR_LEN}}} for each of 
those locales.
+ 
+ I've collected all of the locale data on each of the platforms that are 
available to me. During this process, I've noticed a few issues with the name 
mapping.
+ 
+ One issue is that a single native locale name may map to different canonical 
locale names on different platforms. For example, {{{es_BO}}} maps to 
{{{es_BO.ISO-8859-15}}} on AIX, but it maps to {{{es_BO.ISO-8859-1}}} on Linux 
and SunOS. Suppose our mapping file looked something like this:
+ 
+ {{{
+   es_BO.ISO-8859-1     es_BO es_BO.ISO8859-1 es_BO.iso88591
+   es_BO.ISO-8859-15    es_BO es_BO.8859-15 ES_BO
+ }}}
+ 
+ If we look up the canonical name {{{es_BO.ISO-8859-1}}} we will see three 
possible native locale names. If we then look through our list of installed 
locales, we will find {{{es_BO}}}, but on a platform such as AIX it would be 
wrong to return that locale, because there {{{es_BO}}} actually uses 
ISO-8859-15.
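
The ambiguity can be reproduced with a small sketch of the lookup. The file layout (canonical name first, native names after it) follows the example above; the names {{{parse_mapping}}} and {{{candidates}}} are hypothetical and not part of the test suite.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

typedef std::vector<std::string> Names;
typedef std::map<std::string, Names> Mapping;

// Parses lines of the form "canonical native1 native2 ...".
Mapping parse_mapping(const std::string& text)
{
    Mapping m;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string canonical, native;
        if (!(fields >> canonical))
            continue;                   // skip blank lines
        while (fields >> native)
            m[canonical].push_back(native);
    }
    return m;
}

// Returns the installed native names that the mapping lists for the
// given canonical name, in mapping-file order.
Names candidates(const Mapping& m, const std::string& canonical,
                 const Names& installed)
{
    Names out;
    const Mapping::const_iterator it = m.find(canonical);
    if (it == m.end())
        return out;
    for (std::size_t i = 0; i < it->second.size(); ++i) {
        const std::string& native = it->second[i];
        if (std::find(installed.begin(), installed.end(), native)
                != installed.end())
            out.push_back(native);
    }
    return out;
}
```

On a system whose installed list contains only {{{es_BO}}}, both canonical names yield the same candidate, so the name alone cannot tell us which codeset the platform actually uses.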
+ 
+ So one solution might be to get the codeset name and store it in the 
mapping. This assumes that it is safe to request a locale with an explicit 
codeset even though the list of installed locales didn't specify the codeset.
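
One way the codeset might be obtained, assuming a POSIX platform that provides {{{nl_langinfo()}}} in {{{<langinfo.h>}}} (this would not work as-is on Windows). The helper name {{{codeset_of}}} is hypothetical, and the returned string is whatever the platform reports, so it would still need to be normalized to the canonical spelling.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>   // POSIX: nl_langinfo(), CODESET

// Returns the codeset the platform reports for the named locale, or an
// empty string if the locale cannot be set. Changes the global locale
// temporarily, so it is not thread-safe.
std::string codeset_of(const char* locale_name)
{
    // Copy the current locale name; the pointer returned by setlocale()
    // may be invalidated by later calls.
    const char* cur = std::setlocale(LC_ALL, 0);
    const std::string saved(cur ? cur : "C");

    std::string codeset;
    if (std::setlocale(LC_ALL, locale_name))
        codeset = nl_langinfo(CODESET);

    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale
    return codeset;
}
```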
+ 
+ Another issue is that the data associated with each of the canonical locales, 
like {{{MB_CUR_LEN}}}, is different on each platform. The {{{ar_DZ.UTF-8}}} 
locale uses a 6 byte codeset on Linux, but a 4 byte codeset on other platforms.
+ 
+ I think the solution for this would be to not store the MB_CUR_LEN value in 
the file, but capture it and append it to the canonical locale name when we 
enumerate the installed locales.
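
A minimal sketch of that capture step, assuming that {{{MB_CUR_LEN}}} on this page refers to the value of the standard C macro {{{MB_CUR_MAX}}} for the locale. The helper name {{{annotate}}} is hypothetical.

```cpp
#include <clocale>
#include <cstdlib>      // MB_CUR_MAX
#include <sstream>
#include <string>

// Activates the native locale, reads MB_CUR_MAX, and returns
// "<canonical>.<MB_CUR_MAX>". Returns an empty string if the locale
// cannot be set. Changes the global locale temporarily.
std::string annotate(const std::string& canonical, const char* native)
{
    const char* cur = std::setlocale(LC_ALL, 0);
    const std::string saved(cur ? cur : "C");

    std::string result;
    if (std::setlocale(LC_ALL, native)) {
        std::ostringstream os;
        os << canonical << '.' << static_cast<unsigned long>(MB_CUR_MAX);
        result = os.str();
    }

    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale
    return result;
}
```

Because the value is read at enumeration time, the same canonical name can carry a different suffix on each platform without any change to the mapping file.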
+ 
+ [[Anchor(Part3)]]
+ = Part 3 (STDCXX-716) =
+ 
+ The proposed interface to all of this is a single public function named 
rw_query_locales(). The signature would be...
+ 
+ {{{
+   char* rw_query_locales(const char* query, size_t count);
+ }}}
+ 
+ The {{{query}}} parameter will be the query string. The {{{count}}} parameter 
is the maximum number of locales to return. This allows you to easily limit the 
number of locales tested.
+ 
+ The expected format of the query string is similar to what is described 
above, except that the requested MB_CUR_LEN value is expected to be part of 
the query string. The accepted MB_CUR_LEN value would be separated from the 
canonical locale name expression by a period. An example query string:
+ 
+ {{{
+    "zh_*.*.{5..3} *_FR.*.1"
+ }}}
+ 
+ This would match all 5, 4 and 3 byte encodings of the Chinese language in any 
country, then all 1 byte encodings for any language spoken in France.
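
The matching step might look roughly like the sketch below. It is not the proposed {{{rw_query_locales()}}}: the function name and vector-based signature are hypothetical, POSIX {{{fnmatch()}}} stands in for {{{rw_fnmatch()}}}, the query is assumed to have been brace-expanded already, and the installed names are assumed to carry the MB_CUR_LEN suffix.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>
#include <fnmatch.h>    // POSIX fnmatch(), standing in for rw_fnmatch()

typedef std::vector<std::string> Names;

// Hypothetical sketch: `query` is a whitespace-separated list of
// already brace-expanded patterns; `installed` holds canonical names
// suffixed with ".<MB_CUR_LEN>". Returns at most `count` names, ordered
// by the pattern that matched them first.
Names query_locales(const std::string& query, const Names& installed,
                    std::size_t count)
{
    Names out;
    std::istringstream in(query);
    std::string pattern;
    while (out.size() < count && in >> pattern) {
        for (std::size_t i = 0;
             i < installed.size() && out.size() < count; ++i) {
            const std::string& name = installed[i];
            const bool matches =
                fnmatch(pattern.c_str(), name.c_str(), 0) == 0;
            const bool seen =
                std::find(out.begin(), out.end(), name) != out.end();
            if (matches && !seen)
                out.push_back(name);
        }
    }
    return out;
}
```

Processing the patterns in query order keeps the Chinese locales ahead of the French ones, which matters when {{{count}}} truncates the result.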
+ 
+ '''Perhaps we should consider adding an additional parameter to prepend the 
C/POSIX locales as there is no way to match them using the canonical locale 
name matching rules we've laid out above.'''
  
  [[Anchor(References)]]
  = References =
