RE: low hanging fruit while cleaning up test failures

Travis Vitek Mon, 28 Jan 2008 09:54:02 -0800

 

>Martin Sebor wrote:
>
>Travis Vitek wrote:
>> 
>> Okay, I think I've finally got something that will be useful
>> to someone. I'm attaching the patch to STDCXX-608
>> [https://issues.apache.org/jira/browse/STDCXX-608] for review.
>> 
>> There is a lot of code, and the feature is not 100% complete 
>> yet. I need to create a test, deprecate the old rw_locales()
>> function, and come up with a way to locate the input files
>> without actually requiring that the environment variable TOPDIR
>> be defined.
>> 
>> The system is fairly simple from the public interface. A new 
>> public type rw_locale_entry_t
>
>What's the advantage of returning a list instead of just a character
>string like rw_locales() does? (Wouldn't it be simpler to just stick
>with the same interface?)
>


The advantage was not having to duplicate the data. I already have the
locale name in each of the rw_locale_entry_t objects and didn't see any
reason to duplicate the data in that list.

That said, I believe that I can easily write that on top of the
framework that I have now. That would minimize the impact on the
existing tests, which would be an advantage.
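
A rough sketch of what such a wrapper could look like, not the code in
the patch. The locale_entry struct and flatten_names() are hypothetical
stand-ins (the real rw_locale_entry_t has more members), and the output
format, each name followed by a NUL with an empty string ending the list,
is an assumption about what existing callers of rw_locales() expect:

```cpp
#include <string>

// Hypothetical stand-in for the entry type in the patch; the real
// rw_locale_entry_t has more members and may be laid out differently.
struct locale_entry {
    const char*         name;   // locale name as reported by the system
    const locale_entry* next;   // next entry, or null at the end of the list
};

// Copies the names in the list into one buffer. Each name is followed
// by a NUL, and one extra NUL (an empty name) terminates the list.
std::string flatten_names (const locale_entry* head)
{
    std::string buf;

    for (const locale_entry* e = head; e; e = e->next) {
        buf += e->name;
        buf += '\0';
    }

    buf += '\0';   // the empty string marks the end of the list
    return buf;
}
```

The names do get copied once here, but only for callers that still want
the string form; the list itself stays the single source of the data.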

>
>I assume the rw_locale_entry_t members language, country, and encodings
>are populated with our canonical names for each, correct?
>

Yes. All members are canonical except for the `name' field.

>Is the rw_locale_entry_t::name member populated with a string specific
>to each operating system? If so, which of the two forms does it use:
>the one returned by locale -a or the one returned by setlocale()?
>

The name member holds the result of the call to setlocale (LC_CTYPE,
name), which should just be a single name. This is something that I
wasn't absolutely sure about. See below...

>FYI, the setlocale() names can be really long on some platforms (e.g.,
>on HP-UX, they always take the form:
>/<category>/<category>/<category>/<category>/<category>/<category>
>so 64 characters may not be enough for all locales).

I thought this was only the case when using setlocale (LC_ALL, ...)
because the OS needs to return a string that indicates which locales are
used for each category so that the return of setlocale (LC_ALL, 0) can
be used to restore the locale to a previous state. I believe that this
is the way that it works on HP and AIX.

So here is the thing that I am concerned about. The previous code
allowed you to specify which locale facet you wanted to get locales for.
I didn't understand how or why this was useful. I believe that I do now.
Say a call to setlocale (LC_ALL, "X") returns the string "/A/B/C/D/E/F".
If I just capture the result of setlocale (LC_CTYPE, 0), I'm not going
to see that the other facets are set differently. I need to store the
result of locale -a, or the names of the locales used by each of the
components.
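
To make the concern concrete, here is a minimal sketch of querying each
category separately after setting LC_ALL. The function category_names()
is hypothetical and not part of the patch; it only uses the five
categories that standard C guarantees besides LC_ALL, and what the
returned names look like is entirely platform-specific:

```cpp
#include <clocale>
#include <string>
#include <vector>

// Switches every category to `name`, records the name in effect for
// each of five standard categories, then restores the previous locale.
// Returns an empty vector if the locale cannot be set.
std::vector<std::string> category_names (const char* name)
{
    static const int cats [] = {
        LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, LC_TIME
    };

    std::vector<std::string> names;

    // setlocale() may reuse its buffer, so copy the result right away
    const char* const cur = std::setlocale (LC_ALL, 0);
    const std::string saved (cur ? cur : "C");

    if (std::setlocale (LC_ALL, name)) {
        for (unsigned i = 0; i != sizeof cats / sizeof *cats; ++i) {
            const char* const s = std::setlocale (cats [i], 0);
            names.push_back (s ? s : "");
        }
    }

    // put back whatever was in effect before the call
    std::setlocale (LC_ALL, saved.c_str ());
    return names;
}
```

If the entries in the result differ from one another, a single
setlocale (LC_CTYPE, 0) result is not enough to describe the locale.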

>
>> has been added. It represents a link in a linked list
>> of installed locales. The new function rw_all_locales() 
>> gives you a pointer to the first item in a sorted list of
>> installed locales. Another new function rw_locale_query()
>> takes a query string and a count, and it returns a pointer
>> to the first entry in a linked list of locale entries that
>> match the provided query string. The count parameter is
>> used to limit the number of locales in the linked list.
>
>Is the user responsible for freeing the linked list? If so, how?
>

No. It is 'memory in use at program termination', just like the locale
name buffer in the original rw_locales() function.

>> As an example, imagine that I want to find up to 10 locales
>> for Japan or China that have MB_CUR_MAX of 4 or 3. You could
>> get that list of locales with the following query...
>> 
>>   const rw_locale_entry_t* e = rw_locale_query ("C=JP|CN M=4|3", 10);
>
>I assume the AND operator is implicit between the subexpressions?
>I.e., the query is equivalent to
>
>   (C == "JP" || C == "CN") && (MB_CUR_MAX == 3 || MB_CUR_MAX == 4)
>
>I'd like to make a simplifying suggestion regarding the query syntax
>[ducks] ;-) First, I'd like to suggest to drop the attributes C, E,
>and L, and instead assume the standard canonical locale name in the
>form <language>_<country>.<encoding>. Second, since we already have
>simple pattern matching in the form of rw_fnmatch() and since at
>some point we'll need to add shell brace expansion (e.g., for the
>expected failures project), I'd like to propose that rather than
>using our own special syntax here and pattern matching and brace
>expansion elsewhere, we start with both here as well.

This is similar to what I started out with. The only disadvantage is
that it doesn't allow you to prioritize one attribute over another.
Nobody ever said we cared about order, but I assumed that we would.
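
For illustration only, one way ordering could be layered on top of plain
pattern matching is to try the patterns in priority order. This sketch
uses POSIX fnmatch() in place of rw_fnmatch(), and match_in_order() is a
hypothetical helper, not something in the patch or the test driver:

```cpp
#include <fnmatch.h>   // POSIX; stands in for rw_fnmatch() here

#include <cstddef>
#include <string>
#include <vector>

// Returns the names matching any of the patterns, grouped by pattern:
// everything matching patterns [0] comes first, then patterns [1], etc.
// A name is reported once, under the first pattern it matches.
std::vector<std::string>
match_in_order (const std::vector<std::string>& patterns,
                const std::vector<std::string>& names)
{
    std::vector<std::string> out;
    std::vector<bool> taken (names.size (), false);

    for (std::size_t p = 0; p != patterns.size (); ++p) {
        for (std::size_t n = 0; n != names.size (); ++n) {
            if (taken [n])
                continue;

            if (0 == fnmatch (patterns [p].c_str (),
                              names [n].c_str (), 0)) {
                taken [n] = true;
                out.push_back (names [n]);
            }
        }
    }

    return out;
}
```

With the patterns "*_CN.*" and "*_JP.*" in that order, every Chinese
locale would come out ahead of every Japanese one, regardless of how the
installed locales happen to be sorted.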

>
>With that, the first part of the query string above would look like
>this: "*_{JP,CN}.*"
>
>The shell brace expansion syntax looks something like this:
>
>   string     ::= <brace-expr> | [ <chars> ]
>   brace-expr ::= <string> '{' <brace-list> '}' <string> | <string>
>   brace-list ::= <string> ',' <brace-list> | <string>
>   chars      ::= <pcs-char> <string> | <pcs-char>
>   pcs-char   ::= character in the Portable Character Set
>
>For the rest of the query I wonder if we could come up with a more
>conventional (and possibly more expressive) syntax that we could use
>in other areas as well. I'm thinking something loosely based on
>grep might work, with multiple lines representing a disjunction of
>the expressions on each line, and with subexpressions on the same
>line representing a conjunction of the subexpressions.
>
>So the query string from your example above would look like this:
>
>   *_{JP,CN}.* {3,4}
>
>Internally it would translate into multiple grep-like expressions
>(i.e., arguments to the -e grep option) looking like this:
>
>   *_JP.* 3\n
>   *_JP.* 4\n
>   *_CN.* 3\n
>   *_CN.* 4\n
>

Yes, this would work fine provided that you didn't ever want to get all
4 byte encodings before the 3 byte encodings. I like the syntax much
better though.
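
A minimal sketch of the expansion step, assuming the usual shell
behavior where the leftmost brace group varies slowest. The function
expand_braces() is hypothetical; it is not the driver's implementation
and it makes no attempt to handle quoting or escapes:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Expands the first brace group and recurses on each result, so
// "a{b,c}d{e,f}" yields abde, abdf, acde, acdf in that order.
// Nested groups are handled; a pattern with an unmatched '{' is
// returned unchanged.
std::vector<std::string> expand_braces (const std::string& pat)
{
    std::vector<std::string> out;

    const std::size_t open = pat.find ('{');
    if (open == std::string::npos) {
        out.push_back (pat);
        return out;
    }

    // find the matching '}' and split alternatives at depth-1 commas
    std::vector<std::string> alts;
    std::string cur;
    int depth = 0;
    std::size_t i = open;

    for (; i != pat.size (); ++i) {
        const char c = pat [i];

        if (c == '{') {
            if (depth++ > 0)
                cur += c;       // keep nested braces for the recursion
        }
        else if (c == '}') {
            if (--depth == 0)
                break;          // end of the outermost group
            cur += c;
        }
        else if (c == ',' && depth == 1) {
            alts.push_back (cur);
            cur.clear ();
        }
        else
            cur += c;
    }

    if (i == pat.size ()) {     // no matching '}'
        out.push_back (pat);
        return out;
    }

    alts.push_back (cur);

    const std::string head = pat.substr (0, open);
    const std::string tail = pat.substr (i + 1);

    for (std::size_t a = 0; a != alts.size (); ++a) {
        const std::vector<std::string> sub =
            expand_braces (head + alts [a] + tail);
        out.insert (out.end (), sub.begin (), sub.end ());
    }

    return out;
}
```

For "*_{JP,CN}.* {3,4}" this produces the four patterns in the order
listed above. Writing {4,3} instead only reorders the sizes within each
country, which is the ordering limitation I mentioned.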

>with the whole thing basically being a simplified grep pattern that
>could be used to search in a plain text file in this format:
>
>   <locale> <mb-cur-max> <alias-list>
>

BTW, I never did find a way to get an alias for a locale. If I have the
name of a locale, how can I find the list of aliases?

>If we also wanted to include, say, an English locale in UTF, we
>would write:
>
>   *_{JP,CN}.*  *{3,4}\n
>   en_*.UTF-8
>
>I realize this is a little different from what I outlined earlier
>but after prototyping the "expected failures" solution I think the
>bracket expression will be a very handy tool to add to the driver,
>and since we already have pattern matching in rw_fnmatch() we might
>as well put it to good use.
>

No problem. I'm getting used to doing it the wrong way once or twice
before producing something that is usable. :P

>Martin
>
>
