Travis Vitek wrote:
Martin Sebor wrote:
But we do need to come up with a sound specification of the query syntax
before implementing any more code.
Okay, the proposed query syntax grammar is essentially the same as that
used for the <config> value in xfail.txt. So we have:
<match> is a shell globbing pattern in the format below. All fields
are required.
iso-language ::= ISO-639-1 or ISO-639-2 two or three character language
code
iso-country ::= ISO-3166 two character country code
iana-codeset ::= IANA codeset name with '-' replaced or removed
Or escaped or quoted? E.g., UTF\-8 or "UTF-8". If it's all the same
to you I would prefer to keep the IANA names unchanged. A good
number of them use the dash to separate two numeric parts of the
name from each other (e.g., ISO-8859-1 and ISO-8859-13) so dropping
the dash would make it difficult to tell one from the other, and
replacing the dash would mean finding a suitable character for the
replacement that's not used in any of the names and that's easy
enough to remember (I suppose the equals sign might qualify if
we had to go that route).
match ::=
<iso-language-expr>-<iso-country-expr>-<iana-codeset-expr>-<mb_cur_len-expr>
match_list ::= match | match ' ' match_list
So the previous example to select `en_US.*' with a 1 byte encoding or
`zh_*.UTF-8' with a 2, 3, or 4 byte encoding would use the following query
string.
en-US-*-1 zh-*-UTF8-2 zh-*-UTF8-3 zh-*-UTF8-4
Okay, this makes it clear that space is an OR. The AND is implicit
in the dash, and there's no need for the '\n'.
This long expression could be written using a brace expansion to simplify
it.
en-US-*-1 zh-*-UTF8-{2,3,4}
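
As a rough illustration of the brace expansion above, here is a minimal
sketch in C. The function name expand_braces() is hypothetical and not part
of any existing code; it assumes a single, non-nested {a,b,c} group per
pattern and simply rewrites the pattern as the equivalent space-separated
list of alternatives.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: expands the first {a,b,c} group in `pattern` and
   writes the space-separated results to `out`. Nested groups are not
   handled. Returns 0 on success, -1 if the result does not fit or the
   opening brace has no matching closing brace. */
static int expand_braces(const char *pattern, char *out, size_t outsize)
{
    const char *open  = strchr(pattern, '{');
    const char *close = open ? strchr(open, '}') : NULL;
    const char *alt;
    size_t used = 0;

    if (!open) {
        /* nothing to expand: copy the pattern unchanged */
        int n = snprintf(out, outsize, "%s", pattern);
        return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
    }
    if (!close)
        return -1;

    for (alt = open + 1; alt <= close; ) {
        /* each alternative ends at the next comma or at the closing brace */
        const char *end = memchr(alt, ',', (size_t)(close - alt));
        int n;

        if (!end)
            end = close;

        /* prefix + alternative + suffix, separated by a single space */
        n = snprintf(out + used, outsize - used, "%s%.*s%.*s%s",
                     used ? " " : "",
                     (int)(open - pattern), pattern,
                     (int)(end - alt), alt,
                     close + 1);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;

        used += (size_t)n;
        alt = end + 1;
    }
    return 0;
}
```

A full query would first be split on spaces, with each pattern expanded
separately before matching.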
I propose that we not support the BRE syntax, simply because it is so
complex.
Which part are you suggesting we not support? I ask because I don't
recall us talking about supporting the full BRE or anything beyond
the subset already implemented in rw_fnmatch().
Yes, it might be quite easy to prototype a solution using grep and
other shell utilities, but providing a complete implementation in C [where
we actually need it] is going to be difficult at best. For our purposes,
shell globbing should be sufficient.
I suppose you could read en-US-*-1 as "language=en" and "country=US" and
"codeset=*" and "mb_cur_len=1", so that '-' represents an intersection
operation, but I prefer to think of the entire expression as either a match
or not a match.
Sure. I personally don't see a difference between the two from
a practical POV.
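
To make the "space is an OR" reading concrete, here is a minimal sketch in
C. It uses POSIX fnmatch() as a stand-in for rw_fnmatch(), whose exact
interface is not shown in this thread. The name query_matches() and the
256-byte pattern limit are assumptions made for the example. The key is
assumed to be a string such as "en-US-UTF8-1", built from the locale's
language, country, codeset, and MB_CUR_MAX.

```c
#include <fnmatch.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `key` (for example "en-US-UTF8-1")
   matches any of the space-separated patterns in `query`, 0 otherwise. */
static int query_matches(const char *query, const char *key)
{
    char pattern[256];   /* assumed upper bound on a single pattern */

    while (*query) {
        size_t len;

        /* skip the spaces that separate alternatives */
        while (*query == ' ')
            ++query;

        len = strcspn(query, " ");
        if (len == 0)
            break;

        /* patterns that do not fit the buffer are skipped, not truncated */
        if (len < sizeof pattern) {
            memcpy(pattern, query, len);
            pattern[len] = '\0';

            /* one matching alternative is enough: space acts as OR */
            if (fnmatch(pattern, key, 0) == 0)
                return 1;
        }
        query += len;
    }
    return 0;
}
```

Each pattern is matched against the whole key, so the sketch treats an
expression as either a match or not a match rather than intersecting the
four fields one by one.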
Martin Sebor wrote:
I think it's great to put together a prototype at the same time, just as
long as it's understood that the prototype might need to change as we
discover flaws in it or better ways of doing it.
I have no problem with flaws or small improvements. When we start talking
about implementing a regular expression parser I get concerned.
I fully agree that implementing regular expressions just for this
project would be overkill. I don't think I ever suggested that we
implement BRE for this though. If I ever mentioned BRE (e.g., on
the wiki) I was referring to the subset used for fnmatch globbing.
If I somehow gave the impression that I was proposing we implement
it now I apologize for confusing things.
Martin