On 02/03/2014 08:20 AM, Norihiro Tanaka wrote:
> echo 'LJ' | LC_ALL=en_US.UTF-8 grep -i Lj
> echo 'Lj' | LC_ALL=en_US.UTF-8 grep -i LJ
> We expect that LJ and Lj are returned, respectively. But both return
> nothing.
Both test cases worked for me. I expect that you meant the cases with
single characters, as in "echo ǉ | LC_ALL=en_US.UTF-8 grep -i ǈ".
I have doubts about this patch, for several reasons.
1. It doesn't solve the problem from the ordinary user's point of view.
For example, "echo lj | LC_ALL=en_US.UTF-8 src/grep -i ǈ" will still
output nothing, because the one-character pattern "ǈ" (U+01C8) does not
match the two-character string "lj" even when the latter's two-letter
case variants "Lj", "lJ", "LJ" are considered.
2. The characters in question are present in Unicode only for
compatibility with previous standards; they're not intended to be used
in new text. So this is a problem of the past, one that has mostly died
out already.
3. Because of (2) the characters in question are rare, even in the
languages where one might naively think they're useful. For example, the
Croatian Wikipedia page for Ljubljana
<http://hr.wikipedia.org/wiki/Ljubljana> consistently uses the
two-character forms "Lj" and "lj", not the one-character forms "ǈ"
(U+01C8) and "ǉ" (U+01C9).
4. The solution doesn't generalize to similar problems in
more-complicated orthographies. For example, in polytonic Greek when
ignoring case ordinary users would expect "ᾄ" (U+1F84) to match not only
"ᾌ" (U+1F8C), but also "Α" (U+0391), "ΑΙ" (U+0391, U+0399; two
characters) and "Αι" (U+0391, U+03B9). Worse, this depends on context:
often "ᾄ" should not match "Αι" when ignoring case. For details on this,
please see Nick Nicholas's discussion "Titlecase and Adscripts"
<http://www.tlg.uci.edu/~opoudjis/unicode/unicode_adscript.html>.
5. When POSIX specifies how to match a regular expression while ignoring
case, it talks only about "uppercase or lowercase"
<http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_02>.
If we change 'grep' along the lines being suggested, we'll either have
to change POSIX, or have the change take effect only if POSIXLY_CORRECT
is not set.
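
To make points 1 through 4 concrete, here is a rough Python sketch (not
grep or dfa.c code, and not part of the patch under discussion). It
relies only on Python's built-in Unicode tables; the helper names
describe() and casefold_equal() are made up for illustration. It shows
that each digraph character has three case forms, that full case
folding still does not equate the one-character form with the
two-character ASCII string, and that uppercasing "ᾄ" yields two
characters.

```python
import unicodedata

# The three one-character digraph forms: uppercase, titlecase, lowercase.
DIGRAPHS = ["\u01c7", "\u01c8", "\u01c9"]


def describe(ch):
    """Collect the case forms and compatibility decomposition of ch."""
    return {
        "category": unicodedata.category(ch),
        "upper": ch.upper(),
        "lower": ch.lower(),
        "title": ch.title(),
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


def casefold_equal(a, b):
    """Compare two strings using Unicode full case folding."""
    return a.casefold() == b.casefold()


for ch in DIGRAPHS:
    print(f"U+{ord(ch):04X}", describe(ch))

# Point 1: folding equates the three one-character forms with each
# other, but not with the two-character ASCII spelling.
print(casefold_equal("\u01c7", "\u01c8"))
print(casefold_equal("\u01c9", "lj"))

# Point 4: the full uppercase mapping of U+1F84 is two characters long,
# while its titlecase mapping is the single character U+1F8C.
greek = "\u1f84"
print([f"U+{ord(c):04X}" for c in greek.upper()])
print([f"U+{ord(c):04X}" for c in greek.title()])
```

Only the compatibility decomposition (NFKC) turns the one-character
forms into "LJ", "Lj" and "lj", and that is a normalization step, not a
case mapping.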
Taking all this into consideration, it sounds like we should let
sleeping dogs lie; that is, dfa.c should do only the minimal work
needed to support traditional case-insensitive matching a la POSIX.
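
For illustration only, a literal "uppercase or lowercase" rule can be
sketched in Python as follows. This is an assumption about how one
might read the POSIX wording, not a description of what dfa.c does; the
function names case_counterparts() and char_matches() are hypothetical,
and Python's own case tables stand in for the C locale's towupper and
towlower.

```python
def case_counterparts(ch):
    """Return ch plus its one-character uppercase and lowercase forms.

    Mappings that expand to more than one character are ignored, since
    a per-character rule has no way to use them.
    """
    forms = {ch}
    for mapped in (ch.upper(), ch.lower()):
        if len(mapped) == 1:
            forms.add(mapped)
    return forms


def char_matches(pattern_ch, text_ch):
    """True if text_ch is pattern_ch or one of its case counterparts."""
    return text_ch in case_counterparts(pattern_ch)


# A titlecase pattern reaches both of its neighbours...
print(char_matches("\u01c8", "\u01c7"), char_matches("\u01c8", "\u01c9"))
# ...but an uppercase or lowercase pattern never reaches the titlecase form.
print(char_matches("\u01c7", "\u01c8"), char_matches("\u01c9", "\u01c8"))
```

Under this reading the titlecase form is matched asymmetrically, which
is the kind of corner that a strictly upper/lower rule leaves alone.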