bug#16232: [PATCH] grep: make --ignore-case (-i) faster (sometimes 10x) in multibyte locales

Pádraig Brady Sat, 11 Jan 2014 06:18:18 -0800

On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <[email protected]> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common 
>>>> input subset of:
>>>>
>>>> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>>     printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>>     printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
> 
> Oh right, it doesn't handle these cases already.
> Fair enough, I don't see a regression then.


This is also a good summary of things to consider with case mapping:
http://www.unicode.org/faq/casemap_charprop.html

So picking another case situation from there:
  "in the Greek script, capital sigma (U+03A3) is the uppercase form of both
   the regular (U+03C3) and final (U+03C2) lowercase sigma."

One can see that sed handles this:
  $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
  ςσΣΣ
  $ printf '\u03A3\n' | sed 's/.*/&\L&/'
  Σσ

Though I was surprised that grep (2.14) didn't match any combination of these:
  $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"

Not a regression of course.
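
A caveat for anyone reproducing the three grep commands above: the pattern
argument is written as $(printf \u03A3) with the escape unquoted. In a POSIX
shell an unquoted backslash is removed before printf runs, so printf receives
the plain string u03A3. The grep invocations may therefore have searched for
those five ASCII characters rather than for a sigma; that is an inference from
shell quoting rules, not something re-tested against grep 2.14. A minimal
sketch of the difference follows. The variable names are illustrative, and
octal escapes stand in for \u only because every POSIX printf accepts them:

```shell
# Unquoted escape: the shell strips the backslash first,
# so printf is handed the literal argument u03A3.
unquoted="$(printf \u03A3)"
printf '%s\n' "$unquoted"    # prints: u03A3

# Quoted escape: printf itself interprets the backslash sequences.
# \316\243 are the UTF-8 bytes (CE A3) of U+03A3, GREEK CAPITAL LETTER SIGMA.
quoted="$(printf '\316\243')"

# Dump the bytes of the quoted result to confirm it is a two-byte sequence.
printf '%s' "$quoted" | od -An -tx1
```

Quoting the pattern, as in "$(printf '\u03A3')", would be the way to make sure
a sigma actually reaches grep.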

cheers,
Pádraig.
