On 01/11/2014 11:33 AM, Pádraig Brady wrote:
> On 01/11/2014 05:40 AM, Jim Meyering wrote:
>> On Fri, Jan 10, 2014 at 8:52 PM, Jim Meyering <[email protected]> wrote:
>>>> I wonder might this faster path be restricted to a safer but very common
>>>> input subset of:
>>>>
>>>>   (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))
>>>
>>> That sounds like a good approach.
>>> Now I need another test case, to demonstrate that the current code can
>>> cause trouble.
>>
>> Hmm... after thinking about this for a while and actually trying to
>> break the current code (did not find a way to demonstrate a regression),
>> I have concluded that the current approach is no worse than the prior
>> one of matching a case-mapped regexp vs. each case-mapped input line.
>>
>> That's not to say that it's perfect, of course.
>> The "LATIN SMALL LETTER J WITH CARON, COMBINING DOT BELOW" example
>> from gnulib's test-ulc-casecmp.c is a great example: this matches:
>>
>>   printf '\x6A\xCC\x8C\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> but this does not, yet probably should:
>>
>>   printf '\xC7\xB0\xCC\xA3\n'|src/grep -i "$(printf '\x6A\xCC\x8C\xCC\xA3')"
>>
>> Can you see a way to demonstrate a regression?
>
> Oh right, it doesn't handle these cases already.
> Fair enough I don't see a regression then.
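For reference, the two inputs in the quoted example are canonically equivalent spellings of the same text, which is presumably why the second one "probably should" match. The following is only a rough sketch in Python (not grep's code, and not a proposal for how grep should do it); the variable and helper names are made up for illustration:

```python
import unicodedata

# The two byte sequences from the quoted grep commands, decoded as UTF-8.
decomposed = b"\x6A\xCC\x8C\xCC\xA3".decode("utf-8")   # j, U+030C, U+0323
precomposed = b"\xC7\xB0\xCC\xA3".decode("utf-8")      # U+01F0, U+0323

def nfd(s):
    """Canonical decomposition (hypothetical helper, illustration only)."""
    return unicodedata.normalize("NFD", s)

# A code-point comparison sees two different strings ...
same_codepoints = decomposed == precomposed
# ... but after canonical decomposition and reordering they are identical.
same_after_nfd = nfd(decomposed) == nfd(precomposed)

print(same_codepoints, same_after_nfd)
```

So a matcher that compares code points one by one, with or without case mapping, will not see these as equal unless both sides are normalized first.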
This is also a good summary of stuff to consider with case:
http://www.unicode.org/faq/casemap_charprop.html

So picking another case situation from there:

  "in the Greek script, capital sigma (U+03A3) is the uppercase form of
  both the regular (U+03C3) and final (U+03C2) lowercase sigma."

One can see that sed handles this:

  $ printf '\u03C2\u03C3\n' | sed 's/.*/&\U&/'
  ςσΣΣ
  $ printf '\u03A3\n' | sed 's/.*/&\L&/'
  Σσ

Though I was surprised that grep (2.14) didn't match any combo of these:

  $ printf '\u03C2\u03C3\n' | grep -Fi "$(printf \u03A3)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C2)"
  $ printf '\u03A3\n' | grep -Fi "$(printf \u03C3)"

Not a regression of course.

cheers,
Pádraig.
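P.S. As a rough illustration of why the sigma case is awkward for any scheme that maps each character to a single "other case" form, here is a small Python sketch. It says nothing about how grep or sed are implemented; the constant names and the caseless_equal helper are made up for the example:

```python
FINAL_SIGMA = "\u03C2"    # ς, used at the end of a word
SMALL_SIGMA = "\u03C3"    # σ, the regular lowercase form
CAPITAL_SIGMA = "\u03A3"  # Σ

# Both lowercase forms share one uppercase form ...
upper = {c: c.upper() for c in (FINAL_SIGMA, SMALL_SIGMA)}
# ... but the capital, taken on its own, lowercases to only one of them.
lower_of_capital = CAPITAL_SIGMA.lower()

# Case folding sends all three to the same code point, so comparing
# folded strings treats them as equal.
folded = {c.casefold() for c in (FINAL_SIGMA, SMALL_SIGMA, CAPITAL_SIGMA)}

def caseless_equal(a, b):
    """Compare two strings after Unicode case folding (illustrative helper)."""
    return a.casefold() == b.casefold()

print(upper, lower_of_capital, folded)
```

Lowercasing both sides would leave ς and σ distinct, while uppercasing or folding both sides merges all three, so which of the three pairs match depends on the direction of the mapping.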
