On Fri, Jan 10, 2014 at 5:49 PM, Pádraig Brady <[email protected]> wrote:
> Cool so it does this transformation:
>
>   sed 's/./[\L&\U&]/g'
>
> Though multi byte case handling has all sorts of edge cases (pardon the pun),
> and it may not be always valid to treat each character independently?
> For example see some of the tests in:
> http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD

It seems you're right.  Since the case mapping is many-to-one in some
cases, simply pairing one lowercase character with its one uppercase
counterpart won't cover all possibilities.

> I wonder might this faster path be restricted to a safer but very common
> input subset of:
>
>   (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))

That sounds like a good approach.
Now I need another test case, to demonstrate that the current code
can cause trouble.

> Also are the following printfs in the test redundant?
>
>> +data=$(      printf "I:$I $i:i")
>> +search_str=$(printf "$i:i I:$I")

Good catch.  Those were vestiges of the pre-factoring code, where they
were needed.  Here's the patch to fix that part, in your name:

From 97d3430c75a9dd82d871eca170b13c1f8d895fad Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Fri, 10 Jan 2014 20:42:53 -0800
Subject: [PATCH] tests: remove superfluous uses of printf

* tests/turkish-eyes: Remove unnecessary uses of printf.
---
 tests/turkish-eyes | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/turkish-eyes b/tests/turkish-eyes
index 323eb35..68301e7 100755
--- a/tests/turkish-eyes
+++ b/tests/turkish-eyes
@@ -34,8 +34,8 @@ echo I | LC_ALL=$L grep -i i > /dev/null \
 I=$(printf '\304\260') # capital I with dot
 i=$(printf '\304\261') # lowercase dotless i
 
-data=$( printf "I:$I $i:i")
-search_str=$(printf "$i:i I:$I")
+ data="I:$I $i:i"
+search_str="$i:i I:$I"
 
 printf "$data\n" > in || framework_failure_
 LC_ALL=$L grep -i "^$search_str\$" in > out || fail=1
-- 
1.8.5.2.229.g4448466
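
For illustration only, here is a rough sketch of how the restriction suggested above might gate the bracket transformation. It is not the code in grep: the function names, the explicit mb_cur_max parameter (a caller would pass MB_CUR_MAX after setlocale), and the choice to bracket only letters (the sed command brackets every character) are all assumptions made for this example. Regex metacharacters in the pattern are not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: true when byte C falls in the proposed safe
   subset, i.e. a single-byte locale, or an ASCII byte in UTF-8.
   MB_CUR_MAX is passed in so the predicate can be exercised without
   switching locales.  */
static bool
safe_to_fold_bytewise (unsigned char c, size_t mb_cur_max, bool in_utf8)
{
  return mb_cur_max == 1 || (in_utf8 && c < 0x80);
}

/* Hypothetical helper: rewrite each letter X of PAT as "[xX]" and copy
   every other byte unchanged.  Return a malloc'd string, or NULL when
   PAT contains a byte outside the safe subset (or on allocation
   failure), in which case the caller would take the slower path.  */
static char *
fold_to_brackets (const char *pat, size_t mb_cur_max, bool in_utf8)
{
  assert (pat != NULL);
  size_t n = strlen (pat);
  char *out = malloc (4 * n + 1);   /* worst case: every byte is a letter */
  if (!out)
    return NULL;

  char *p = out;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char c = pat[i];
      if (!safe_to_fold_bytewise (c, mb_cur_max, in_utf8))
        {
          free (out);
          return NULL;
        }
      if (isalpha (c))
        {
          *p++ = '[';
          *p++ = tolower (c);
          *p++ = toupper (c);
          *p++ = ']';
        }
      else
        *p++ = c;
    }
  *p = '\0';
  return out;
}
```

With this sketch, a pattern containing the bytes '\304\260' (capital I with dot, as in tests/turkish-eyes) is rejected whenever mb_cur_max is greater than 1, so it would never reach the per-byte transformation. Whether the ASCII subset is itself safe in every locale is a separate question that this sketch does not settle.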
