On Fri, Jan 10, 2014 at 5:49 PM, Pádraig Brady <[email protected]> wrote:
> Cool so it does this transformation:
>
>   sed 's/./[\L&\U&]/g'
>
> Though multi byte case handling has all sorts of edge cases (pardon the pun),
> and it may not be always valid to treat each character independently?
> For example see some of the tests in:
> http://git.sv.gnu.org/gitweb/?p=gnulib.git;a=blob;f=tests/unicase/test-ulc-casecmp.c;hb=HEAD

It seems you're right.  Since case mapping is many-to-one in some
cases, pairing each character with just one lower-case and one
upper-case form won't cover all possibilities.
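
For reference, here is a minimal sketch of the quoted sed transformation
on plain ASCII input, where treating each character independently is
safe.  It assumes GNU sed (\L and \U are GNU extensions), and
to_brackets is a hypothetical helper name used only for illustration,
not anything in grep itself.

```shell
# Hypothetical helper: expand each character of the argument into a
# two-element bracket expression holding its lower- and upper-case forms.
# Assumes GNU sed; LC_ALL=C makes '.' match one byte at a time.
to_brackets() {
  printf '%s\n' "$1" | LC_ALL=C sed 's/./[\L&\U&]/g'
}

to_brackets abc   # prints [aA][bB][cC]
```

For "I" this yields [iI], yet U+0131 (lowercase dotless i) also has "I"
as its simple uppercase mapping, so a two-element bracket cannot list
every character that is a case variant of "I".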

> I wonder might this faster path be restricted to a safer but very common 
> input subset of:
>
> (MB_CUR_MAX == 1 || (in_utf8 && *c < 0x80))

That sounds like a good approach.
Now I need another test case, to demonstrate that the current code can
cause trouble.
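
A rough shell sketch of the "safe subset" idea, under stated
assumptions: it models only the UTF-8 half of the condition above (every
byte below 0x80), checks a whole string rather than one character at a
time, and ignores the MB_CUR_MAX == 1 case.  is_ascii_only is a
hypothetical name, not a function in grep.

```shell
# Hypothetical helper: succeed only if the argument has no byte >= 0x80.
is_ascii_only() {
  # Delete every ASCII byte (0x00-0x7F); anything left over is non-ASCII.
  [ -z "$(printf '%s' "$1" | LC_ALL=C tr -d '\000-\177')" ]
}

I=$(printf '\304\260') # capital I with dot, as in tests/turkish-eyes
is_ascii_only "abc" && echo "abc: fast path is safe"
is_ascii_only "$I"  || echo "dotted I: fall back to the general path"
```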

> Also are the following printfs in the test redundant?
>
>> +data=$(      printf "I:$I $i:i")
>> +search_str=$(printf "$i:i I:$I")

Good catch.  Those were vestiges of pre-factoring code, where they
were needed.  Here's the patch to fix that part, in your name:

From 97d3430c75a9dd82d871eca170b13c1f8d895fad Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Fri, 10 Jan 2014 20:42:53 -0800
Subject: [PATCH] tests: remove superfluous uses of printf

* tests/turkish-eyes: Remove unnecessary uses of printf.
---
 tests/turkish-eyes | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tests/turkish-eyes b/tests/turkish-eyes
index 323eb35..68301e7 100755
--- a/tests/turkish-eyes
+++ b/tests/turkish-eyes
@@ -34,8 +34,8 @@ echo I | LC_ALL=$L grep -i i > /dev/null \
 I=$(printf '\304\260') # capital I with dot
 i=$(printf '\304\261') # lowercase dotless i

-data=$(      printf "I:$I $i:i")
-search_str=$(printf "$i:i I:$I")
+      data="I:$I $i:i"
+search_str="$i:i I:$I"
 printf "$data\n" > in || framework_failure_

 LC_ALL=$L grep -i "^$search_str\$" in > out || fail=1
-- 
1.8.5.2.229.g4448466
