Re: [fltk.development] utf-8 to do list - checking the basics ?

Greg Ercolano Tue, 31 Mar 2009 08:45:05 -0700

matthiasm wrote:
> On 31.03.2009, at 01:25, Duncan Gibson wrote:
> 
>> I've just been looking through fl_utf8.cxx and came across:
>>
>> /*
>> * return 0 if the strings are equal;
>> * return 1 if s1 is greater than s2
>> * return -1 if s1 is less than s2
>> */
>> int fl_utf_strcasecmp(const char *s1, const char *s2)
>> {
>>        int s1_l = strlen(s1);
>>        int s2_l = strlen(s2);
>>
>>        if (s1_l < s2_l) {
>>                return -1;
>>        } else if (s1_l > s2_l) {
>>                return 1;
>>        }
>>        return fl_utf_strncasecmp(s1, s2, s1_l);
>> }
> 
> Um, that is very wrong in many respects:
> 
> strlen() does return the size of the array in bytes, not the length of  
> the string in characters. It can not be used in this case.
> 
> Now if we do have s1_l and s2_l, we must do a strncasecmp with the len  
> set to the minimum of both. If s1_l equals s2_l we can simply return  
> the result of strncasecmp. The same is true if the result is not zero.
> 
> If strncasecmp returns zero and s1 is longer than s2, we must return  
> 1, and if s2 is longer than s1, we return -1, because "abcd" comes  
> before "abcde", just like "Miller" is before "Millers" in the phone  
> book.


        Peer review! ;)

        Yes, and that could probably be optimized a bit too, since if both
        strings are 1000 chars long, but are different by the first few chars,
        the two strlen()'s still walk the entire length of both strings..
        painful if used in a sort().. a common use of strcmp() and friends.

        This looks like one of those situations where the code should be
        replicated instead of reused, optimized so strlen()'s aren't needed.

        At very least, I'd say add a \todo saying it needs future optimization
        to avoid the strlen's.

        We should probably have regression tests for this as well, to test
        for common problems and boundary cases.

        I had to do this with one of my string libraries.. basically a
        small test program that does a lot of string tests, and throws
        error messages to stderr, and exit codes so that 'make test'
        can fail on errors.

        Could be as simple as:

int main() {
  int passed = 0;
  {
    const char *a = "AbCdEf", *b = "abcdef";
    int expect = 0, got = fl_utf_strcasecmp(a,b);
    if ( expect != got ) {
      fprintf(stderr, "FAIL: fl_utf_strcasecmp(%s,%s): got %d, expected 
%d\n",a,b,got,expect);
      exit(1);
    }
    ++passed;
  }
  {
    const char *a = "Miller", *b = "Millers";
    int expect = -1, got = fl_utf_strcasecmp(a,b);
    got = ( got < 0 ) ? -1 : (got > 0) ? 1 : got;
    if ( expect != got ) {
      fprintf(stderr, "FAIL: fl_utf_strcasecmp(%s,%s): got %d, expected 
%d\n",a,b,got,expect);
      exit(1);
    }
    ++passed;
  }
  ..etc..
  fprintf(stderr, "PASSED: %d TESTS\n", passed);
  exit(0);
}

        Possibly we can nab existing utf8 stress testing code from
        somewhere.
_______________________________________________
fltk-dev mailing list
fltk-dev@easysw.com
http://lists.easysw.com/mailman/listinfo/fltk-dev

Re: [fltk.development] utf-8 to do list - checking the basics ?

Reply via email to