Hi Denys!

> The world seems to be standardizing on utf-8.
Thank God; supporting a gazillion encodings is no fun.

You say this, but libbb/unicode.c contains a unicode_strlen that calls a complex mb-to-wc conversion function just to count characters. Those multi-byte functions tend to be highly complex and slow (I don't know whether they have gotten better). For UTF-8 alone, things can be optimized.

e.g.

#include <stddef.h> /* size_t */

/* Count UTF-8 characters rather than bytes.  Continuation bytes are
   10xxxxxx (0x80..0xBF); XOR with 0x40 maps them to 0xC0..0xFF, so
   every byte that lands below 0xC0 starts a character.  The unsigned
   char cast matters: with a plain (signed) char the test goes wrong
   for every byte >= 0x80. */
size_t utf8len( const char* s )
{
  size_t n = 0;
  while (*s)
    if (((unsigned char)*s++ ^ 0x40) < 0xC0)
      n++;
  return n;
}

#include <string.h> /* strlen */

extern int utf8_enabled; /* whatever flag says the locale is UTF-8 */

size_t mystrlen( const char* s )
{
  return utf8_enabled ? utf8len(s) : strlen(s);
}

This is more code, but it avoids pulling in the mb functions, and most compilers should produce fast code for utf8len.

utf8len is for UTF-8-only usage; mystrlen can be used to switch between an 8-bit locale and UTF-8. If we could switch to UTF-8 only, we could forget about mystrlen and always use utf8len.
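
To make the byte-vs-character difference concrete, here is a tiny
standalone check (my own sketch, assuming utf8len as defined above):

#include <assert.h>
#include <string.h>

int main(void)
{
  const char *s = "h\xC3\xA9llo"; /* "héllo": é is the two bytes C3 A9 */
  assert(strlen(s) == 6);   /* strlen counts bytes */
  assert(utf8len(s) == 5);  /* utf8len counts characters */
  return 0;
}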


Another fast function I use for UTF-8: skip to the Nth UTF-8 character in a string (it returns a pointer to the trailing \0 if N > number of UTF-8 chars in the string):

/* The inner loop steps over continuation bytes, using the same XOR
   trick (and the same unsigned char cast) as utf8len above. */
char *utf8skip( char const* s, size_t n )
{
  for ( ; n && *s; --n )
    while (((unsigned char)*++s ^ 0x40) >= 0xC0)
      continue;
  return (char*)s;
}
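
As an aside, a character-indexed byte count then comes almost for free.
Here is a sketch (utf8nbytes is a name I made up, built only on
utf8skip above):

/* Byte length of the first n UTF-8 characters of s: handy when a
   string must be cut at a character count rather than a byte count. */
size_t utf8nbytes( const char* s, size_t n )
{
  return (size_t)(utf8skip(s, n) - s);
}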


These are just examples; other functions could be optimized the same way. It all comes down to the question of whether those darn big mb functions should be used at all.
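
A bounded variant would be an easy next step, for example; this is only
a sketch (utf8len_max is a made-up name, same XOR trick as above):

/* Like utf8len(), but stops counting after 'max' characters, for
   callers that only need to know "at least this many". */
size_t utf8len_max( const char* s, size_t max )
{
  size_t n = 0;
  while (*s && n < max)
    if (((unsigned char)*s++ ^ 0x40) < 0xC0)
      n++;
  return n;
}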

--
Harald
