Hi Denys!

> The world seems to be standardizing on utf-8.
Thank God; supporting a gazillion encodings is no fun.

You say this, but libbb/unicode.c contains a unicode_strlen that calls a complex mb-to-wc conversion function just to count characters. Those multi-byte functions tend to be highly complex and slow (I don't know whether they have gotten better). For UTF-8 alone, things can be optimized.

e.g.

#include <stddef.h> /* size_t */

/* Count UTF-8 characters rather than bytes.  Continuation bytes are
   10xxxxxx (0x80..0xBF); XOR with 0x40 maps them to 0xC0..0xFF, so
   every byte that lands below 0xC0 starts a character.  The unsigned
   char cast matters: with a plain (signed) char the test goes wrong
   for every byte >= 0x80. */
size_t utf8len( const char* s )
{
  size_t n = 0;
  while (*s)
    if (((unsigned char)*s++ ^ 0x40) < 0xC0)
      n++;
  return n;
}

#include <string.h> /* strlen */

extern int utf8_enabled; /* whatever flag says the locale is UTF-8 */

size_t mystrlen( const char* s )
{
  return utf8_enabled ? utf8len(s) : strlen(s);
}

This is more code, but it avoids pulling in the mb functions, and most compilers should produce fast code for utf8len.

utf8len is for UTF-8-only usage; mystrlen can be used to switch between an 8-bit locale and UTF-8. If we could switch to UTF-8 only, we could forget about mystrlen and always use utf8len.
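
To make the byte-vs-character difference concrete, here is a tiny
standalone check (my own sketch, assuming utf8len as defined above):

#include <assert.h>
#include <string.h>

int main(void)
{
  const char *s = "h\xC3\xA9llo"; /* "héllo": é is the two bytes C3 A9 */
  assert(strlen(s) == 6);   /* strlen counts bytes */
  assert(utf8len(s) == 5);  /* utf8len counts characters */
  return 0;
}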


Another fast function I use for UTF-8: skip to the Nth UTF-8 character in a string (it returns a pointer to the trailing \0 if N > number of UTF-8 chars in the string):

/* The inner loop steps over continuation bytes, using the same XOR
   trick (and the same unsigned char cast) as utf8len above. */
char *utf8skip( char const* s, size_t n )
{
  for ( ; n && *s; --n )
    while (((unsigned char)*++s ^ 0x40) >= 0xC0)
      continue;
  return (char*)s;
}
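
As an aside, a character-indexed byte count then comes almost for free.
Here is a sketch (utf8nbytes is a name I made up, built only on
utf8skip above):

/* Byte length of the first n UTF-8 characters of s: handy when a
   string must be cut at a character count rather than a byte count. */
size_t utf8nbytes( const char* s, size_t n )
{
  return (size_t)(utf8skip(s, n) - s);
}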


These are just examples; other functions could be optimized the same way. It all comes down to the question of whether those darn big mb functions should be used at all.
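
A bounded variant would be an easy next step, for example; this is only
a sketch (utf8len_max is a made-up name, same XOR trick as above):

/* Like utf8len(), but stops counting after 'max' characters, for
   callers that only need to know "at least this many". */
size_t utf8len_max( const char* s, size_t max )
{
  size_t n = 0;
  while (*s && n < max)
    if (((unsigned char)*s++ ^ 0x40) < 0xC0)
      n++;
  return n;
}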

--
Harald
