On Wed, Aug 13, 2014 at 07:06:38PM +0200, Harald Becker wrote:
> Hi Denys!
> > > The world seems to be standardizing on utf-8.
> > Thank God, supporting a gazillion encodings is no fun.
>
> You say this, but libbb/unicode.c contains a unicode_strlen calling
> this complex mb-to-wc conversion function to count the number of
> characters. Those multibyte functions tend to be highly complex and
> slow (I don't know if they have gotten better). For just UTF-8,
> things can be optimized.

This depends on your libc. In musl, the only thing slow about them is
having to account for some idiotic special cases the standard allows
(special meanings for null pointers in each of the arguments), and even
that should not be slow on machines with proper branch prediction.
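For concreteness, counting characters through the standard conversion
looks roughly like this (a sketch, not busybox's actual unicode_strlen;
mb_strlen is a made-up name, and it assumes a UTF-8 locale has been
selected via setlocale(LC_CTYPE, ...)):

#include <string.h>
#include <wchar.h>

/* Count characters using the standard multibyte conversion.
 * Unlike a raw byte scan, this also detects malformed input:
 * mbrtowc returns (size_t)-1 for an invalid sequence and
 * (size_t)-2 for a truncated one. */
size_t mb_strlen(const char *s)
{
    mbstate_t st;
    size_t n = 0, left = strlen(s);

    memset(&st, 0, sizeof st);
    while (left) {
        size_t r = mbrtowc(NULL, s, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1; /* invalid or incomplete input */
        s += r;
        left -= r;
        n++;
    }
    return n;
}

Note that the error reporting is part of what you pay for; whether
that's overhead or a feature depends on the caller.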
> e.g.
>
> size_t utf8len( const char* s )
> {
>     size_t n = 0;
>     while (*s)
>         if ((*s++ ^ 0x40) < 0xC0)
>             n++;
>     return n;
> }

This function is only valid if the string is known to be valid UTF-8.
Otherwise it hides errors, which may or may not be problematic
depending on what you're using it for.

> Another fast function I use for UTF-8 ... skip to the Nth UTF-8
> character in a string (returns a pointer to the trailing \0 if N >
> the number of UTF-8 chars in the string):
>
> char *utf8skip( char const* s, size_t n )
> {
>     for ( ; n && *s; --n )
>         while ((*++s ^ 0x40) >= 0xC0);
>     return (char*)s;
> }

This code is invalid: it assumes char is unsigned, and in practice
*++s ^ 0x40 is going to be negative on most archs. Better would be an
unsigned range check like (unsigned char)*++s-0x80<0x40U. Of course it
also gets tripped up badly on invalid sequences.

Rich
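P.S. To make that fix concrete, here is the skip function with the
unsigned range check applied (an untested sketch; it still trusts the
input to be well-formed UTF-8):

#include <stddef.h>

/* Skip to the Nth UTF-8 character. A continuation byte is exactly
 * 0x80..0xBF, so (unsigned char)b - 0x80 < 0x40 identifies it
 * regardless of whether plain char is signed. */
char *utf8skip(const char *s, size_t n)
{
    for (; n && *s; --n)
        while ((unsigned char)*++s - 0x80 < 0x40U)
            ; /* step over continuation bytes */
    return (char *)s;
}

The same range check, inverted, would also make the utf8len above
independent of char's signedness.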