Pádraig Brady wrote:
> There have been some interesting "counting UTF-8 strings" threads
> over at reddit lately, all referenced from this article:
> http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html

But before these techniques can be used in practice in packages such as
coreutils, two problems would have to be solved satisfactorily:

  1) "George Pollard makes the assumption that the input string is valid UTF-8".
     This assumption cannot be upheld as long as you use the same type
     ('char *') for UTF-8-encoded strings and ordinary C strings, or
     occasionally convert between one and the other.

     For example: Assume NAME is really a valid UTF-8 string.
     A program then does

       static char buf[20];
       snprintf (buf, "%s", NAME);
       utf8_strlen (buf);

     Boing! You already have a buffer overrun: snprintf can truncate a
     UTF-8 character, and the utf8_strlen function then skips over the
     terminating NUL byte, scans past the end of buf (buf[20...infinity]),
     and likely crashes. (A sketch of this failure mode follows after
     this list.)

  2) We already have the problem that we want to keep good performance when
     handling strings in the "C" locale or, more generally, in a unibyte locale.
     So we get code duplication:
       - code for unibyte locales,
       - code for multibyte locales that uses mbrtowc().
     If you want to optimize UTF-8 locales in particular, i.e. optimize away
     the function calls inherent in mbrtowc(), then we get code triplication:
       - code for unibyte locales,
       - code for UTF-8 locales,
       - code for multibyte locales other than UTF-8, using mbrtowc().
     So, code size increases, and the testing requirements increase as well
     (a sketch of the resulting dispatch follows below).
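
To make point 1 concrete, here is a minimal sketch (mine, not code from
coreutils or from the article) of the kind of naive byte-skipping counter
that such optimizations amount to. It trusts the lead byte of each
character, so on the truncated buf above the last lead byte promises
continuation bytes that snprintf never wrote, and the skip steps over the
terminating NUL:

  #include <stddef.h>

  /* Naive counter that assumes valid, complete UTF-8.  */
  static size_t
  naive_utf8_strlen (const char *s)
  {
    size_t count = 0;
    while (*s != '\0')
      {
        unsigned char c = (unsigned char) *s;
        if (c < 0x80)
          s += 1;               /* ASCII byte */
        else if (c < 0xE0)
          s += 2;               /* 2-byte lead */
        else if (c < 0xF0)
          s += 3;               /* 3-byte lead: can step over a NUL
                                   inside a truncated sequence */
        else
          s += 4;               /* 4-byte lead */
        count++;
      }
    return count;
  }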
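
And for point 2, a sketch (again mine, with a hypothetical count_utf8()
standing in for a hand-optimized UTF-8 counter) of the three-way dispatch
that results; each of the three paths has to be maintained and tested
separately:

  #include <stdlib.h>     /* MB_CUR_MAX */
  #include <string.h>     /* strcmp, memset */
  #include <wchar.h>      /* mbrtowc, mbstate_t */
  #include <langinfo.h>   /* nl_langinfo, CODESET */

  /* Hypothetical hand-optimized UTF-8 counter (the third code path).  */
  extern size_t count_utf8 (const char *s, size_t len);

  /* Generic multibyte path: one mbrtowc() call per character.  */
  static size_t
  count_mbrtowc (const char *s, size_t len)
  {
    size_t count = 0;
    mbstate_t state;
    memset (&state, 0, sizeof state);
    while (len > 0)
      {
        size_t n = mbrtowc (NULL, s, len, &state);
        if (n == (size_t) -1 || n == (size_t) -2)
          break;                /* invalid or incomplete sequence */
        if (n == 0)
          n = 1;                /* embedded NUL byte */
        s += n;
        len -= n;
        count++;
      }
    return count;
  }

  size_t
  count_characters (const char *s, size_t len)
  {
    if (MB_CUR_MAX == 1)
      return len;               /* unibyte locale: bytes == characters */
    if (strcmp (nl_langinfo (CODESET), "UTF-8") == 0)
      return count_utf8 (s, len);
    return count_mbrtowc (s, len);
  }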

Bruno
