I recall that we had about two years ago heated discussions here on
whether UTF-8 support should be implemented by

 a) hardwired mechanisms fully optimized to make good use of UTF-8's
    neat properties

 b) relying entirely on ISO C's generic multi-byte functions, to make
    sure that even stateful monsters like the ISO 2022 encodings
    are supported equally.

Unfortunately, it seems that grep has become an excellent teaching
example of how option b) can backfire with a ridiculous performance loss
in a basic text-processing tool.
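To illustrate where option b)'s overhead comes from, here is a minimal sketch (my own illustration, not grep's actual code) of counting characters through the generic ISO C multi-byte API. Every character goes through mbrtowc(), which has to consult the locale and carry shift state around, even when the input is plain ASCII:

```c
#include <string.h>
#include <wchar.h>

/* Option b) sketch: count characters via the generic ISO C
 * multi-byte API.  mbrtowc() is called once per character and must
 * honour the current locale and any encoding shift state. */
static size_t count_chars_generic(const char *s, size_t len)
{
    mbstate_t st;
    size_t n = 0, i = 0;
    memset(&st, 0, sizeof st);
    while (i < len) {
        size_t r = mbrtowc(NULL, s + i, len - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            break;              /* invalid or incomplete sequence */
        if (r == 0)
            r = 1;              /* embedded NUL counts as one byte */
        i += r;
        n++;
    }
    return n;
}
```

The function-call-per-character structure is exactly what makes this approach hard to optimize, whatever the locale.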



It's not uncommon for code to be written in assembly to gain performance on certain platforms, and usually for much smaller gains than a factor of 100. Since UTF-8 is set to become the most common encoding, writing special-case code for UTF-8 is an even better investment, because all platforms benefit from it equally.


UTF-8 is specifically designed to be as efficient as possible, so sticking to the C library's generic multi-byte API does it a disservice. (I am biased, though: I hardcode everything I write to UTF-8 and specifically avoid generic multi-byte support.)
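As a concrete example of the "neat properties" option a) can exploit: continuation bytes in UTF-8 always match the bit pattern 10xxxxxx, so counting characters needs no locale, no shift state, and no per-character function call. A minimal sketch (my own illustration):

```c
#include <stddef.h>

/* Option a) sketch: character count hardwired for UTF-8.
 * Every byte that is NOT a 10xxxxxx continuation byte starts a
 * character, so counting characters is a single tight byte loop. */
static size_t count_chars_utf8(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)   /* skip continuation bytes */
            n++;
    return n;
}
```

The same self-synchronizing property lets a search loop resynchronize after jumping into the middle of a buffer, which is what makes Boyer-Moore-style skipping feasible for UTF-8 but not for stateful encodings.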



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
