I recall that we had about two years ago heated discussions here on
whether UTF-8 support should be implemented by

 a) hardwired mechanisms fully optimized to make good use of UTF-8's
    neat properties

 b) relying entirely on ISO C's generic multi-byte functions, to make
    sure that even stateful monsters like the ISO 2022 encodings
    are supported equally.

Unfortunately, it seems that grep has become an excellent teaching
example of how option b) can backfire with a ridiculous performance loss
in a basic text-processing tool.
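To illustrate where option b)'s overhead comes from, here is a minimal sketch (my own illustration, not grep's actual code) of counting characters through the generic ISO C multi-byte API. Every character goes through mbrtowc(), which has to consult the locale and carry shift state around, even when the input is plain ASCII:

```c
#include <string.h>
#include <wchar.h>

/* Option b) sketch: count characters via the generic ISO C
 * multi-byte API.  mbrtowc() is called once per character and must
 * honour the current locale and any encoding shift state. */
static size_t count_chars_generic(const char *s, size_t len)
{
    mbstate_t st;
    size_t n = 0, i = 0;
    memset(&st, 0, sizeof st);
    while (i < len) {
        size_t r = mbrtowc(NULL, s + i, len - i, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            break;              /* invalid or incomplete sequence */
        if (r == 0)
            r = 1;              /* embedded NUL counts as one byte */
        i += r;
        n++;
    }
    return n;
}
```

The function-call-per-character structure is exactly what makes this approach hard to optimize, whatever the locale.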



It's not uncommon for code to be written in assembly to gain performance on certain platforms, and usually for much smaller gains than a factor of 100. Since UTF-8 is set to become the most common encoding, writing special-case code for UTF-8 is an even better investment, because all platforms benefit from it equally.


UTF-8 is specifically designed to be as efficient as possible, so sticking to the C library's generic multi-byte API does it a disservice. (I am biased, though: I hardcode everything I write to UTF-8 and specifically avoid generic multi-byte support.)
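As a concrete example of the "neat properties" option a) can exploit: continuation bytes in UTF-8 always match the bit pattern 10xxxxxx, so counting characters needs no locale, no shift state, and no per-character function call. A minimal sketch (my own illustration):

```c
#include <stddef.h>

/* Option a) sketch: character count hardwired for UTF-8.
 * Every byte that is NOT a 10xxxxxx continuation byte starts a
 * character, so counting characters is a single tight byte loop. */
static size_t count_chars_utf8(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80)   /* skip continuation bytes */
            n++;
    return n;
}
```

The same self-synchronizing property lets a search loop resynchronize after jumping into the middle of a buffer, which is what makes Boyer-Moore-style skipping feasible for UTF-8 but not for stateful encodings.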



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
