Re: grep is horribly slow in UTF-8 locales

Markus Kuhn Sun, 09 Nov 2003 16:51:49 -0800

Mika Fischer wrote on 2003-11-08 21:47 UTC:
> So it seems the slowdown occurs in the function mbrlen from libc.
> 
> The real problem is of course that this function is called once for
> every character of the input because grep makes a map of the input
> file containing the number of bytes of each character.
> 
> Obviously this is quite time consuming :)
[...]
> At least for UTF-8 it's easy to skip over any additional bytes a
> character might have, so that might be a workable solution.
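
As a rough illustration of the skipping Mika describes, here is a minimal
sketch in C. It assumes well-formed UTF-8 limited to four bytes per
character (as in RFC 3629); the function names utf8_seq_len and
utf8_count_chars are made up for this example and are not taken from
grep. The point is only that the lead byte alone tells you how far to
jump, with no libc call and no conversion state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 sequence that starts
   with lead byte b, or 0 if b cannot start a sequence (continuation
   byte, overlong lead 0xC0/0xC1, or a value above 0xF4). */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                       /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                       /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                       /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                       /* 11110xxx */
    return 0;
}

/* Hypothetical helper: count characters in buf[0..n) by counting every
   byte that is not a continuation byte (10xxxxxx). Assumes the input
   is well-formed UTF-8; no validation is attempted. */
size_t utf8_count_chars(const char *buf, size_t n)
{
    size_t count = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < n; i++)
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

A real implementation would also have to decide what to do with
malformed sequences, which this sketch deliberately ignores.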


I recall that about two years ago we had heated discussions here on
whether UTF-8 support should be implemented by

  a) hardwired mechanisms fully optimized to make good use of UTF-8's
     neat properties

  b) relying entirely on ISO C's generic multi-byte functions, to make
     sure that even stateful monsters like the ISO 2022 encodings
     are supported equally.

Unfortunately, it seems that grep has become an excellent teaching
example of how option b) can backfire with a ridiculous performance loss
in a basic text-processing tool.
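
To make the cost concrete, here is a hedged sketch of what an
option b) style character map might look like. This is not grep's
actual code; build_char_map and its map layout are assumptions made for
illustration. What it shows is the pattern Mika found in the profile:
one call to mbrlen(), with an mbstate_t carried along, for every single
character of the input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sketch: for each character starting at buf[i], store its
   byte length in map[i]; bytes inside a character get 0. Returns the
   number of characters found. Works in whatever locale is currently
   set, which is exactly what makes it generic and slow. */
size_t build_char_map(const char *buf, size_t n, unsigned char *map)
{
    mbstate_t st;
    size_t i = 0, chars = 0;

    assert(buf != NULL && map != NULL);
    memset(&st, 0, sizeof st);          /* initial conversion state */
    memset(map, 0, n);

    while (i < n) {
        /* One libc call per character of input. */
        size_t len = mbrlen(buf + i, n - i, &st);

        if (len == (size_t)-1 || len == (size_t)-2 || len == 0) {
            /* Invalid sequence, incomplete sequence, or NUL byte:
               treat it as a single byte and reset the state. */
            len = 1;
            memset(&st, 0, sizeof st);
        }
        map[i] = (unsigned char)len;
        i += len;
        chars++;
    }
    return chars;
}
```

The caller would be expected to call setlocale(LC_CTYPE, "") first so
that mbrlen() follows the user's locale; in the default "C" locale
plain ASCII input is simply mapped as one byte per character.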

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__



--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
