Hi!

* Markus Kuhn <[EMAIL PROTECTED]> [2003-11-07 16:33]:
> It seems grep performs about 100x worse in a UTF-8 locale than in an
> ASCII locale, even where the search string contains no regex
> metacharacters.

Same here on Debian with grep 2.5.1 and libc 2.3.2.

> There is technically no reason, why grep should have to be any slower in
> a UTF-8 locale than in a single-byte locale if the string does not even
> contain any regex meta characters at all. In that case, UTF-8 can be
> processed just like ASCII.
[...]
> Any suggestions? It would be nice not to be penalized like this by grep
> for using a UTF-8 locale by default.

Diagnosis:
I profiled grep and got the following:
LC_ALL=POSIX
----snip----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
100.00      0.52     0.52      274     1.90     1.90  bmexec
...
----snip----

LC_ALL=de_DE.UTF-8
----snip----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 76.80      1.39     1.39      274     0.01     0.01  check_multibyte_string
 22.65      1.80     0.41      274     0.00     0.00  bmexec
...
----snip----

The check_multibyte_string function:
----snip----
static char*
check_multibyte_string(char const *buf, size_t size)
{ 
  char *mb_properties = malloc(size);
  mbstate_t cur_state;
  int i;
  memset(&cur_state, 0, sizeof(mbstate_t));
  memset(mb_properties, 0, sizeof(char)*size);
  for (i = 0; i < size ;)
    { 
      size_t mbclen;
      mbclen = mbrlen(buf + i, size - i, &cur_state);

      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
        { 
          /* An invalid sequence, or a truncated multibyte character.
             We treat it as a singlebyte character.  */
          mbclen = 1;
        }
      mb_properties[i] = mbclen;
      i += mbclen;
    }

  return mb_properties;
}
----snip----

So it seems the slowdown occurs in the function mbrlen from libc.

The real problem, of course, is that mbrlen is called once for every
character of the input, because grep builds a map of the input buffer
that records how many bytes each character occupies.

Obviously this is quite time-consuming :)

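Special-casing regexps that contain no multibyte characters runs into
problems as soon as the regexp contains "." or similar constructs,
since those have to match a whole character rather than a single byte.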
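
To make the special case concrete, here is a rough sketch of what it
might look like. This is not grep's code: is_plain_literal and
search_literal are hypothetical names, and the metacharacter set is
only the basic-regex one, which is an assumption on my part. It relies
on the fact that UTF-8 lead bytes and continuation bytes never overlap,
so a valid UTF-8 search string can only match at character boundaries
in valid UTF-8 input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of grep: returns 1 if the pattern
   contains none of the characters treated here as metacharacters.
   The set below covers basic regexps only (an assumption).  */
static int
is_plain_literal (const char *pattern)
{
  return strpbrk (pattern, ".[]*^$\\") == NULL;
}

/* Hypothetical helper: plain byte-wise search, no per-character map.
   Returns a pointer to the first match, or NULL if there is none.  */
static const char *
search_literal (const char *line, const char *pattern)
{
  return strstr (line, pattern);
}
```

A caller would try is_plain_literal first and fall back to the normal
multibyte path otherwise. Case-insensitive matching and invalid input
would still need separate handling.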

A more general approach would be better IMO. Perhaps it's faster to
match the regexp byte by byte and, whenever a "." or a similar
construct is encountered, skip over the additional bytes of the
multibyte character in the input. Then one could just take the byte
representation of the regexp and try to match it.

At least for UTF-8 it's easy to skip over any additional bytes a
character might have, so that might be a workable solution.
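
A minimal sketch of that idea, assuming valid UTF-8 input and a toy
pattern language in which "." is the only metacharacter. The names
next_char, match_here and match_anywhere are made up for illustration
and have nothing to do with grep's real matcher. Continuation bytes in
UTF-8 always have the form 10xxxxxx, so skipping a character means
stepping one byte and then stepping past any bytes of that form.

```c
#include <stddef.h>

/* Advance past one UTF-8 character.  P must not point at the
   terminating NUL.  Continuation bytes look like 10xxxxxx.  */
static const char *
next_char (const char *p)
{
  p++;
  while (((unsigned char) *p & 0xC0) == 0x80)
    p++;
  return p;
}

/* Return 1 if PAT matches at the start of TEXT.  '.' matches one
   whole character; every other pattern byte must match exactly.  */
static int
match_here (const char *pat, const char *text)
{
  while (*pat != '\0')
    {
      if (*text == '\0')
        return 0;
      if (*pat == '.')
        text = next_char (text);
      else if (*pat == *text)
        text++;
      else
        return 0;
      pat++;
    }
  return 1;
}

/* Return 1 if PAT matches starting at any character of TEXT.  */
static int
match_anywhere (const char *pat, const char *text)
{
  for (;;)
    {
      if (match_here (pat, text))
        return 1;
      if (*text == '\0')
        return 0;
      text = next_char (text);
    }
}
```

With this, a pattern like "x.z" matches an input where a two-byte or
three-byte character sits between the x and the z, without ever calling
mbrlen. Stray continuation bytes in invalid input are simply absorbed
into the preceding character here, which a real implementation would
have to think about more carefully.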

Cheers,
 Mika
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
