Hi!

* Markus Kuhn <[EMAIL PROTECTED]> [2003-11-07 16:33]:
> It seems grep performs about 100x worse in a UTF-8 locale than in an
> ASCII locale, even where the search string contains no regex
> metacharacters.
Same here on Debian with grep 2.5.1 and libc 2.3.2.

> There is technically no reason why grep should have to be any slower in
> a UTF-8 locale than in a single-byte locale if the string does not even
> contain any regex metacharacters at all. In that case, UTF-8 can be
> processed just like ASCII.
[...]
> Any suggestions? It would be nice not to be penalized like this by grep
> for using a UTF-8 locale by default.

Diagnosis: I profiled grep and got the following:

LC_ALL=POSIX
----snip----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.00      0.52     0.52      274     1.90     1.90  bmexec
...
----snip----

LC_ALL=de_DE.UTF-8
----snip----
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 76.80      1.39     1.39      274     0.01     0.01  check_multibyte_string
 22.65      1.80     0.41      274     0.00     0.00  bmexec
...
----snip----

The check_multibyte_string function:

----snip----
static char *
check_multibyte_string (char const *buf, size_t size)
{
  char *mb_properties = malloc (size);
  mbstate_t cur_state;
  size_t i;

  memset (&cur_state, 0, sizeof (mbstate_t));
  memset (mb_properties, 0, sizeof (char) * size);
  for (i = 0; i < size; )
    {
      size_t mbclen;
      mbclen = mbrlen (buf + i, size - i, &cur_state);

      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
        {
          /* An invalid sequence, or a truncated multibyte character.
             We treat it as a single-byte character.  */
          mbclen = 1;
        }
      mb_properties[i] = mbclen;
      i += mbclen;
    }

  return mb_properties;
}
----snip----

So it seems the slowdown occurs in the libc function mbrlen(). The real
problem, of course, is that this function is called once for every
character of the input, because grep builds a map of the input file
containing the byte length of each character. Obviously this is quite
time-consuming. :)

A special case for non-UTF-8 regexps has problems with regexps that
contain "." and similar constructs. A more general approach would be
better IMO.
Perhaps it would be faster to match a regexp by skipping over any
additional bytes of a multibyte character when matching "." and similar
constructs. Then one could just take the byte representation of the
regexp and try to match it directly. At least for UTF-8 it is easy to
skip over any additional bytes a character might have, so that might be
a workable solution.

Cheers,
Mika
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/