RE: grep is horriby slow in UTF-8 locales
Forwarded from someone who has encountered problems posting on
linux-utf8 (who maintains the list server?):

--- Forwarded Message

I wasn't able to post to linux-utf8. Did you guys receive my message?
If I implemented a fix, would you be interested? I haven't yet...
Please let me know if you are interested in getting this situation fixed.

Brad

From: Chen, Brad [EMAIL PROTECTED]
Sent: Saturday, June 05, 2004 4:04 PM
To: '[EMAIL PROTECTED]'
Subject: RE: grep is horriby slow in UTF-8 locales

From the proposed patch:

-	  if (MB_CUR_MAX > 1 && mb_properties[beg - buf] == 0)
-	    continue;
+	  if (MB_CUR_MAX > 1)
+	    {
+	      memset(&cur_state, 0, sizeof(mbstate_t));
+	      if (mbrlen(beg + offset, buf + size - beg, &cur_state) < 0)
+	        continue; /* It is a part of multibyte character. */
+	    }

This code does not appear to be functionally equivalent to what was
there before. In the old version, mb_properties[i] would be 0 only if
the byte in question was part of a multi-byte character and not the
first byte. For the cases where the new code reaches continue, the
original code would have had mb_properties[i] == 1 and would not have
behaved the same way. Am I misreading this code?

Another thing you might want to tidy up here: mbrlen returns a size_t,
which is unsigned, so a < 0 comparison will get you into trouble on
some systems.

I confess I haven't correctness-tested either version of the code yet.
I was just looking at performance. If you have a favorite correctness
test case, please send it.

Best Wishes,
Brad Chen
Intel Corporation
SSG/Performance Tools Lab

--- End of Forwarded Message

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
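Brad's remark about the unsigned return type can be made concrete. Because size_t is unsigned, a test of the form `mbrlen(...) < 0` is never true; the error results have to be compared against `(size_t) -1` and `(size_t) -2` explicitly. The following is only a minimal sketch: the enum and both helper names are hypothetical and are not taken from grep or from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical result categories for a single mbrlen() call. */
enum mb_step { MB_STEP_OK, MB_STEP_NUL, MB_STEP_INVALID, MB_STEP_INCOMPLETE };

/* Classify an mbrlen() return value without any signed comparison. */
static enum mb_step classify_mbrlen(size_t r)
{
    if (r == (size_t) -1) return MB_STEP_INVALID;     /* invalid sequence */
    if (r == (size_t) -2) return MB_STEP_INCOMPLETE;  /* truncated character */
    if (r == 0)           return MB_STEP_NUL;         /* null character */
    return MB_STEP_OK;                                /* r bytes consumed */
}

/* Length in bytes of the character at s, or 1 if it cannot be decoded
   (the same "treat it as a single byte" fallback the grep code uses). */
static size_t char_len_or_one(const char *s, size_t n)
{
    mbstate_t st;
    assert(s != NULL && n > 0);
    memset(&st, 0, sizeof st);
    size_t r = mbrlen(s, n, &st);
    return classify_mbrlen(r) == MB_STEP_OK ? r : 1;
}
```

Calling `char_len_or_one` at every candidate offset would still pay the per-call cost of mbrlen that this thread is about; the sketch only addresses the comparison, not the performance.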
Re: grep is horriby slow in UTF-8 locales
Markus Kuhn wrote:
> b) relying entirely on ISO C's generic multi-byte functions, to make
>    sure that even stateful monsters like the ISO 2022 encodings are
>    supported equally.

mbrlen is used not because of the ISO 2022 encodings (which are not
usable as locale encodings!), but because of the non-UTF-8 multibyte
encodings: EUC-JP, Big5, GB18030 etc.

Bruno
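Bruno's point suggests that the generic path has to stay for EUC-JP, Big5, GB18030 and similar encodings, while UTF-8 could be detected once and handled separately. A minimal sketch of such a check follows, assuming the POSIX call `nl_langinfo(CODESET)` is available; the helper names `is_utf8_codeset` and `locale_is_utf8` are hypothetical and nothing here is taken from grep.

```c
#include <assert.h>
#include <langinfo.h>
#include <stddef.h>
#include <strings.h>

/* Returns nonzero if a codeset name denotes UTF-8 (hypothetical helper). */
static int is_utf8_codeset(const char *name)
{
    assert(name != NULL);
    return strcasecmp(name, "UTF-8") == 0 || strcasecmp(name, "UTF8") == 0;
}

/* Returns nonzero if the current LC_CTYPE locale uses UTF-8.
   Assumes the caller has already run setlocale(LC_ALL, ""). */
static int locale_is_utf8(void)
{
    return is_utf8_codeset(nl_langinfo(CODESET));
}
```

A program could evaluate `locale_is_utf8()` once at startup and choose between a UTF-8 fast path and the mbrlen-based code for every other multibyte locale.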
Re: grep is horriby slow in UTF-8 locales
> I recall that we had about two years ago heated discussions here on
> whether UTF-8 support should be implemented by
>
>   a) hardwired mechanisms fully optimized to make good use of UTF-8's
>      neat properties
>
>   b) relying entirely on ISO C's generic multi-byte functions, to make
>      sure that even stateful monsters like the ISO 2022 encodings are
>      supported equally.
>
> Unfortunately, it seems that grep has become an excellent teaching
> example of how option b) can backfire with a ridiculous performance
> loss in a basic text-processing tool.

It's not uncommon for code to be written in assembly to gain
performance on certain platforms, usually for much smaller gains than
a factor of 100. Since UTF-8 is set to become the most common
encoding, writing special-case code for UTF-8 is even better than
that, because all platforms benefit from it equally. UTF-8 is
specifically designed to be as efficient as possible, so sticking to
the C library's multibyte API does it a disservice.

(I am biased though, because I hardcode everything I write to UTF-8
and specifically avoid generic multibyte support.)
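As an illustration of the kind of UTF-8 property option a) relies on: the first byte of a sequence already tells how long the sequence is, so no locale-dependent library call is needed. The sketch below is hypothetical code, not anything from grep, and it does not reject every ill-formed sequence (overlong three-byte forms and surrogates, for example, pass the simple checks).

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a UTF-8 sequence, judged from its first byte alone.
   Returns 0 for a continuation byte or for a byte that can never
   start a well-formed sequence. */
static size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                   /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;                                    /* 0x80..0xC1, 0xF5..0xFF */
}

/* Length of the character starting at buf[i], falling back to 1 for
   malformed or truncated input, as grep's mbrlen()-based loop does. */
static size_t utf8_len_at(const unsigned char *buf, size_t size, size_t i)
{
    assert(buf != NULL && i < size);
    size_t n = utf8_seq_len(buf[i]);
    if (n == 0 || n > size - i) return 1;
    for (size_t k = 1; k < n; k++)
        if ((buf[i + k] & 0xC0) != 0x80) return 1;  /* not a continuation */
    return n;
}
```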
Re: grep is horriby slow in UTF-8 locales
Mika Fischer wrote on 2003-11-08 21:47 UTC:
> So it seems the slowdown occurs in the function mbrlen from libc.
> The real problem is of course that this function is called once for
> every character of the input because grep makes a map of the input
> file containing the number of bytes of each character.
>
> Obviously this is quite time consuming :)
> [...]
> At least for UTF-8 it's easy to skip over any additional bytes a
> character might have, so that might be a workable solution.

I recall that we had heated discussions here about two years ago on
whether UTF-8 support should be implemented by

  a) hardwired mechanisms fully optimized to make good use of UTF-8's
     neat properties

  b) relying entirely on ISO C's generic multi-byte functions, to make
     sure that even stateful monsters like the ISO 2022 encodings are
     supported equally.

Unfortunately, it seems that grep has become an excellent teaching
example of how option b) can backfire with a ridiculous performance
loss in a basic text-processing tool.

Markus

--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/
Re: grep is horriby slow in UTF-8 locales
On Fri, Nov 07, 2003 at 12:52:44PM +, Markus Kuhn wrote:
> $ grep --version
> grep (GNU grep) 2.5.1
> $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
> Command exited with non-zero status 1
> 6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (157major+34minor)pagefaults 0swaps
> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (125major+24minor)pagefaults 0swaps

FYI:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=206470
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=181378

I've noticed this, too. I often use LANG=C for grepping because of it.
Someone mentioned --with-included-regex, but that's not good enough
(a 10% gain for me).

--
Glenn Maynard
Re: grep is horriby slow in UTF-8 locales
Markus Kuhn [EMAIL PROTECTED] writes:
> $ grep --version
> grep (GNU grep) 2.5.1

This doesn't happen with:

  $ grep --version
  grep (GNU grep) 2.4.2
  $ LC_ALL=POSIX time grep XYZ test.txt
  Command exited with non-zero status 1
  0.03user 0.07system 0:00.36elapsed 27%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (118major+25minor)pagefaults 0swaps
  $ LC_ALL=sr_CS.UTF-8 time grep XYZ test.txt
  Command exited with non-zero status 1
  0.06user 0.05system 0:00.10elapsed 105%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (143major+50minor)pagefaults 0swaps
  $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
  Command exited with non-zero status 1
  0.06user 0.04system 0:00.15elapsed 64%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (128major+48minor)pagefaults 0swaps
  $ LC_ALL=POSIX time grep XYZ test.txt
  Command exited with non-zero status 1
  0.04user 0.06system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (118major+25minor)pagefaults 0swaps

The last example shows that CPU usage is not really any kind of rule
to base conclusions on (sr_CS.UTF-8 is my everyday locale, and I would
really notice if grep had any problems with it).

test.txt was produced with:

  for i in 1 2 3 4 5 6 7 8 9 0; do cat UnicodeData.txt >> test.txt; done

I can get a newer grep today, if you think I may see different
results with it.

Cheers,
Danilo
Re: grep is horriby slow in UTF-8 locales
Rob Park wrote on 2003-11-08 00:49 UTC:
> grep is slower on my system, but it doesn't appear to be as bad as on
> your system.

Your results show that grep in UTF-8 mode is also about 100x slower
than in single-byte mode, just like on my system (300 MHz P3). You
have just used a faster CPU.

Markus

--
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/
Re: grep is horriby slow in UTF-8 locales
Markus Kuhn wrote:
> Your results show that grep in UTF-8 mode is also about 100x slower
> than in single-byte mode, just like on my system (300 MHz P3). You
> have just used a faster CPU.

D'oh :)
Re: grep is horriby slow in UTF-8 locales
Hi Markus,

Markus Kuhn [EMAIL PROTECTED] writes:
> Rob Park wrote on 2003-11-08 00:49 UTC:
>> grep is slower on my system, but it doesn't appear to be as bad as
>> on your system.
>
> Your results show that grep in UTF-8 mode is also about 100x slower
> than in single-byte mode, just like on my system (300 MHz P3). You
> have just used a faster CPU.

I tried this with grep 2.5 (the latest available from
ftp.gnu.org/gnu/grep/, because of the crack) and it still shows decent
results on my home Slackware GNU/Linux 8.0 (GNU libc 2.2.3, I think):

  $ LC_ALL=POSIX time grep2.5 XYZ test.txt
  Command exited with non-zero status 1
  0.03user 0.06system 0:00.14elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (130major+22minor)pagefaults 0swaps
  $ LC_ALL=en_GB.UTF-8 time grep2.5 XYZ test.txt
  Command exited with non-zero status 1
  0.05user 0.07system 0:00.12elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (140major+45minor)pagefaults 0swaps

I'm using a 700 MHz Celeron. I cannot build grep from CVS right now,
but I still suspect this is not a grep problem.

Cheers,
Danilo
Re: grep is horriby slow in UTF-8 locales
On Fri, Nov 07, 2003 at 04:49:58PM +0100, Danilo Segan wrote:
> This doesn't happen with:
>
> $ grep --version
> grep (GNU grep) 2.4.2

This was probably before full multibyte support was added to grep; the
issue here only happens in multibyte encodings. (My grep is slow in
en_US.UTF-8, and fast in en_US.ISO-8859-1.) Try:

  # echo tést | grep 't.st'
  tést
  # echo tést | grep 't[aé]st'
  tést

> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.04user 0.06system 0:00.10elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (118major+25minor)pagefaults 0swaps
>
> The last example shows that CPU usage is not really any kind of rule
> to base conclusions on (sr_CS.UTF-8 is my everyday locale, and I
> would really notice if grep had any problems with it).

The field you should be reading is "user". "CPU" is roughly
(user+system)/elapsed, and isn't very relevant here.

--
Glenn Maynard
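Glenn's formula can be checked against the numbers already posted in this thread. The helper below is a hypothetical illustration of that arithmetic only; it is not how time(1) is implemented.

```c
#include <assert.h>

/* CPU percentage roughly as described above: CPU time (user plus
   system) divided by wall-clock time, expressed in percent. */
static double cpu_percent(double user, double sys, double elapsed)
{
    assert(elapsed > 0.0);
    return 100.0 * (user + sys) / elapsed;
}
```

For Markus's UTF-8 run, `cpu_percent(6.83, 0.07, 6.93)` comes to about 99.6, and for his POSIX run, `cpu_percent(0.07, 0.09, 0.16)` comes to 100. Both runs kept the CPU busy nearly the whole time, so the percentage says nothing about the roughly 100x difference in user time.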
Re: grep is horriby slow in UTF-8 locales
Hi!

* Markus Kuhn [EMAIL PROTECTED] [2003-11-07 16:33]:
> It seems grep performs about 100x worse in a UTF-8 locale than in an
> ASCII locale, even where the search string contains no regex
> metacharacters.

Same here on Debian with grep 2.5.1 and libc 2.3.2.

> There is technically no reason why grep should have to be any slower
> in a UTF-8 locale than in a single-byte locale if the string does not
> even contain any regex metacharacters at all. In that case, UTF-8 can
> be processed just like ASCII.
> [...]
> Any suggestions? It would be nice not to be penalized like this by
> grep for using a UTF-8 locale by default.

Diagnosis:
I profiled grep and got the following:

LC_ALL=POSIX
--- snip ---
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
100.00      0.52     0.52      274     1.90     1.90  bmexec
...
--- snip ---

LC_ALL=de_DE.UTF-8
--- snip ---
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 76.80      1.39     1.39      274     0.01     0.01  check_multibyte_string
 22.65      1.80     0.41      274     0.00     0.00  bmexec
...
--- snip ---

The check_multibyte_string function:
--- snip ---
static char*
check_multibyte_string(char const *buf, size_t size)
{
  char *mb_properties = malloc(size);
  mbstate_t cur_state;
  int i;
  memset(&cur_state, 0, sizeof(mbstate_t));
  memset(mb_properties, 0, sizeof(char)*size);
  for (i = 0; i < size ;)
    {
      size_t mbclen;
      mbclen = mbrlen(buf + i, size - i, &cur_state);

      if (mbclen == (size_t) -1 || mbclen == (size_t) -2 || mbclen == 0)
        {
          /* An invalid sequence, or a truncated multibyte character.
             We treat it as a singlebyte character.  */
          mbclen = 1;
        }
      mb_properties[i] = mbclen;
      i += mbclen;
    }

  return mb_properties;
}
--- snip ---

So it seems the slowdown occurs in the function mbrlen from libc.
The real problem is of course that this function is called once for
every character of the input, because grep makes a map of the input
file containing the number of bytes of each character.

Obviously this is quite time consuming :)

A special case for non-UTF8 regexps has problems with regexps that
contain "." and similar things. A more general approach would be
better IMO.

Perhaps it's faster to match a regexp by skipping over any additional
bytes of a multibyte character in the case of a "." or similar things.
Then one could just take the byte representation of the regexp and try
to match it.

At least for UTF-8 it's easy to skip over any additional bytes a
character might have, so that might be a workable solution.

Cheers,
Mika
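The skipping Mika describes can be sketched as follows. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so both "is this offset the start of a character?" (the question the mb_properties map answers) and "where does the next character begin?" (what matching "." needs) can be decided by looking at the bytes directly. This is a hypothetical sketch that assumes well-formed UTF-8 input; unlike the grep code above, it does not treat invalid sequences as single-byte characters.

```c
#include <assert.h>
#include <stddef.h>

/* True if byte b is a UTF-8 continuation byte (10xxxxxx). */
static int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* True if offset pos is the first byte of a character, i.e. the
   case in which grep's map would hold a nonzero entry. */
static int utf8_is_char_start(const unsigned char *buf, size_t size, size_t pos)
{
    assert(buf != NULL && pos < size);
    return !utf8_is_continuation(buf[pos]);
}

/* Offset of the character following the one at pos: step one byte,
   then skip any continuation bytes. */
static size_t utf8_next(const unsigned char *buf, size_t size, size_t pos)
{
    assert(buf != NULL && pos < size);
    pos++;
    while (pos < size && utf8_is_continuation(buf[pos]))
        pos++;
    return pos;
}
```

With checks like these, a candidate match offset could be validated on demand instead of building a per-byte map of the whole buffer up front.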
Re: grep is horriby slow in UTF-8 locales
[EMAIL PROTECTED] (Danilo Šegan) writes:
> $ LC_ALL=en_GB.UTF-8 time grep2.5 XYZ test.txt
> Command exited with non-zero status 1
> 0.05user 0.07system 0:00.12elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (140major+45minor)pagefaults 0swaps

Whoops, the above is total crap. I didn't have the en_GB.UTF-8 locale
installed at all, so when I created it using:

  # localedef -f UTF-8 -i en_GB en_GB.UTF-8

I got the following results, which match your experience:

  $ LC_ALL=en_GB.UTF-8 time grep2.5 XYZ test.txt
  Command exited with non-zero status 1
  2.85user 0.13system 0:03.18elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (161major+55minor)pagefaults 0swaps

Yet grep 2.4 doesn't exhibit these problems at all (guess I won't be
fast at upgrading :).

Cheers,
Danilo
grep is horriby slow in UTF-8 locales
On Red Hat 9:

  $ grep --version
  grep (GNU grep) 2.5.1
  $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
  Command exited with non-zero status 1
  6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (157major+34minor)pagefaults 0swaps
  $ LC_ALL=POSIX time grep XYZ test.txt
  Command exited with non-zero status 1
  0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (125major+24minor)pagefaults 0swaps

where test.txt is just

  http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

repeated 10 times.

It seems grep performs about 100x worse in a UTF-8 locale than in an
ASCII locale, even where the search string contains no regex
metacharacters. And fgrep is no better.

There is technically no reason why grep should have to be any slower
in a UTF-8 locale than in a single-byte locale if the string does not
even contain any regex metacharacters at all. In that case, UTF-8 can
be processed just like ASCII.

In UTF-8 mode, grep is also much slower than the equivalent Perl:

  $ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
  1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (339major+45minor)pagefaults 0swaps
  $ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
  1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (322major+45minor)pagefaults 0swaps

Any suggestions? It would be nice not to be penalized like this by
grep for using a UTF-8 locale by default.

Markus

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain
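The claim that a metacharacter-free pattern can be searched in UTF-8 "just like ASCII" rests on a property of the encoding: the first byte of any character is never a continuation byte, so a byte-for-byte match of one well-formed UTF-8 string inside another always begins on a character boundary. The sketch below illustrates that with a deliberately naive byte search; `find_bytes` is a hypothetical helper and is not grep's Boyer-Moore search (bmexec).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find needle in hay by plain byte comparison, ignoring the locale.
   Returns the byte offset of the first match, or (size_t) -1 if there
   is none. Both buffers are assumed to hold well-formed UTF-8. */
static size_t find_bytes(const char *hay, size_t hay_len,
                         const char *needle, size_t needle_len)
{
    assert(hay != NULL && needle != NULL);
    if (needle_len == 0) return 0;
    if (needle_len > hay_len) return (size_t) -1;
    for (size_t i = 0; i + needle_len <= hay_len; i++)
        if (memcmp(hay + i, needle, needle_len) == 0)
            return i;
    return (size_t) -1;
}
```

Searching the UTF-8 bytes of "naïve café" for the two bytes of "é" (0xC3 0xA9) finds them only at the "é", not inside "ï" (0xC3 0xAF), even though both characters share the same first byte.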
Re: grep is horriby slow in UTF-8 locales
Markus Kuhn wrote:
> On Red Hat 9:
>
> $ grep --version
> grep (GNU grep) 2.5.1
> $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
> Command exited with non-zero status 1
> 6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (157major+34minor)pagefaults 0swaps
> $ LC_ALL=POSIX time grep XYZ test.txt
> Command exited with non-zero status 1
> 0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (125major+24minor)pagefaults 0swaps
>
> where test.txt is just
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> repeated 10 times.

Wow, I dunno what's going on here. Here are the results on my system
(also Red Hat 9):

  $ grep --version
  grep (GNU grep) 2.5.1
  $ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
  Command exited with non-zero status 1
  1.14user 0.04system 0:01.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (156major+32minor)pagefaults 0swaps
  $ LC_ALL=POSIX time grep XYZ test.txt
  Command exited with non-zero status 1
  0.01user 0.03system 0:00.03elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (125major+25minor)pagefaults 0swaps

> It seems grep performs about 100x worse in a UTF-8 locale than in an
> ASCII locale, even where the search string contains no regex
> metacharacters.

grep is slower on my system, but it doesn't appear to be as bad as on
your system.

> In UTF-8 mode, grep is also much slower than the equivalent Perl:
>
> $ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
> 1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (339major+45minor)pagefaults 0swaps
> $ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
> 1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (322major+45minor)pagefaults 0swaps

  $ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
  0.30user 0.01system 0:00.33elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (341major+45minor)pagefaults 0swaps
  $ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
  0.19user 0.06system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
  0inputs+0outputs (325major+44minor)pagefaults 0swaps

> Any suggestions? It would be nice not to be penalized like this by
> grep for using a UTF-8 locale by default.

Sorry buddy, I have no idea :(