Markus Kuhn wrote:
On Red Hat 9:

$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
6.83user 0.07system 0:06.93elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (157major+34minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.07user 0.09system 0:00.16elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+24minor)pagefaults 0swaps

where test.txt is just http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
repeated 10 times.
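
For anyone who wants to reproduce the test, the file described above could be built roughly like this. It is only a sketch: it assumes UnicodeData.txt has already been downloaded into the current directory, and the helper name make_test_file is made up for illustration, not something from the original post.

```shell
# Hypothetical helper: write N back-to-back copies of SRC into DST.
# N defaults to 10, matching the "repeated 10 times" description.
make_test_file() {
    src=$1
    dst=$2
    n=${3:-10}
    : > "$dst"                  # start from an empty output file
    i=0
    while [ "$i" -lt "$n" ]; do
        cat "$src" >> "$dst"    # append one more copy of the source
        i=$((i + 1))
    done
}

# Assumption: UnicodeData.txt was fetched beforehand from the URL above.
if [ -f UnicodeData.txt ]; then
    make_test_file UnicodeData.txt test.txt 10
fi
```

The resulting test.txt should have exactly ten times as many lines as UnicodeData.txt, which is a quick sanity check before timing anything.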

Wow, I dunno what's going on here. Here are the results on my system (also Red Hat 9):


$ grep --version
grep (GNU grep) 2.5.1
$ LC_ALL=en_GB.UTF-8 time grep XYZ test.txt
Command exited with non-zero status 1
1.14user 0.04system 0:01.19elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (156major+32minor)pagefaults 0swaps
$ LC_ALL=POSIX time grep XYZ test.txt
Command exited with non-zero status 1
0.01user 0.03system 0:00.03elapsed 102%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (125major+25minor)pagefaults 0swaps

It seems grep performs about 100x worse in a UTF-8 locale than in an
ASCII locale, even where the search string contains no regex
metacharacters.

grep is also slower in the UTF-8 locale on my system, but it doesn't appear to be as bad as on your system.


In UTF-8 mode, grep is also much slower than the equivalent Perl:

$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
1.49user 0.05system 0:01.55elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (339major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
1.17user 0.09system 0:01.28elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (322major+45minor)pagefaults 0swaps

$ LC_ALL=en_GB.UTF-8 time perl -ne '/XYZ/ && print' test.txt
0.30user 0.01system 0:00.33elapsed 92%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (341major+45minor)pagefaults 0swaps
$ LC_ALL=POSIX time perl -ne '/XYZ/ && print' test.txt
0.19user 0.06system 0:00.24elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (325major+44minor)pagefaults 0swaps

Any suggestions? It would be nice not to be penalized like this by grep
for using a UTF-8 locale by default.

Sorry buddy, I have no idea :(


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/


