Hi all,

While working on a new portmaster version, I found, much to my surprise,
that bsdgrep is much faster in a UTF-8 locale than in the C locale.

I have uploaded a small shell script with test data that can be fetched
from:

        https://people.freebsd.org/~se/grep-test.txz

The script uses "grep -v -f patternfile datafile" to select from the data
file the lines that are not matched by any pattern in patternfile:

#-------------------------------------------------------------------
#!/bin/sh

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8

export LANG LC_CTYPE

time grep -v -f grep-test-pattern grep-test-data

LANG=C
LC_CTYPE=C
#unset LANG LC_CTYPE # is an alternative leading to the same result ...

time grep -v -f grep-test-pattern grep-test-data
#-------------------------------------------------------------------
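
For anyone reproducing this: LC_ALL, if set in the calling environment,
overrides both LANG and LC_CTYPE, so the script above would then not test
what it claims to. A minimal sketch of a variant that sets the locale per
invocation (the function name grep_in_locale is hypothetical, and I am
assuming LC_ALL might be set where you run it):

```shell
#!/bin/sh

# Hypothetical helper: run the grep from the test script under one locale.
# LC_ALL overrides LANG and LC_CTYPE, and setting it on the command line
# keeps the change local to this single grep invocation.
grep_in_locale() {
    locale_name=$1      # e.g. en_US.UTF-8 or C
    pattern_file=$2     # file with one pattern per line
    data_file=$3        # file to be filtered
    LC_ALL="$locale_name" grep -v -f "$pattern_file" "$data_file"
}
```

Called as "grep_in_locale C grep-test-pattern grep-test-data", this should
behave like the second invocation in the script, regardless of what the
caller has exported.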

The first "grep" needs 3.5 seconds to finish on my system, but the second
one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
bother to check whether it finishes at all).
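
Since the slow case may not terminate in reasonable time, it can help to
put an upper bound on it. A rough sketch, assuming timeout(1) is available
(the function name run_with_limit and the limit you pass are my own
choices, not part of the test script):

```shell
#!/bin/sh

# Hypothetical helper: run a command, but give up after a number of seconds.
# timeout(1) exits with status 124 when the time limit was reached.
run_with_limit() {
    limit_seconds=$1
    shift
    timeout "$limit_seconds" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit_seconds}s" >&2
    fi
    return "$status"
}
```

For example, "run_with_limit 60 env LC_ALL=C grep -v -f grep-test-pattern
grep-test-data" would stop the C locale run after one minute and report
that on stderr.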

Is this a bug in grep?

Maybe there is something odd in the data file (loading the patterns is not
slower with LC_CTYPE=C; it takes 0.8 seconds on my system), but this is a
problem that was observed with "real" data, not a specifically constructed
worst case.
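
One quick way to look for something "odd" is to count the bytes outside
the 7-bit ASCII range in each file. This is only a sketch of one possible
check (the function name is hypothetical), and a non-zero count would not
by itself explain the slowdown:

```shell
#!/bin/sh

# Hypothetical helper: print the number of non-ASCII bytes in a file.
# In the C locale tr works on single bytes, so deleting the octal range
# 000-177 leaves only bytes with the high bit set; wc -c counts them.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running "count_non_ascii_bytes grep-test-data" and the same for
grep-test-pattern shows whether either file contains anything but plain
ASCII.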

Any ideas what's causing this behavior?

I'm currently setting the UTF-8 locale as in the first invocation above
to make grep run in reasonable time, but I'd expect it to be faster in
the C locale ...

Regards, STefan
_______________________________________________
freebsd-current@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"
