On Tue, Nov 26, 2019 at 2:38 PM Norihiro Tanaka <[email protected]> wrote: > On Sun, 13 Jan 2019 08:45:47 +0900 > Norihiro Tanaka <[email protected]> wrote: > > grep uses KWset matcher for multiple word matching. It is very slow when > > most of the parts matched to a pattern are not words. So, if a part firstly > > matched to pattern is not a word, use the grep matcher to match for its > > line. > > > > By the way, if START_PTR is set, grep matcher uses regex matcher which is > > very slow to match words. Therefore, we use grep matcher when only > > START_PTR > > is not set. > > > > Example, although it is a very extreme case... > > > > $ cat >pat <<EOF > > 0 > > 00 0 > > 00 00 0 > > 00 00 00 0 > > 00 00 00 00 0 > > 00 00 00 00 00 0 > > 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > > EOF > > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp > > > > $ env LC_ALL=C time -p src/grep -wf pat inp > > real 5.75 > > user 5.67 > > sys 0.02 > > > > Retry after applied the patch. > > > > $ env LC_ALL=C time -p src/grep -wf pat inp > > real 0.32 > > user 0.31 > > sys 0.00 > > > > Thanks, > > Norihiro > > I fix previous patch. > > This change should not be applied for multibyte locales, as grep matcher > uses regex with pattern with invert charclass in word matching in > multibyte locales and it is very slow.
Good timing, in both senses :-) Thank you.
This confirms a ~15x speedup for that test case:
$ t="$(yes 00 |head -14|paste -sd' ') 0"; while :; do echo $t; test
"$t" = 0 && break; t=${t#00 }; done > pat;cat pat
00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 0
00 00 00 00 00 00 0
00 00 00 00 00 0
00 00 00 00 0
00 00 00 0
00 00 0
00 0
0
$ env LC_ALL=C time -p grep -wf pat inp
...
before: real 4.75
after: real 0.30
Similarly, retrying my counterexample, I now see a significant
improvement (needed a larger input file to demonstrate):
$ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -5000000 > inp
Best of ten, running this, before vs after:
$ env LC_ALL=C time -p grep -w 000 inp
0.62s vs 0.43s
I have revised the wording in the commit message. Please ACK that. The
only other change was in formatting, to split a line that went one
byte past the 80-column maximum.
l'd like to push the attached some time tomorrow.
kws.diff
Description: Binary data
