Package: grep
Tags: patch

When a pattern contains ANYCHAR in a non-UTF-8 locale, grep falls back from the DFA engine to the regex engine. However, as far as I have tested, even with Patch#17025 applied, the regex engine is slower than the DFA engine for ANYCHAR in non-UTF-8 locales.

This patch switches ANYCHAR matching from regex to the DFA in non-UTF-8 locales.

Create the text file:

$ yes abcd.abc | head -1000000 > m

Before applying the patch:

$ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
real 1.99
user 1.75
sys 0.28

After applying the patch:

$ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
real 1.21
user 0.71
sys 0.46

Norihiro

From d69cf4d289034a21067a6e0a7495921df0a2aac9 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <[email protected]>
Date: Mon, 17 Mar 2014 20:41:25 +0900
Subject: [PATCH] grep: prefer regex to DFA for ANYCHAR in multi-byte locales

* src/dfa.c (dfaexec): prefer regex to DFA for ANYCHAR in multi-byte
locales.
---
 src/dfa.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index 5e60cd5..1308c6b 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -3420,11 +3420,18 @@ dfaexec (struct dfa *d, char const *begin, char *end,
                  equivalence classes. */
               if (backref)
                 {
-                  *backref = 1;
-                  free (mblen_buf);
-                  free (inputwcs);
-                  *end = saved_end;
-                  return (char *) p;
+                  int i;
+                  for (i = 0; i < d->states[s].mbps.nelem; ++i)
+                    if (d->tokens[d->states[s].mbps.elems[i].index] == MBCSET)
+                      break;
+                  if (i < d->states[s].mbps.nelem)
+                    {
+                      *backref = 1;
+                      free (mblen_buf);
+                      free (inputwcs);
+                      *end = saved_end;
+                      return (char *) p;
+                    }
                 }
 
               /* Can match with a multibyte character (and multi character
-- 
1.9.0
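
For readers who want to see the new test in isolation, here is a minimal sketch of the decision the hunk adds to dfaexec: give up to regex only when one of the state's multibyte positions is an MBCSET. The types, the OTHER token and the helper name state_needs_regex are simplified, hypothetical stand-ins for illustration only; the real structures in src/dfa.c carry many more fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the dfa.c structures the patch
   touches.  ANYCHAR and MBCSET mirror token names used in dfa.c; OTHER
   is a placeholder for every other token kind.  */
enum token { ANYCHAR, MBCSET, OTHER };

struct position { size_t index; };   /* index into the token array */

struct position_set
{
  struct position *elems;            /* positions that can match a multibyte char */
  size_t nelem;                      /* number of entries in elems */
};

/* Return true when any multibyte position of the state refers to an
   MBCSET token, i.e. the case in which the patched code still sets
   *backref and returns.  A state whose multibyte positions are all
   ANYCHAR yields false, so matching continues in the DFA.  */
static bool
state_needs_regex (enum token const *tokens, struct position_set const *mbps)
{
  for (size_t i = 0; i < mbps->nelem; ++i)
    if (tokens[mbps->elems[i].index] == MBCSET)
      return true;
  return false;
}
```

With this helper the patched branch would read roughly as "if (backref && state_needs_regex (d->tokens, &d->states[s].mbps))", assuming the field layout shown in the diff.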
