So, here are my patches for 2.6. While they are big, they are worth considering because they fix three serious bugs. They also speed up DFA, so that it is faster than glibc regex on the cases that it can handle. They also make grep faster than with the egf-speedup patch on all cases where I knew that patch would help. And all this while removing more code than they add (if you don't count test cases).

Patches 1 to 9 are simple cleanups, including enabling more syntax checks and getting rid of DFA's own x*alloc functions (which is the way to go if, long term, DFA moves into gnulib). After these cleanups, dfa.c is suitable for merging into gawk.

Patch 10 adds more test cases for UTF-8 (and multibyte in general) to make sure nothing breaks.

Patch 11 is the patch I already posted regarding the handling of case folding for MB_CUR_MAX > 1. Using it for gawk would break IGNORECASE. I would still like to include this patch because it fixes two very bad bugs with -i: a regex like foo\W is broken with -i, and -o/--color are broken with -i too.

One solution to make this patch palatable to gawk would be to add more "feature bits" to dfasyntax that specify whether dfaexec can make some assumptions about the input. Alternatively, dfa.c could use the newly-added newline-as-sentinel behavior for DFA to perform the conversion to lowercase and to wide characters on a per-line basis. The former is simpler to do; the latter requires care in order to avoid performing the conversion more than once, which would slow down grep.

Patches 12 and 13 speed up handling of simple bracket expressions under multibyte character sets. The former also applies to non-UTF-8 charsets; the latter is only for UTF-8 and provides the bulk of the speedup.

Patches 14 and 15 implement one optimization from glibc regex, that is to match UTF-8 strings using the fast single-byte algorithms whenever applicable.

Patch 16 removes one of the two sources of inefficiency of -i with multibyte character sets, i.e. the multibyte_props array. The array is useless since we never scan the buffer backwards. The other source of inefficiency is the conversion to lowercase, which is very much related to patch 11.

As a workaround, patch 17 matches line-by-line in the current worst case. Part of this inefficiency is due to dfa (see comments above), but since that is not the whole story I opted for this workaround.

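To illustrate why the optimization in patches 14 and 15 is safe: UTF-8 is self-synchronizing, so a plain byte-wise search for a valid UTF-8 string can only match at a character boundary. The following is a minimal sketch of that property, not code from dfa.c; the function names are made up for this example, and it assumes both strings are valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Return nonzero if BYTE is a UTF-8 continuation byte (10xxxxxx).  */
static int
is_utf8_continuation (unsigned char byte)
{
  return (byte & 0xC0) == 0x80;
}

/* Hypothetical helper: find NEEDLE in HAYSTACK with a plain byte-wise
   search, ignoring character boundaries entirely.  Both strings are
   assumed to be valid UTF-8.  Return the byte offset of the first
   match, or -1 if there is none.  */
static long
utf8_find_bytewise (const char *haystack, const char *needle)
{
  const char *hit = strstr (haystack, needle);
  return hit ? (long) (hit - haystack) : -1;
}

/* Hypothetical helper: return nonzero if byte offset OFF in TEXT is
   the start of a character, i.e. not a continuation byte.  */
static int
utf8_at_char_boundary (const char *text, size_t off)
{
  return !is_utf8_continuation ((unsigned char) text[off]);
}
```

For valid input, any offset returned by utf8_find_bytewise also satisfies utf8_at_char_boundary, because the needle begins with a lead byte (or an ASCII byte) and lead bytes never look like continuation bytes. This only covers literal strings; constructs that must consume one whole character of unknown length still need the multibyte path.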
Paolo Bonzini (17):
  kwset/system: remove ptr_t
  grep: cleanup one const cast
  dfa: get rid of x*alloc
  dfa: remove CRANGE dead code
  dfa, grep: cleanup if-before-free and cast-of-argument-to-free
  grep: fix error-message-uppercase
  syntax-check: enable makefile-TAB-only-indentation
  syntax-check: enable m4-quote-check
  syntax-check: enable space-tab
  tests: add more UTF-8 test cases
  dfa: rewrite handling of multibyte case folding
  dfa: speed up handling of brackets
  dfa: optimize simple character sets under UTF-8 charsets
  dfa: cache MB_CUR_MAX for dfaexec
  dfa: run simple UTF-8 regexps as a single-byte character set
  grep: remove check_multibyte_string, fix non-UTF8 missed match
  grep: match multibyte charsets line-by-line when using -i

 .x-sc_avoid_if_before_free         |    2 -
 .x-sc_cast_of_alloca_return_value  |    1 -
 .x-sc_cast_of_x_alloc_return_value |    1 -
 .x-sc_space_tab                    |    1 +
 Makefile.am                        |    2 +-
 cfg.mk                             |    5 -
 configure.ac                       |    2 +-
 src/dfa.c                          | 1052 +++++++++++++++++-------------------
 src/dfa.h                          |   24 +-
 src/grep.c                         |  108 ++---
 src/kwset.h                        |    3 +-
 src/search.c                       |  262 +++++----
 src/system.h                       |   12 -
 tests/Makefile.am                  |    6 +-
 tests/case-fold-backslash-w        |   14 +
 tests/euc-mb                       |   23 +
 tests/foad1.sh                     |   10 +-
 tests/spencer1-locale.awk          |   30 +
 tests/spencer1-locale.sh           |   20 +
 tests/status.sh                    |    8 +-
 20 files changed, 803 insertions(+), 783 deletions(-)
 delete mode 100644 .x-sc_avoid_if_before_free
 delete mode 100644 .x-sc_cast_of_alloca_return_value
 delete mode 100644 .x-sc_cast_of_x_alloc_return_value
 create mode 100644 .x-sc_space_tab
 create mode 100755 tests/case-fold-backslash-w
 create mode 100644 tests/euc-mb
 create mode 100644 tests/spencer1-locale.awk
 create mode 100755 tests/spencer1-locale.sh
