[PATCH 00/16] my last hefty patch drop

Paolo Bonzini Fri, 12 Mar 2010 09:49:34 -0800

So, here are my patches for 2.6.  While they are big, they are worth
considering because they fix three serious bugs.  They also speed up
DFA, so that it is faster than glibc regex in the cases that it can handle.
They also make grep faster than with the egf-speedup patch in all cases
where I knew that patch would help.  And all this while removing more
code than they add (if you don't count test cases).


Patches 1 to 9 are simple cleanups, including enabling more syntax checks
and getting rid of DFA's own x*alloc functions (which is the way to go
if, long term, DFA moves into gnulib).  The dfa.c after these patches is
suitable for merging into gawk.

Patch 10 adds more UTF-8 test cases (and multibyte in general) to make
sure nothing breaks.

Patch 11 is the patch I already posted regarding the handling of case
folding for MB_CUR_MAX > 1.  Using it for gawk would break IGNORECASE.
I would still like to include this patch because it fixes two very bad
bugs with -i: a regex like foo\W is broken with -i, and -o/--color are
broken with -i too.

One solution to make this patch palatable to gawk would be to add
more "feature bits" to dfasyntax, that specify whether dfaexec can
make some assumptions about the input.  Alternatively, dfa.c could
use the newly-added newline-as-sentinel behavior for DFA to perform
the conversion to lowercase and to wide character on a per-line basis.
The former is simpler to do; the latter requires care in order to avoid
performing the conversion twice or more, which would slow down grep.

Patches 12 and 13 speed up handling of simple bracket expressions under
multi-byte character sets.  The former also applies to non-UTF-8, the
latter is only for UTF-8 and provides the bulk of the speedup.
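
As an illustration only (a sketch, not code taken from patches 12 or 13):
one reason simple bracket expressions can be cheap under UTF-8 is that
bytes below 0x80 never occur inside a multibyte sequence.  A bracket
expression whose members are all ASCII, such as [a-z], can therefore be
tested one byte at a time against a 256-bit set.  The byte_set type and
its functions below are hypothetical names, not identifiers from dfa.c.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical 256-bit set of byte values; one bit per possible byte.  */
typedef struct
{
  unsigned char bits[256 / CHAR_BIT];
} byte_set;

/* Empty the set.  */
static void
byte_set_clear (byte_set *s)
{
  memset (s->bits, 0, sizeof s->bits);
}

/* Add the ASCII range LO..HI (inclusive) to the set.  Only ASCII
   members are allowed, because only they are single bytes in UTF-8.  */
static void
byte_set_add_range (byte_set *s, unsigned char lo, unsigned char hi)
{
  unsigned int c;
  assert (lo <= hi && hi < 0x80);
  for (c = lo; c <= hi; c++)
    s->bits[c / CHAR_BIT] |= 1u << (c % CHAR_BIT);
}

/* Return nonzero if byte C is in the set.  Lead and continuation
   bytes of multibyte characters (>= 0x80) are never members.  */
static int
byte_set_contains (const byte_set *s, unsigned char c)
{
  return (s->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}
```

With a set built for [a-z], the two bytes of a UTF-8 encoded "é" (0xC3
0xA9) are both rejected without decoding the character.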

Patches 14 and 15 implement one optimization from glibc regex, that is
to match UTF-8 strings using the fast single-byte algorithms whenever
applicable.
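
The property that makes this kind of optimization safe can be shown with
a small sketch (an illustration, not the code in patches 14 and 15).
UTF-8 is self-synchronizing: lead bytes and continuation bytes use
disjoint value ranges, so a byte-wise search for a valid UTF-8 literal in
valid UTF-8 text can only match on a character boundary.  The function
names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return nonzero if C is a UTF-8 continuation
   byte, i.e. has the bit pattern 10xxxxxx.  */
static int
utf8_is_continuation (unsigned char c)
{
  return (c & 0xC0) == 0x80;
}

/* Hypothetical helper: find the valid UTF-8 literal NEEDLE in the
   valid UTF-8 text HAYSTACK with a plain single-byte search.  Because
   NEEDLE starts with a non-continuation byte, any match starts at the
   beginning of a character in HAYSTACK.  Returns NULL if not found.  */
static const char *
utf8_find_literal (const char *haystack, const char *needle)
{
  assert (haystack != NULL && needle != NULL);
  return strstr (haystack, needle);
}
```

The guarantee only holds when both strings are valid UTF-8; it does not
carry over to encodings such as EUC-JP or Shift_JIS, where a trail byte
can have the same value as a lead byte.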

Patch 16 removes one of the two sources of inefficiency of -i with
multibyte character sets, i.e. the multibyte_props array.  The array is
useless since we never scan the buffer backwards.

The other source of inefficiency is the conversion to lowercase, which
is very much related to patch 11.  As a workaround, patch 17 matches
line-by-line in the current worst case.  Part of this inefficiency is
due to dfa (see comments above), but since that is not the whole story
I opted for this workaround.
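
To make the cost concrete, here is a minimal sketch of a per-line
lowercase conversion using the standard C multibyte functions.  It is an
assumption about the general shape of such a conversion, not the code in
grep; line_to_lower is a hypothetical name.  Every character is decoded,
folded, and re-encoded, which is why doing this more than once per line
is expensive.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return a newly allocated lowercase copy of the
   LEN-byte multibyte line LINE and store its byte length in *OUT_LEN.
   Bytes that do not form a valid character in the current locale are
   copied unchanged.  Returns NULL if memory cannot be allocated.  */
static char *
line_to_lower (const char *line, size_t len, size_t *out_len)
{
  mbstate_t in_state, out_state;
  size_t i = 0, o = 0;
  char *out;

  assert (line != NULL && out_len != NULL);
  memset (&in_state, 0, sizeof in_state);
  memset (&out_state, 0, sizeof out_state);

  /* Worst case: each input byte yields one character of MB_LEN_MAX bytes.  */
  out = malloc (len * MB_LEN_MAX + 1);
  if (out == NULL)
    return NULL;

  while (i < len)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, line + i, len - i, &in_state);
      size_t m;

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or truncated sequence: copy one byte and resync.  */
          out[o++] = line[i++];
          memset (&in_state, 0, sizeof in_state);
          continue;
        }
      if (n == 0)
        n = 1;                  /* An embedded NUL occupies one byte.  */

      m = wcrtomb (out + o, (wchar_t) towlower ((wint_t) wc), &out_state);
      if (m == (size_t) -1)
        {
          /* Folded character not encodable: keep the original bytes.  */
          memcpy (out + o, line + i, n);
          memset (&out_state, 0, sizeof out_state);
          m = n;
        }
      o += m;
      i += n;
    }

  out[o] = '\0';
  *out_len = o;
  return out;
}
```

Note that the lowercase form of a character may have a different byte
length than the original, so offsets in the converted line do not
necessarily map back to offsets in the input.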


Paolo Bonzini (17):
  kwset/system: remove ptr_t
  grep: cleanup one const cast
  dfa: get rid of x*alloc
  dfa: remove CRANGE dead code
  dfa, grep: cleanup if-before-free and cast-of-argument-to-free
  grep: fix error-message-uppercase
  syntax-check: enable makefile-TAB-only-indentation
  syntax-check: enable m4-quote-check
  syntax-check: enable space-tab
  tests: add more UTF-8 test cases
  dfa: rewrite handling of multibyte case folding
  dfa: speed up handling of brackets
  dfa: optimize simple character sets under UTF-8 charsets
  dfa: cache MB_CUR_MAX for dfaexec
  dfa: run simple UTF-8 regexps as a single-byte character set
  grep: remove check_multibyte_string, fix non-UTF8 missed match
  grep: match multibyte charsets line-by-line when using -i

 .x-sc_avoid_if_before_free         |    2 -
 .x-sc_cast_of_alloca_return_value  |    1 -
 .x-sc_cast_of_x_alloc_return_value |    1 -
 .x-sc_space_tab                    |    1 +
 Makefile.am                        |    2 +-
 cfg.mk                             |    5 -
 configure.ac                       |    2 +-
 src/dfa.c                          | 1052 +++++++++++++++++-------------------
 src/dfa.h                          |   24 +-
 src/grep.c                         |  108 ++---
 src/kwset.h                        |    3 +-
 src/search.c                       |  262 +++++----
 src/system.h                       |   12 -
 tests/Makefile.am                  |    6 +-
 tests/case-fold-backslash-w        |   14 +
 tests/euc-mb                       |   23 +
 tests/foad1.sh                     |   10 +-
 tests/spencer1-locale.awk          |   30 +
 tests/spencer1-locale.sh           |   20 +
 tests/status.sh                    |    8 +-
 20 files changed, 803 insertions(+), 783 deletions(-)
 delete mode 100644 .x-sc_avoid_if_before_free
 delete mode 100644 .x-sc_cast_of_alloca_return_value
 delete mode 100644 .x-sc_cast_of_x_alloc_return_value
 create mode 100644 .x-sc_space_tab
 create mode 100755 tests/case-fold-backslash-w
 create mode 100644 tests/euc-mb
 create mode 100644 tests/spencer1-locale.awk
 create mode 100755 tests/spencer1-locale.sh
