[going back on the list]
Thanks for this trimmed-down code. It at least gives an impression of
what is going on in GNU regex. But I would still like to start from scratch
for libunistring,
1. because I want code that works the same way in UTF-8 as in UTF-32,
You can do that, I think, incrementally from the current code. It
should be possible to tweak it so that OP_UTF8_PERIOD just "folds" into
OP_PERIOD when you're processing UTF-8, and so on.
2. in the hope that I can find a set of node/operators that is more
efficient (peephole optimization, like you said it) and possibly
integrate the "kwset" trick in some way.
I think the kwset search should be done separately (i.e. as
preprocessing). The tree-based representation of regcomp.c should be
quite amenable to extracting the required keywords.
One inefficiency I have in the posted code is that I try every single
starting point, which makes for O(n^2) performance instead of O(n). It
should be easy to prepend a .* in regcomp.c to fix this. But for
everything else it is a pretty natural DFA implementation.
There are other optimizations possible, for example I think you only
need MB_CUR_MAX elements in the state log so you could make it 8-entries
long (or 4 if you only want the million-something Unicode characters).
The current implementation still has a state log as big as the original
string, which is a remnant of the support for backreferences.
It would be nice to have a POSIX (or almost POSIX) mode in libunistring.
The multibyte DFA code is so bitrot in GNU grep that relying on
libunistring for UTF-8 locales would be a good cleanup...
Paolo