On Sun, 13 Oct 2019 15:29:04 +0200 Hans Åberg via Unicode <unicode@unicode.org> wrote:
> > On 13 Oct 2019, at 15:00, Richard Wordingham via Unicode > > I'm now beginning to wonder what you are claiming. > I start with a NFA with no empty transitions and apply the subset DFA > construction dynamically for a given string along with some reverse > NFA-data that is enough to transverse backwards when a final state > arrives. The result is a NFA where all transverses is a match of the > string at that position. And then the speed comparison depends on how quickly one can extract the match information required from that data structure. Incidentally, at least some of the sizes and timings I gave seem to be wrong even for strings. They won't work with numeric quantifiers, as in /[ab]{0,20}[ac]{10,20}[ad]{0,20}e/. One gets lesser issues in quantifying complexity if one wants "Å" to match \p{Lu} when working in NFD - potentially a different state for each prefix of the capital letters. (It's also the case except for UTF-32 if characters are treated as sequences of code units.) Perhaps 'upper case letter that Unicode happens to have encoded as a single character' isn't a concept that regular expressions need to support concisely. What's needed is to have a set somewhere between [\p{Lu}&\p{isNFD}] and [\p{Lu}],though perhaps it should be extended to include "ff" - there are English surnames like "ffrench". Regards, Richard.