Thien-Thi Nguyen <t...@gnuvola.org> writes: > () Mark H Weaver <m...@netris.org> > () Thu, 17 Mar 2011 21:38:28 -0400 > > If we may assume that the searched string is valid UTF-8, and when only > ASCII characters are excluded (e.g. "."), then three additional states > are required in the generated DFA. Let us call them S1, S2, and S3. > > [handling these states] > > When non-ASCII characters are excluded, additional states must be added: > one for each unique prefix of the excluded multibyte characters. It's > quite straightforward. > > I don't understand what "excluded" means here. Is this a property > of the string, the regexp, the (dynamic, environmental) operation, or ...?
The excluded characters are the ones listed inside of a [^...] form. For example, [^abc] excludes only the ASCII characters #\a, #\b, and #\c. [^abcλ] excludes also the non-ASCII character #\λ. "." is equivalent to a [^...] form that excludes only newlines. I should also note that although UTF-8 causes a few complications when compiling regexps, UTF-32 creates much worse problems that require a different DFA data structure and a slower scan. In the UTF-8 case, each state in the DFA can contain a fixed lookup table with 256 entries (one for each byte) as is used for ASCII or Latin1. However, in the UTF-32 case, it is _not_ practical for each state to contain a lookup table with over 1 million entries (one for each code point). Mark