Thien-Thi Nguyen <t...@gnuvola.org> writes:

> () Mark H Weaver <m...@netris.org>
> () Thu, 17 Mar 2011 21:38:28 -0400
>
>    If we may assume that the searched string is valid UTF-8, and when only
>    ASCII characters are excluded (e.g. "."), then three additional states
>    are required in the generated DFA.  Let us call them S1, S2, and S3.
>
>    [handling these states]
>
>    When non-ASCII characters are excluded, additional states must be added:
>    one for each unique prefix of the excluded multibyte characters.  It's
>    quite straightforward.
>
> I don't understand what "excluded" means here.  Is this a property
> of the string, the regexp, the (dynamic, environmental) operation, or ...?

The excluded characters are the ones listed inside of a [^...] form.
For example, [^abc] excludes only the ASCII characters #\a, #\b, and
#\c.  [^abcλ] excludes also the non-ASCII character #\λ.

"." is equivalent to a [^...] form that excludes only newlines.


I should also note that although UTF-8 causes a few complications when
compiling regexps, UTF-32 creates much worse problems that require a
different DFA data structure and a slower scan.

In the UTF-8 case, each state in the DFA can contain a fixed lookup
table with 256 entries (one for each byte) as is used for ASCII or
Latin1.  However, in the UTF-32 case, it is _not_ practical for each
state to contain a lookup table with over 1 million entries (one for
each code point).

    Mark

Reply via email to