Thien-Thi Nguyen <t...@gnuvola.org> writes:
> In unibyte land, "." matches a byte.  OK.
>
> In multibyte land done "bytewise", "." matches ____________.
> (What goes in the blank?)

"." (and more generally [^...]) is equivalent to (a|b|c|d|...) where
every valid UTF-8 character is present in the disjunction except for the
excluded characters.  However, this case does indeed require special
consideration for the sake of efficiency in the regexp compiler.

If we may assume that the searched string is valid UTF-8, and when only
ASCII characters are excluded (e.g. "."), then three additional states
are required in the generated DFA.  Let us call them S1, S2, and S3.  S1
accepts one arbitrary byte, i.e. it has an outgoing edge for every byte
which leads to the final state.  S2 accepts any two bytes, i.e. it has
an outgoing edge for every byte which leads to S1.  Similarly for S3.

Now, the start state of "." (or [^...]) has outgoing edges for every
byte from 0x80 to 0xFF, leading to S1, S2, or S3, depending on the high
bits of the first byte.  Specifically, byte values 0xC0 through 0xDF
lead to S1, 0xE0 through 0xEF lead to S2, and 0xF0 through 0xF7 lead to
S3.

When non-ASCII characters are excluded, additional states must be added:
one for each unique prefix of the excluded multibyte characters.  It's
quite straightforward.

   Regards,
     Mark

Reply via email to