When I'm not being completely OCD about getting things right, I can be
regrettably lazy. In this case, that means that I took Russ's sketchy RE2
syntax specification at face value, and I didn't go off to look at the
pcrepattern man page to find out the real story. The results are
interesting and disturbing.
Russ mentions in passing in his discussion that there are special bytecodes
to recognize character classes, but he doesn't give that any particular
emphasis. There are a bunch of reasons to really like this approach.
I took a different route, implementing character range opcodes and
expanding each character class into its constituent ranges (either
positively or negatively). It definitely makes the NFA bigger, but I
figured I needed to generate a DFA in any case. I got to thinking about
cases like:
"(done)|(\p{L&}(\p{L&}|\d)*"
In english: match "done" or match a leading non-modifier letter followed by
zero or more non-modifier letters or digits
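To make "expanding a class into its constituent ranges" concrete, here is
a rough sketch (Python, purely for illustration; the opcode encoding is
made up for the example):

    import unicodedata

    def class_to_ranges(predicate, limit=0x110000):
        """Expand a character-class predicate into sorted, disjoint
        codepoint ranges by scanning the codepoint space."""
        ranges, start = [], None
        for cp in range(limit):
            if predicate(cp):
                if start is None:
                    start = cp
            elif start is not None:
                ranges.append((start, cp - 1))
                start = None
        if start is not None:
            ranges.append((start, limit - 1))
        return ranges

    # \p{L&} is the cased letters: categories Lu, Ll, Lt.
    cased = class_to_ranges(
        lambda cp: unicodedata.category(chr(cp)) in ("Lu", "Ll", "Lt"))

    # Each range then becomes one opcode in the NFA program.
    program = [("RANGE", lo, hi) for (lo, hi) in cased]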
I had several thoughts in making this choice:
1. Various parts of the RE specification allow negation in character
classes. Implementing that correctly while treating classes as indivisible
units seemed tricky.
2. When the time came to build the DFA, the case of leading 'd' in the
input has to go through a different path in the DFA than all of the other
letters. It seemed like I'd have to break up the character class in any
case as part of the DFA conversion process.
3. We eventually need to support byte search in a general-purpose RE
engine. That is: REs over binary data.
* Class Negation
The character class negation issue (issue 1) turns out to be a red herring.
The reason it's a red herring is that character classes are only permitted
in two places: (a) in lieu of a literal matching character, and (b) as a
constituent of a [] character class (i.e. set inclusion). Both cases can
be handled by negating the class in place. There turns out to be
no case where a general RE subgraph needs to be negated.
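Concretely, once a class is held as sorted disjoint ranges, in-place
negation is just complementing over the codepoint space. A sketch (Python,
illustration only):

    MAX_CP = 0x10FFFF

    def negate_ranges(ranges):
        """Complement a sorted, disjoint list of (lo, hi) codepoint
        ranges over [0, MAX_CP]."""
        out, next_cp = [], 0
        for lo, hi in ranges:
            if next_cp < lo:
                out.append((next_cp, lo - 1))
            next_cp = hi + 1
        if next_cp <= MAX_CP:
            out.append((next_cp, MAX_CP))
        return out

    print(negate_ranges([(ord('a'), ord('z'))]))   # [^a-z]
    # -> [(0, 96), (123, 1114111)]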
Subgraph negation actually isn't that bad, but it can't be done with the
classical construction method. The classical method takes the form "advance
or fail". If we want to be able to negate matches, then every subgraph
needs to have two exit arcs - one for succeed and one for fail - and there
needs to be a distinguished fail node. It's not exactly a big
deviation conceptually, but it changes the implementation quite a bit.
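For flavor, the shape of such a construction might look like this
(hypothetical types, Python; internal transitions elided):

    from dataclasses import dataclass

    @dataclass
    class State:
        pass   # internal transitions elided in this sketch

    @dataclass
    class Fragment:
        start: State
        succeed: State   # exit arc taken when the subgraph matches
        fail: State      # exit arc taken when the subgraph cannot match

    def negate(f: Fragment) -> Fragment:
        # Negating a subgraph is just exchanging its two exit arcs.
        return Fragment(f.start, succeed=f.fail, fail=f.succeed)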
* DFA Construction
After thinking about it for a while, I thought that splitting for DFA
support might not be so bad. I thought (mistakenly) that the branching
cases all involve individual characters. If so, then we could treat
matching at a given NFA state hierarchically: first deal with the
individual letter cases and then take whatever is left and run that through
the character class filters.
Unfortunately it isn't that simple. Consider an RE like:
"\p{Ll}|\p{Greek}x"
In English: Match a single lowercase letter, or any Greek letter followed
by 'x'.
For an NFA this pattern isn't a big deal. You'll run all of the relevant
paths in parallel and one or more of them will match. Unfortunately,
converting this to a DFA means that we need to compute the intersection of
\p{Ll} and \p{Greek}. Coming out of the first DFA state there are four
matches of interest:
1. Things in \p{Ll} that are in \p{Greek}
2. Things in \p{Ll} that are not in \p{Greek}
3. Things in \p{Greek} that are not in \p{Ll}
4. Anything else (which would be a failed match).
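Here is a sketch of that partition computation over range-represented
classes (Python, with stand-in tables; the real \p{Ll} and \p{Greek}
tables are much larger):

    def intersect(a, b):
        """Intersect two sorted, disjoint (lo, hi) range lists."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
            if lo <= hi:
                out.append((lo, hi))
            if a[i][1] < b[j][1]:   # advance whichever range ends first
                i += 1
            else:
                j += 1
        return out

    def complement(a, max_cp=0x10FFFF):
        out, nxt = [], 0
        for lo, hi in a:
            if nxt < lo:
                out.append((nxt, lo - 1))
            nxt = hi + 1
        if nxt <= max_cp:
            out.append((nxt, max_cp))
        return out

    # Stand-in fragments of the two classes:
    ll = [(0x61, 0x7A), (0x3B1, 0x3C9)]       # a-z plus Greek lowercase
    greek = [(0x391, 0x3A9), (0x3B1, 0x3C9)]  # Greek upper and lower

    both = intersect(ll, greek)                    # case 1
    ll_only = intersect(ll, complement(greek))     # case 2
    greek_only = intersect(greek, complement(ll))  # case 3
    # case 4 is the complement of the union of the other three.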
You can do this with character class bytecodes, but only if they are
non-consuming. If you have non-consuming match bytecodes, then you can
encode something like
if (peek() IN \p{Ll}) AND (peek() IN \p{Greek}) THEN MoveToState s
and believe it or not, this actually makes sense as a practical
implementation in a real lexer. If you were going to code this by hand, you
certainly wouldn't go and expand out the character classes looking for an
optimization (though you might do special cases).
And notice that shifting the burden of consumption to the MoveToState
operation actually makes sense in a DFA. The act of matching can be viewed
as happening in the node, or it can be viewed as happening in the outbound
transition. Since the two happen in lockstep, either one is a perfectly
good way of handling things.
But this tends to point us in an awkward direction: the bytecode machine
for the DFA now needs non-consuming matching operators. That either means I
need to go back and rethink the bytecode machine for NFAs or I need two
different bytecode machines for the two different classes of machine.
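To show what I mean, here is a toy interpreter for the DFA flavor (Python;
the state encoding and opcode names are invented for the example). Class
tests peek; only the transition consumes:

    def in_class(cp, ranges):
        return any(lo <= cp <= hi for lo, hi in ranges)

    def run_dfa(program, text):
        """Each state has an 'accept' flag and a list of
        (guard-classes, next-state) arcs. Guards are non-consuming;
        MoveToState is what consumes."""
        pos, state = 0, 0
        while pos < len(text):
            cp = ord(text[pos])            # peek()
            for guards, next_state in program[state]["arcs"]:
                if all(in_class(cp, r) for r in guards):
                    state = next_state
                    pos += 1               # consumption rides the transition
                    break
            else:
                break                      # no arc applies
        return program[state]["accept"], pos

    # State 0: if peek() in [a-z] AND peek() in [x], MoveToState 1.
    prog = [
        {"accept": False, "arcs": [(([(0x61, 0x7A)], [(0x78, 0x78)]), 1)]},
        {"accept": True, "arcs": []},
    ]
    print(run_dfa(prog, "x"))   # -> (True, 1)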
* Binary (Byte) REs
Having worked out how to handle character classes as special opcodes when
we are scanning by code point, that left me with one last problem: binary
REs. This is NOT something I'm going to implement now, but it's something I
want to think ahead about to make sure I'm not implementing myself into a
corner.
So first, let me note that the only place where this is complicated is when
we mix character and byte matching. The reason that's a problem is that
character matching assumes that the input pointer is always pointing at the
beginning of a code point when the matcher is initially called. If the
matcher is free to consume a single byte, the input assumption can easily
be violated. So on the one hand, it's begging for disaster, but on the
other hand there really *are* such things as mixed-mode input in the world.
We can obviously downconvert classes to byte sequences if we have to, so in
the limit this really isn't that big a deal. My personal preference would
be to delineate binary sub-REs and restrict those to byte-only matching
rules.
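For a taste of what the down-conversion looks like: a single-codepoint
match just becomes a sequence of byte-match opcodes (hypothetical
encoding, Python for illustration):

    def codepoint_to_byte_ops(cp):
        """Down-convert a single codepoint match into byte matches."""
        return [("BYTE", b) for b in chr(cp).encode("utf-8")]

    print(codepoint_to_byte_ops(0x3BB))   # U+03BB, lambda
    # -> [('BYTE', 206), ('BYTE', 187)]

Ranges are fiddlier: a codepoint range has to be split wherever the UTF-8
encoding length changes, after which each piece can be expressed with a
small number of byte-range sequences.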
* Duff's Device
Ivan Goddard recently reminded me in private email that high-performance
lexical analyzer generators tend to process the input in words rather than
bytes, using a variant of Duff's Device to copy the input bytes, at least
for languages that use ASCII input. In effect, the search process is "unrolled"
and word comparisons are used rather than byte comparisons. The current
engine that I have doesn't even *try* to do this, but it isn't hard to see
how it might be done.
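I haven't tried this, but the flavor is easy to show. Here is a sketch
(Python standing in for what would really be C over machine words) that
checks eight input bytes per step for the common all-ASCII case:

    HIGH_BITS = 0x8080808080808080

    def first_non_ascii(data: bytes) -> int:
        """Index of the first non-ASCII byte, or len(data).
        Scans the input a word (8 bytes) at a time."""
        i = 0
        while i + 8 <= len(data):
            word = int.from_bytes(data[i:i+8], "little")
            if word & HIGH_BITS:
                break       # some byte in this word has its high bit set
            i += 8
        while i < len(data) and data[i] < 0x80:
            i += 1
        return i

    print(first_non_ascii(b"hello, world \xce\xbb"))   # -> 13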
* Boyer-Moore
The lexical analysis problem is quite different from the problem of general
search on a long input string. In the case of long input strings, you'd
really like a cheap way to skip over substrings that cannot match. For
constant-string patterns, Boyer-Moore skipping leads to sublinear-time
matching. For lexical analysis it makes very little difference, because all
matches are anchored.
I can see how to do this, but I think it wants a special-case matching
engine for fixed input strings. I haven't attempted to deal with it.
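For reference, the skip machinery for the fixed-string case is small. A
Boyer-Moore-Horspool sketch (Python, illustration only):

    def horspool_find(needle: bytes, haystack: bytes) -> int:
        """Return the index of needle in haystack, or -1. The
        bad-character table lets the search skip over stretches of
        input that cannot contain a match."""
        m, n = len(needle), len(haystack)
        if m == 0:
            return 0
        # Shift distance for each byte appearing in needle[:-1].
        skip = {needle[i]: m - 1 - i for i in range(m - 1)}
        i = m - 1
        while i < n:
            j, k = m - 1, i
            while j >= 0 and haystack[k] == needle[j]:
                j, k = j - 1, k - 1
            if j < 0:
                return k + 1
            i += skip.get(haystack[i], m)
        return -1

    print(horspool_find(b"needle", b"haystack with a needle in it"))  # -> 16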
shap