Re: [bitc-dev] Unicode RegExp Hell

Jonathan S. Shapiro Wed, 23 Apr 2014 06:32:41 -0700

On Tue, Apr 22, 2014 at 10:38 PM, William ML Leslie <
[email protected]> wrote:


> > Trust me: a table of ranges is much smaller than a precomputed byte array
> > for this. I need to get the code up where people can see it.
>
> Or maybe a BDD?
>

I thought about that. A BDD is a good choice to use as a filter, but it's
not as easy to negate one. In this particular case, the tables of ordered
ranges are small, so I don't know that a BDD gives any great advantage.


> > The L class is the class of lowercase unicode characters. The transition
> on
> > 'a' really needs to go to a different state than the transition on \p{L}.
> > This means that the DFA converter actually needs to be able to unpack
> > unicode classes. At the moment, I'm not even preserving them in the NFA.
> I
> > go ahead and expand them into their constituent subranges. This makes
> > character class negation easier in any case.
>
> I had been expanding them only for the ascii subset, as the only
> matches I was using outside that range were always in the form of a
> class.  Otherwise, given N possibly overlapping matches, I built the
> continuation once for each subset of the target states.
>

That makes sense but in a general-purpose RE the following is legal input:

[-0-9\p{lowercase}]   # not (lowercase or ascii digits)

[a\P{lowercase}]      # 'a' or anything that is not lowercase.


note: \P{class} is the negation of the named class.


shap

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] Unicode RegExp Hell

Reply via email to