Re: [bitc-dev] Unicode RegExp Hell

William ML Leslie Wed, 23 Apr 2014 22:58:25 -0700

On 23 April 2014 23:31, Jonathan S. Shapiro <[email protected]> wrote:
> On Tue, Apr 22, 2014 at 10:38 PM, William ML Leslie
> <[email protected]> wrote:
>> > The L class is the class of lowercase unicode characters. The
>> > transition on 'a' really needs to go to a different state than the
>> > transition on \p{L}. This means that the DFA converter actually needs
>> > to be able to unpack unicode classes. At the moment, I'm not even
>> > preserving them in the NFA. I go ahead and expand them into their
>> > constituent subranges. This makes character class negation easier in
>> > any case.
>>
>> I had been expanding them only for the ascii subset, as the only
>> matches I was using outside that range were always in the form of a
>> class.  Otherwise, given N possibly overlapping matches, I built the
>> continuation once for each subset of the target states.
>
>
> That makes sense but in a general-purpose RE the following is legal input:
>
> [-0-9\p{lowercase}]   # not (lowercase or ascii digits)
>
> [a\P{lowercase}]      # 'a' or anything that is not lowercase.


These happen to work for me by happy accident: I had been testing each
ASCII character against the set and expanding them when compiling from
the DFA into code.  Once you have handled all ASCII characters, all you
have left is [-\p{lowercase}] and [\P{lowercase}] respectively.  Now,
if you'd replaced 0 and 9 with alpha and omega, I would not have
handled that correctly, and a general-purpose RE engine would have to.
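
As a rough illustration of the ASCII split described above (a sketch
only, not the code I actually run): the CharSet type, split_ascii, and
the use of Python's str.islower as a stand-in for \p{lowercase} are all
assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

ASCII_MAX = 0x7F


@dataclass
class CharSet:
    """Hypothetical model of a bracket set: explicit ranges plus one class."""
    ranges: List[Tuple[int, int]]     # inclusive codepoint ranges, e.g. 0-9
    in_class: Callable[[str], bool]   # stand-in predicate for \p{...}
    class_negated: bool = False       # True for \P{...}
    negated: bool = False             # True when the whole set is negated

    def matches(self, cp: int) -> bool:
        hit = any(lo <= cp <= hi for lo, hi in self.ranges)
        hit = hit or (self.in_class(chr(cp)) != self.class_negated)
        return hit != self.negated


def split_ascii(cs: CharSet) -> Tuple[Set[int], bool, bool]:
    """Test every ASCII codepoint against the set; keep only the class beyond."""
    ascii_members = {cp for cp in range(ASCII_MAX + 1) if cs.matches(cp)}
    # Outside ASCII only the class is left, with both negations folded together.
    residue_negated = cs.class_negated != cs.negated
    # The residue is only correct if no explicit range reaches past ASCII.
    exact = all(hi <= ASCII_MAX for _, hi in cs.ranges)
    return ascii_members, residue_negated, exact


# [-0-9\p{lowercase}]: residue is the negated class, and it is exact.
digits = CharSet([(ord("0"), ord("9"))], str.islower, negated=True)
# Same set with alpha..omega instead of 0..9: the residue loses the range.
greek = CharSet([(0x3B1, 0x3C9)], str.islower, negated=True)
```

Here split_ascii(digits) reports an exact residue, while
split_ascii(greek) flags the residue as inexact, which is the case I
said I would get wrong.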

It's not that it's a hard thing to do, but until I moved the
optimisation into the DFA -> code pass, I wasn't confident of how to
implement such sets and make them efficient. (The answer turned out to
be: the same way as sets within the ASCII range.)
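
One way to picture treating non-ASCII sets the same as ASCII ones is to
cut the overlapping ranges into disjoint pieces, each labelled with the
subset of target states it leads to.  This is a minimal sketch under
assumptions: partition() and the state numbers are hypothetical, and the
two ranges standing in for a class are nowhere near a real Unicode
class expansion.

```python
from typing import FrozenSet, List, Tuple

Range = Tuple[int, int]  # inclusive codepoint range


def partition(
    transitions: List[Tuple[Range, int]],
) -> List[Tuple[Range, FrozenSet[int]]]:
    """Split overlapping ranges into disjoint ones, keyed by target subset."""
    # Every range start, and every position just past a range end, is a cut.
    cuts = sorted({p for (lo, hi), _ in transitions for p in (lo, hi + 1)})
    pieces = []
    for start, end in zip(cuts, cuts[1:]):
        # A piece belongs to every transition whose range fully covers it.
        targets = frozenset(
            t for (lo, hi), t in transitions if lo <= start and end - 1 <= hi
        )
        if targets:  # gaps between ranges carry no transition
            pieces.append(((start, end - 1), targets))
    return pieces


# 'a' goes to state 1; a toy "class" (a-z plus alpha-omega) goes to state 2.
pieces = partition([
    ((ord("a"), ord("a")), 1),
    ((ord("a"), ord("z")), 2),
    ((0x3B1, 0x3C9), 2),
])
```

The 'a' piece ends up with the subset {1, 2} while b-z and alpha-omega
both carry {2}, so the continuation for {2} only needs to be built
once, whether the range is inside ASCII or not.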

> note: \P{class} is the negation of the named class.

Thanks for spelling that out.  I knew that one, but there is so much
variation between regex syntaxes.  For example, I would not have picked
up the leading "-" as negation; I am used to ^ for that purpose.

-- 
William Leslie

Notice:
Likely much of this email is, by the nature of copyright, covered
under copyright law.  You absolutely may reproduce any part of it in
accordance with the copyright law of the nation you are reading this
in.  Any attempt to deny you those rights would be illegal without
prior contractual agreement.
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
