subject:"\"Re\\\: \\\[Chicken\\\-users\\\] an oddly slow regex ...\""

Re: [Chicken-users] an oddly slow regex ...

2013-10-29 Thread J Altfas

FWIW, I tried the same regex in Tcl.  Using tclsh, I entered at the 
prompt:

% regexp -inline {[a-z][a-z0-9\\-_.]{0,20}} 
"a012345678901234567890123456789"
a01234567890123456789

It seemed to work fast, but to quantify execution time, the timed 
output of 10 iterations was "1.3823 microseconds per iteration".

The Tcl "hybrid" regex engine is well-described at 
http://wiki.tcl.tk/396 and http://wiki.tcl.tk/461.  The "nuts and 
bolts" of its design are beyond my level of comprehension, but those 
more familiar with the subject might find it interesting.

Jules Altfas.

> On Sat, 26 Oct 2013 20:38:12 +0200, Peter Bex 
 wrote:

> On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
> > This regex is so slow that you don't need a timer to see the 
impact (at
> > least not on my machine with chicken 4.8.0):
> > 
> >  (string-match "[a-z][a-z0-9\\-_.]{0,20}" 
"a012345678901234567890123456789")
> > 
> > Changing the {0,20} to + makes it run normally fast so I just 
replaced the
> > regex with a string-length and modified the "{0,20}" to "+" . I 
don't
> > necessarily need a fix for this but it seems like a possible 
symptom of a
> > deeper problem so I thought I'd report it.
> 
> Hi Matt,
> 
> Thanks for your report.  I'm afraid this is a known problem with
> Irregex - to avoid producing a state machine with too many states,
> it will always use a backtracking implementation for all 
repetition
> counts.
> 
> I think it's best to take a look at how to fix this upstream 
first.
> Maybe Alex has an idea of how to do that.
> 
> > Pre-compiling the regex didn't seem to make any difference.
> 
> That's because the backtracker matches really slowly.
> 
> > Just for completeness I compared with Ruby and the ruby 
equivalent
> > is (in human terms) instant.
> > 
> > "a012345678901234567890123456789".match(/[a-z][a-z0-9]{0,20}/)
> 
> Thanks for that!  We need more of such real-world test cases.
> Ideally I'd love to have a complete library of regex examples, 
which
> we can use to optimize irregex further.
> 
> By the way, I note that your Ruby regex isn't exactly the same as 
the
> one you used in CHICKEN.  Does it make a difference if you use the
> same regex?
> 
> Cheers,
> Peter
> -- 
> http://www.more-magic.net
> 
> ___
> Chicken-users mailing list
> Chicken-users@nongnu.org
> https://lists.nongnu.org/mailman/listinfo/chicken-users
>___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users

Re: [Chicken-users] an oddly slow regex ...

2013-10-27 Thread Alex Shinn

On Sun, Oct 27, 2013 at 3:38 AM, Peter Bex  wrote:

> On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
> > This regex is so slow that you don't need a timer to see the impact (at
> > least not on my machine with chicken 4.8.0):
> >
> >  (string-match "[a-z][a-z0-9\\-_.]{0,20}"
> "a012345678901234567890123456789")
> >
> > Changing the {0,20} to + makes it run normally fast so I just replaced
> the
> > regex with a string-length and modified the "{0,20}" to "+" . I don't
> > necessarily need a fix for this but it seems like a possible symptom of a
> > deeper problem so I thought I'd report it.
>
> Hi Matt,
>
> Thanks for your report.  I'm afraid this is a known problem with
> Irregex - to avoid producing a state machine with too many states,
> it will always use a backtracking implementation for all repetition
> counts.
>
> I think it's best to take a look at how to fix this upstream first.
> Maybe Alex has an idea of how to do that.
>

It's possible to expand repetition counts in the DFA.  Your
example basically becomes:

"[a-z][a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?"

which in this case probably has a reasonably sized DFA.
In other cases, these patterns can easily blow up and
force the backtracking fallback.  I'm not sure if it's better
to try more aggressively or to keep the easier rule of thumb
that n..m repetitions always force backtracking.

> Pre-compiling the regex didn't seem to make any difference.
>
> That's because the backtracker matches really slowly.
>

There is possibly some pathological backtracking happening
here, I'll take a look.

The long-term goal is to replace the backtracking with a
more scalable approach (e.g. the non-backtracking NFA
used in the SRFI 115 reference implementation) and only
use DFAs for simple patterns or when speed is explicitly
requested.  Doing this and preserving all PCRE features
will take time though (especially backrefs).

-- 
Alex
___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users

Re: [Chicken-users] an oddly slow regex ...

2013-10-26 Thread Matt Welland

On Sat, Oct 26, 2013 at 11:38 AM, Peter Bex  wrote:

> On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
> > This regex is so slow that you don't need a timer to see the impact (at
> > least not on my machine with chicken 4.8.0):
> >
> >  (string-match "[a-z][a-z0-9\\-_.]{0,20}"
> "a012345678901234567890123456789")
> >
> > Changing the {0,20} to + makes it run normally fast so I just replaced
> the
> > regex with a string-length and modified the "{0,20}" to "+" . I don't
> > necessarily need a fix for this but it seems like a possible symptom of a
> > deeper problem so I thought I'd report it.
>
> Hi Matt,
>
> Thanks for your report.  I'm afraid this is a known problem with
> Irregex - to avoid producing a state machine with too many states,
> it will always use a backtracking implementation for all repetition
> counts.
>
> I think it's best to take a look at how to fix this upstream first.
> Maybe Alex has an idea of how to do that.
>
> > Pre-compiling the regex didn't seem to make any difference.
>
> That's because the backtracker matches really slowly.
>
> > Just for completeness I compared with Ruby and the ruby equivalent
> > is (in human terms) instant.
> >
> > "a012345678901234567890123456789".match(/[a-z][a-z0-9]{0,20}/)
>
> Thanks for that!  We need more of such real-world test cases.
> Ideally I'd love to have a complete library of regex examples, which
> we can use to optimize irregex further.
>
> By the way, I note that your Ruby regex isn't exactly the same as the
> one you used in CHICKEN.  Does it make a difference if you use the
> same regex?
>

Ah, yes, I did initially compare the exact same regex and saw no
discernible speed impact in Ruby. I did a few variations to see if there
was any sensitivity to ^ or $ at the beginning or end of the regex or
variations to the contents of the character group. This seems to be pretty
much a function of the match count {0,20}.

Some coding for performance hints on the regex page would likely be helpful.



>
> Cheers,
> Peter
> --
> http://www.more-magic.net
>



-- 
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users

Re: [Chicken-users] an oddly slow regex ...

2013-10-26 Thread Peter Bex

On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
> This regex is so slow that you don't need a timer to see the impact (at
> least not on my machine with chicken 4.8.0):
> 
>  (string-match "[a-z][a-z0-9\\-_.]{0,20}" "a012345678901234567890123456789")
> 
> Changing the {0,20} to + makes it run normally fast so I just replaced the
> regex with a string-length and modified the "{0,20}" to "+" . I don't
> necessarily need a fix for this but it seems like a possible symptom of a
> deeper problem so I thought I'd report it.

Hi Matt,

Thanks for your report.  I'm afraid this is a known problem with
Irregex - to avoid producing a state machine with too many states,
it will always use a backtracking implementation for all repetition
counts.

I think it's best to take a look at how to fix this upstream first.
Maybe Alex has an idea of how to do that.

> Pre-compiling the regex didn't seem to make any difference.

That's because the backtracker matches really slowly.

> Just for completeness I compared with Ruby and the ruby equivalent
> is (in human terms) instant.
> 
> "a012345678901234567890123456789".match(/[a-z][a-z0-9]{0,20}/)

Thanks for that!  We need more of such real-world test cases.
Ideally I'd love to have a complete library of regex examples, which
we can use to optimize irregex further.

By the way, I note that your Ruby regex isn't exactly the same as the
one you used in CHICKEN.  Does it make a difference if you use the
same regex?

Cheers,
Peter
-- 
http://www.more-magic.net

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users

Re: [Chicken-users] an oddly slow regex ...

Re: [Chicken-users] an oddly slow regex ...

Re: [Chicken-users] an oddly slow regex ...

Re: [Chicken-users] an oddly slow regex ...

4 matches

Site Navigation

Mail list logo

Footer information