Re: [Chicken-users] an oddly slow regex ...

2013-10-29 Thread J Altfas
FWIW, I tried the same regex in Tcl.  Using tclsh, I entered at the 
prompt:

% regexp -inline {[a-z][a-z0-9\\-_.]{0,20}} 
a012345678901234567890123456789
a01234567890123456789

It seemed to work fast, but to quantify execution time, the timed 
output of 10 iterations was 1.3823 microseconds per iteration.

The Tcl hybrid regex engine is well-described at 
http://wiki.tcl.tk/396 and http://wiki.tcl.tk/461.  The nuts and 
bolts of its design are beyond my level of comprehension, but those 
more familiar with the subject might find it interesting.

Jules Altfas.

 On Sat, 26 Oct 2013 20:38:12 +0200, Peter Bex 
peter@xs4all.nl wrote:

 On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
  This regex is so slow that you don't need a timer to see the 
impact (at
  least not on my machine with chicken 4.8.0):
  
   (string-match [a-z][a-z0-9\\-_.]{0,20} 
a012345678901234567890123456789)
  
  Changing the {0,20} to + makes it run normally fast so I just 
replaced the
  regex with a string-length and modified the {0,20} to + . I 
don't
  necessarily need a fix for this but it seems like a possible 
symptom of a
  deeper problem so I thought I'd report it.
 
 Hi Matt,
 
 Thanks for your report.  I'm afraid this is a known problem with
 Irregex - to avoid producing a state machine with too many states,
 it will always use a backtracking implementation for all 
repetition
 counts.
 
 I think it's best to take a look at how to fix this upstream 
first.
 Maybe Alex has an idea of how to do that.
 
  Pre-compiling the regex didn't seem to make any difference.
 
 That's because the backtracker matches really slowly.
 
  Just for completeness I compared with Ruby and the ruby 
equivalent
  is (in human terms) instant.
  
  a012345678901234567890123456789.match(/[a-z][a-z0-9]{0,20}/)
 
 Thanks for that!  We need more of such real-world test cases.
 Ideally I'd love to have a complete library of regex examples, 
which
 we can use to optimize irregex further.
 
 By the way, I note that your Ruby regex isn't exactly the same as 
the
 one you used in CHICKEN.  Does it make a difference if you use the
 same regex?
 
 Cheers,
 Peter
 -- 
 http://www.more-magic.net
 
 ___
 Chicken-users mailing list
 Chicken-users@nongnu.org
 https://lists.nongnu.org/mailman/listinfo/chicken-users
___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users


Re: [Chicken-users] an oddly slow regex ...

2013-10-27 Thread Alex Shinn
On Sun, Oct 27, 2013 at 3:38 AM, Peter Bex peter@xs4all.nl wrote:

 On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
  This regex is so slow that you don't need a timer to see the impact (at
  least not on my machine with chicken 4.8.0):
 
   (string-match [a-z][a-z0-9\\-_.]{0,20}
 a012345678901234567890123456789)
 
  Changing the {0,20} to + makes it run normally fast so I just replaced
 the
  regex with a string-length and modified the {0,20} to + . I don't
  necessarily need a fix for this but it seems like a possible symptom of a
  deeper problem so I thought I'd report it.

 Hi Matt,

 Thanks for your report.  I'm afraid this is a known problem with
 Irregex - to avoid producing a state machine with too many states,
 it will always use a backtracking implementation for all repetition
 counts.

 I think it's best to take a look at how to fix this upstream first.
 Maybe Alex has an idea of how to do that.


It's possible to expand repetition counts in the DFA.  Your
example basically becomes:

[a-z][a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?

which in this case probably has a reasonably sized DFA.
In other cases, these patterns can easily blow up and
force the backtracking fallback.  I'm not sure if it's better
to try more aggressively or to keep the easier rule of thumb
that n..m repetitions always force backtracking.

 Pre-compiling the regex didn't seem to make any difference.

 That's because the backtracker matches really slowly.


There is possibly some pathological backtracking happening
here, I'll take a look.

The long-term goal is to replace the backtracking with a
more scalable approach (e.g. the non-backtracking NFA
used in the SRFI 115 reference implementation) and only
use DFAs for simple patterns or when speed is explicitly
requested.  Doing this and preserving all PCRE features
will take time though (especially backrefs).

-- 
Alex
___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users


Re: [Chicken-users] an oddly slow regex ...

2013-10-26 Thread Matt Welland
On Sat, Oct 26, 2013 at 11:38 AM, Peter Bex peter@xs4all.nl wrote:

 On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote:
  This regex is so slow that you don't need a timer to see the impact (at
  least not on my machine with chicken 4.8.0):
 
   (string-match [a-z][a-z0-9\\-_.]{0,20}
 a012345678901234567890123456789)
 
  Changing the {0,20} to + makes it run normally fast so I just replaced
 the
  regex with a string-length and modified the {0,20} to + . I don't
  necessarily need a fix for this but it seems like a possible symptom of a
  deeper problem so I thought I'd report it.

 Hi Matt,

 Thanks for your report.  I'm afraid this is a known problem with
 Irregex - to avoid producing a state machine with too many states,
 it will always use a backtracking implementation for all repetition
 counts.

 I think it's best to take a look at how to fix this upstream first.
 Maybe Alex has an idea of how to do that.

  Pre-compiling the regex didn't seem to make any difference.

 That's because the backtracker matches really slowly.

  Just for completeness I compared with Ruby and the ruby equivalent
  is (in human terms) instant.
 
  a012345678901234567890123456789.match(/[a-z][a-z0-9]{0,20}/)

 Thanks for that!  We need more of such real-world test cases.
 Ideally I'd love to have a complete library of regex examples, which
 we can use to optimize irregex further.

 By the way, I note that your Ruby regex isn't exactly the same as the
 one you used in CHICKEN.  Does it make a difference if you use the
 same regex?


Ah, yes, I did initially compare the exact same regex and saw no
discernible speed impact in Ruby. I did a few variations to see if there
was any sensitivity to ^ or $ at the beginning or end of the regex or
variations to the contents of the character group. This seems to be pretty
much a function of the match count {0,20}.

Some coding for performance hints on the regex page would likely be helpful.




 Cheers,
 Peter
 --
 http://www.more-magic.net




-- 
Matt
-=-
90% of the nations wealth is held by 2% of the people. Bummer to be in the
majority...
___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users