Re: [Chicken-users] an oddly slow regex ...
FWIW, I tried the same regex in Tcl. Using tclsh, I entered at the prompt: % regexp -inline {[a-z][a-z0-9\\-_.]{0,20}} "a012345678901234567890123456789" a01234567890123456789 It seemed to work fast, but to quantify execution time, the timed output of 10 iterations was "1.3823 microseconds per iteration". The Tcl "hybrid" regex engine is well-described at http://wiki.tcl.tk/396 and http://wiki.tcl.tk/461. The "nuts and bolts" of its design are beyond my level of comprehension, but those more familiar with the subject might find it interesting. Jules Altfas. > On Sat, 26 Oct 2013 20:38:12 +0200, Peter Bex wrote: > On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote: > > This regex is so slow that you don't need a timer to see the impact (at > > least not on my machine with chicken 4.8.0): > > > > (string-match "[a-z][a-z0-9\\-_.]{0,20}" "a012345678901234567890123456789") > > > > Changing the {0,20} to + makes it run normally fast so I just replaced the > > regex with a string-length and modified the "{0,20}" to "+" . I don't > > necessarily need a fix for this but it seems like a possible symptom of a > > deeper problem so I thought I'd report it. > > Hi Matt, > > Thanks for your report. I'm afraid this is a known problem with > Irregex - to avoid producing a state machine with too many states, > it will always use a backtracking implementation for all repetition > counts. > > I think it's best to take a look at how to fix this upstream first. > Maybe Alex has an idea of how to do that. > > > Pre-compiling the regex didn't seem to make any difference. > > That's because the backtracker matches really slowly. > > > Just for completeness I compared with Ruby and the ruby equivalent > > is (in human terms) instant. > > > > "a012345678901234567890123456789".match(/[a-z][a-z0-9]{0,20}/) > > Thanks for that! We need more of such real-world test cases. > Ideally I'd love to have a complete library of regex examples, which > we can use to optimize irregex further. > > By the way, I note that your Ruby regex isn't exactly the same as the > one you used in CHICKEN. Does it make a difference if you use the > same regex? > > Cheers, > Peter > -- > http://www.more-magic.net > > ___ > Chicken-users mailing list > Chicken-users@nongnu.org > https://lists.nongnu.org/mailman/listinfo/chicken-users >___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] an oddly slow regex ...
On Sun, Oct 27, 2013 at 3:38 AM, Peter Bex wrote: > On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote: > > This regex is so slow that you don't need a timer to see the impact (at > > least not on my machine with chicken 4.8.0): > > > > (string-match "[a-z][a-z0-9\\-_.]{0,20}" > "a012345678901234567890123456789") > > > > Changing the {0,20} to + makes it run normally fast so I just replaced > the > > regex with a string-length and modified the "{0,20}" to "+" . I don't > > necessarily need a fix for this but it seems like a possible symptom of a > > deeper problem so I thought I'd report it. > > Hi Matt, > > Thanks for your report. I'm afraid this is a known problem with > Irregex - to avoid producing a state machine with too many states, > it will always use a backtracking implementation for all repetition > counts. > > I think it's best to take a look at how to fix this upstream first. > Maybe Alex has an idea of how to do that. > It's possible to expand repetition counts in the DFA. Your example basically becomes: "[a-z][a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?[a-z0-9\\-_.]?" which in this case probably has a reasonably sized DFA. In other cases, these patterns can easily blow up and force the backtracking fallback. I'm not sure if it's better to try more aggressively or to keep the easier rule of thumb that n..m repetitions always force backtracking. > Pre-compiling the regex didn't seem to make any difference. > > That's because the backtracker matches really slowly. > There is possibly some pathological backtracking happening here, I'll take a look. The long-term goal is to replace the backtracking with a more scalable approach (e.g. the non-backtracking NFA used in the SRFI 115 reference implementation) and only use DFAs for simple patterns or when speed is explicitly requested. Doing this and preserving all PCRE features will take time though (especially backrefs). -- Alex ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] an oddly slow regex ...
On Sat, Oct 26, 2013 at 11:38 AM, Peter Bex wrote: > On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote: > > This regex is so slow that you don't need a timer to see the impact (at > > least not on my machine with chicken 4.8.0): > > > > (string-match "[a-z][a-z0-9\\-_.]{0,20}" > "a012345678901234567890123456789") > > > > Changing the {0,20} to + makes it run normally fast so I just replaced > the > > regex with a string-length and modified the "{0,20}" to "+" . I don't > > necessarily need a fix for this but it seems like a possible symptom of a > > deeper problem so I thought I'd report it. > > Hi Matt, > > Thanks for your report. I'm afraid this is a known problem with > Irregex - to avoid producing a state machine with too many states, > it will always use a backtracking implementation for all repetition > counts. > > I think it's best to take a look at how to fix this upstream first. > Maybe Alex has an idea of how to do that. > > > Pre-compiling the regex didn't seem to make any difference. > > That's because the backtracker matches really slowly. > > > Just for completeness I compared with Ruby and the ruby equivalent > > is (in human terms) instant. > > > > "a012345678901234567890123456789".match(/[a-z][a-z0-9]{0,20}/) > > Thanks for that! We need more of such real-world test cases. > Ideally I'd love to have a complete library of regex examples, which > we can use to optimize irregex further. > > By the way, I note that your Ruby regex isn't exactly the same as the > one you used in CHICKEN. Does it make a difference if you use the > same regex? > Ah, yes, I did initially compare the exact same regex and saw no discernible speed impact in Ruby. I did a few variations to see if there was any sensitivity to ^ or $ at the beginning or end of the regex or variations to the contents of the character group. This seems to be pretty much a function of the match count {0,20}. Some coding for performance hints on the regex page would likely be helpful. > > Cheers, > Peter > -- > http://www.more-magic.net > -- Matt -=- 90% of the nations wealth is held by 2% of the people. Bummer to be in the majority... ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users
Re: [Chicken-users] an oddly slow regex ...
On Sat, Oct 26, 2013 at 10:37:36AM -0700, Matt Welland wrote: > This regex is so slow that you don't need a timer to see the impact (at > least not on my machine with chicken 4.8.0): > > (string-match "[a-z][a-z0-9\\-_.]{0,20}" "a012345678901234567890123456789") > > Changing the {0,20} to + makes it run normally fast so I just replaced the > regex with a string-length and modified the "{0,20}" to "+" . I don't > necessarily need a fix for this but it seems like a possible symptom of a > deeper problem so I thought I'd report it. Hi Matt, Thanks for your report. I'm afraid this is a known problem with Irregex - to avoid producing a state machine with too many states, it will always use a backtracking implementation for all repetition counts. I think it's best to take a look at how to fix this upstream first. Maybe Alex has an idea of how to do that. > Pre-compiling the regex didn't seem to make any difference. That's because the backtracker matches really slowly. > Just for completeness I compared with Ruby and the ruby equivalent > is (in human terms) instant. > > "a012345678901234567890123456789".match(/[a-z][a-z0-9]{0,20}/) Thanks for that! We need more of such real-world test cases. Ideally I'd love to have a complete library of regex examples, which we can use to optimize irregex further. By the way, I note that your Ruby regex isn't exactly the same as the one you used in CHICKEN. Does it make a difference if you use the same regex? Cheers, Peter -- http://www.more-magic.net ___ Chicken-users mailing list Chicken-users@nongnu.org https://lists.nongnu.org/mailman/listinfo/chicken-users