On Fri, Feb 22, 2013 at 1:21 AM, Tikhon B. <[email protected]> wrote:
> Robert Klemme wrote in post #1098199:
>> On Thu, Feb 21, 2013 at 12:12 PM, Tikhon B. <[email protected]>
>> wrote:
>> You are doing the measurement wrong. You do not only measure matching
>> but also creation of the String. Given that fact, all your results
>> are moot.
>
> I actively considered this matter when submitting the post, but in the
> end I included the string creating in the test to allow anyone to try
> out the example by copying one line.
Having a second line which defines a variable or constant isn't really
that more inconvenient.
> The time necessary for the actual
> creation process is negligible; I need to run the string creation ten
> thousand times before the result is substantial enough to significantly
> affect the numbers I'm getting.
Still you are measuring one thing but reasoning about another.
> I am not sure what sort of proper and complete tests you are looking
> for.
Tests which measure exactly the functionality you want to scrutinize
not something else.
> Regarding the actual slowdown, the issue comes down to the engine
> repeating the match from the start for every single "a" in the string,
> then failing and trying again with the next letter.
Backtracking.
> I do not see how a
> greedy modifier can help in this situation, and I do not know enough
> about the ruby regex engine to comment on the underlying design.
Mine was a general statement about issues that can be avoided with
greediness modifiers. I didn't say it's the solution to all
performance issues of this kind.
I had a closer look at your expressions. They are of course
artificial to demonstrate the effects of backtracking. I would assume
that one would typically write different expressions in real
applications, for example, using anchoring which dramatically improves
the situation here.
$ ruby rx-bm.rb
user system total real
0 /(?>a[^b]*b)/ 11.732000 0.000000 11.732000 ( 11.737671)
1 /(?>a[^b]*b)/ 0.015000 0.000000 0.015000 ( 0.003000)
0 /\A(?>a[^b]*b)/ 0.000000 0.000000 0.000000 ( 0.002000)
1 /\A(?>a[^b]*b)/ 0.000000 0.000000 0.000000 ( 0.003000)
0 /(?<!a)(?>a[^b]*b)/ 0.016000 0.000000 0.016000 ( 0.012001)
1 /(?<!a)(?>a[^b]*b)/ 0.000000 0.000000 0.000000 ( 0.003000)
0 /aa*b/ 11.700000 0.000000 11.700000 ( 11.703670)
1 /aa*b/ 0.000000 0.000000 0.000000 ( 0.002000)
0 /\Aaa*b/ 0.000000 0.000000 0.000000 ( 0.002000)
1 /\Aaa*b/ 0.015000 0.000000 0.015000 ( 0.002000)
0 /(?<!a)aa*b/ 0.000000 0.000000 0.000000 ( 0.013001)
1 /(?<!a)aa*b/ 0.016000 0.000000 0.016000 ( 0.002000)
0 /a+b/ 11.248000 0.000000 11.248000 ( 11.252644)
1 /a+b/ 0.000000 0.000000 0.000000 ( 0.002000)
0 /\Aa+b/ 0.000000 0.000000 0.000000 ( 0.003000)
1 /\Aa+b/ 0.000000 0.000000 0.000000 ( 0.002000)
0 /(?<!a)a+b/ 0.015000 0.000000 0.015000 ( 0.012001)
1 /(?<!a)a+b/ 0.000000 0.000000 0.000000 ( 0.003000)
0 /a.*b/ 4.649000 0.000000 4.649000 ( 4.641265)
1 /a.*b/ 0.000000 0.000000 0.000000 ( 0.001000)
0 /\Aa.*b/ 0.000000 0.000000 0.000000 ( 0.001000)
1 /\Aa.*b/ 0.000000 0.000000 0.000000 ( 0.001000)
0 /(?<!a)a.*b/ 0.016000 0.000000 0.016000 ( 0.011001)
1 /(?<!a)a.*b/ 0.000000 0.000000 0.000000 ( 0.000000)
$ cat -n rx-bm.rb
1
2 require 'benchmark'
3
4 strings = [
5 ("a" * 10_000).freeze,
6 ("a" * 10_000 + "b").freeze
7 ]
8
9 data = [
10 /(?>a[^b]*b)/,
11 /\A(?>a[^b]*b)/,
12 /(?<!a)(?>a[^b]*b)/,
13
14 /aa*b/,
15 /\Aaa*b/,
16 /(?<!a)aa*b/,
17
18 /a+b/,
19 /\Aa+b/,
20 /(?<!a)a+b/,
21
22 /a.*b/,
23 /\Aa.*b/,
24 /(?<!a)a.*b/,
25 ]
26
27 REP = 20
28
29 Benchmark.bm 25 do |x|
30 data.each do |rx|
31 strings.each_with_index do |s, i|
32 x.report sprintf("%2d %p", i, rx) do
33 REP.times do
34 rx.match s
35 end
36 end
37 end
38 end
39 end
$
So, yes, there is a cost for backtracking involved, but it only shows
only in particular situations:
- large inputs
- specific expressions (absence of anchoring, using unlimited
repetition operators vs. limited (e.g. /a{0,10}/).
The phenomenon is well known with regular expression engines I
believe. The cases you mention seem to be rather the exception than
the common state of affairs. Which doesn't mean that Ruby's engine
cannot be improved. I just don't think it's not that dramatic an
issue.
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
--
[email protected] |
https://groups.google.com/d/forum/ruby-talk-google?hl=en
---
You received this message because you are subscribed to the Google Groups
"ruby-talk-google" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.