On Thu, Oct 18, 2012 at 2:51 AM, Matthew Kerwin <[email protected]> wrote:
> Tangentially, it just occurred to me that ruby's regular expression
> engine does the same thing that javascript's does, when globally
> replacing /X*$/ .
This behavior is common to most regexp engines (at least I don't
know of any that does _not_ behave like this). Any regular expression
of the form X* can match the empty string, at any position in the input.
irb(main):022:0> "####".scan /\w*/
=> ["", "", "", "", ""]
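The same effect shows up with gsub, which is what matters for the replacement discussed further down. A minimal sketch; the sample strings and constant names are made up for illustration:

```ruby
# A starred pattern succeeds at every position, consuming nothing
# wherever the repeated item does not fit.
EMPTY_MATCHES = "####".scan(/\w*/)
p EMPTY_MATCHES   # five empty strings, one per position 0 through 4

# gsub therefore inserts the replacement at every position.
DASHED = "ab".gsub(/x*/, "-")
p DASHED          # => "-a-b-"

# Where the pattern can consume characters it does so, but the
# empty matches at the remaining positions are still reported.
MIXED = "abc".scan(/b*/)
p MIXED           # => ["", "b", "", ""]
```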
Also, when you anchor a portion of the expression at the end of the
string and that portion contains a repetition, you need to make sure
that the characters it should match are not eaten by earlier parts of
the regexp.
"naive" approach:
irb(main):026:0> %w{aaa aab abb bbb}.each {|s| /.*(b*)\z/ =~ s; printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:""
"abb": 1:""
"bbb": 1:""
=> ["aaa", "aab", "abb", "bbb"]
Working approaches:
1. reduce greed
irb(main):027:0> %w{aaa aab abb bbb}.each {|s| /.*?(b*)\z/ =~ s;
printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:"b"
"abb": 1:"bb"
"bbb": 1:"bbb"
=> ["aaa", "aab", "abb", "bbb"]
2. negative lookbehind
irb(main):028:0> %w{aaa aab abb bbb}.each {|s| /.*(?<!b)(b*)\z/ =~ s;
printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:"b"
"abb": 1:"bb"
"bbb": 1:"bbb"
=> ["aaa", "aab", "abb", "bbb"]
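To compare the three patterns side by side, something like the following sketch could be used. The names `PATTERNS` and `trailing_bs` are hypothetical and only serve the illustration:

```ruby
# The three patterns from above, keyed by a descriptive name.
PATTERNS = {
  naive:      /.*(b*)\z/,        # greedy .* consumes the b's first
  non_greedy: /.*?(b*)\z/,       # .*? leaves the trailing b's to (b*)
  lookbehind: /.*(?<!b)(b*)\z/,  # .* must stop where no "b" precedes
}

# Returns the text captured by group 1, or nil if the pattern does not match.
def trailing_bs(pattern, string)
  match = pattern.match(string)
  match && match[1]
end

%w{aaa aab abb bbb}.each do |s|
  results = PATTERNS.map { |name, re| "#{name}=#{trailing_bs(re, s).inspect}" }
  puts "#{s.inspect}: #{results.join(' ')}"
end
```

Both working variants give the same captures for these inputs; which one reads better is mostly a matter of taste.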
Note, though, the special case where the regexp consists of nothing
but the match anchored at the end:
irb(main):045:0> for b in body; for pre in segm; for post in segm;
s="#{pre}#{b}#{post}"; printf "%p -> %p\n",s,s[/#*\z/]; end end end
"" -> ""
"#" -> "#"
"##" -> "##"
"#" -> "#"
"##" -> "##"
"###" -> "###"
"##" -> "##"
"###" -> "###"
"####" -> "####"
"foo" -> ""
"foo#" -> "#"
"foo##" -> "##"
"#foo" -> ""
"#foo#" -> "#"
"#foo##" -> "##"
"##foo" -> ""
"##foo#" -> "#"
"##foo##" -> "##"
=> ["", "foo"]
Here, the simple expression works since the # characters are not eaten
by any other portion of the regexp.
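The loop above uses `body` and `segm`, which were set up earlier in the irb session and are not shown. Judging from the output, they were presumably `["", "foo"]` and `["", "#", "##"]`. Under that assumption, a self-contained version could look like this (`trailing_hashes` is a made-up helper name):

```ruby
# Assumed definitions, inferred from the output above; the original
# assignments are not part of this message.
BODY = ["", "foo"]
SEGM = ["", "#", "##"]

# Returns the run of "#" characters at the very end of the string
# (possibly empty). String#[] with a regexp yields the first match only.
def trailing_hashes(s)
  s[/#*\z/]
end

BODY.each do |b|
  # product iterates pre in the outer and post in the inner position,
  # matching the nested for loops.
  SEGM.product(SEGM).each do |pre, post|
    s = "#{pre}#{b}#{post}"
    printf "%p -> %p\n", s, trailing_hashes(s)
  end
end
```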
> It arose when someone wanted to replace any number
> (or none) of a character at the start and end of a string with exactly
> one of that character.
>
> irb(main):001:0> 'foo'.gsub(/\A#*|#*\Z/, '#')
> => "#foo#"
> irb(main):002:0> '#foo'.gsub(/\A#*|#*\Z/, '#')
> => "#foo#"
> irb(main):003:0> '##foo'.gsub(/\A#*|#*\Z/, '#')
> => "#foo#"
> irb(main):004:0> 'foo#'.gsub(/\A#*|#*\Z/, '#')
> => "#foo##"
> irb(main):005:0> 'foo##'.gsub(/\A#*|#*\Z/, '#')
> => "#foo##"
> irb(main):006:0> '##foo##'.gsub(/\A#*|#*\Z/, '#')
> => "#foo##"
If a single regexp is to be used in this case, the negative lookbehind
is a viable option, since this alternative has no preceding part that
we could make non-greedy:
irb(main):044:0> for b in body; for pre in segm; for post in segm;
s="#{pre}#{b}#{post}"; printf "%p -> %p\n",s,s.gsub(/\A#*|(?<!#)#*\z/,
'#'); end end end
"" -> "#"
"#" -> "#"
"##" -> "#"
"#" -> "#"
"##" -> "#"
"###" -> "#"
"##" -> "#"
"###" -> "#"
"####" -> "#"
"foo" -> "#foo#"
"foo#" -> "#foo#"
"foo##" -> "#foo#"
"#foo" -> "#foo#"
"#foo#" -> "#foo#"
"#foo##" -> "#foo#"
"##foo" -> "#foo#"
"##foo#" -> "#foo#"
"##foo##" -> "#foo#"
=> ["", "foo"]
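The reason the original pattern doubles the trailing # can be seen with scan: the end alternative matches twice. A small sketch, in which `wrap_in_hashes` is a hypothetical helper name and not something from the original post:

```ruby
# At the end of "foo#" the pattern matches twice: once consuming "#",
# and once more as an empty match at the very end of the string.
p "foo#".scan(/#*\z/)          # => ["#", ""]

# With the lookbehind, the empty match directly after a "#" is rejected.
p "foo#".scan(/(?<!#)#*\z/)    # => ["#"]

# Hypothetical helper: wrap a string in exactly one "#" on each side.
def wrap_in_hashes(s)
  s.gsub(/\A#*|(?<!#)#*\z/, '#')
end

p wrap_in_hashes("##foo##")    # => "#foo#"
p wrap_in_hashes("foo")        # => "#foo#"
```

As the output above shows, an input that is empty or consists only of # characters collapses to a single "#".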
> I blogged about it here:
> http://matthew.kerwin.net.au/blog/20110608_javascript_global_regexp
Turns out that with Oniguruma there *is* a way to do it with a single
regexp: the negative lookbehind shown above. In fact, any regexp engine
that supports lookbehind will do.
Reference: http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/