Re: [R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string ?

2009-12-20 Thread Hans W Borchers
Gabor Grothendieck <ggrothendieck at gmail.com> writes:
>
> Use a zero-width lookaround expression.  It will not consume its match.  See ?regexp
>
> > gregexpr("a(?=a)", "aaa", perl = TRUE)
> [[1]]
> [1] 1 2
> attr(,"match.length")
> [1] 1 1

I wonder how you would count the number of occurrences of, for example,
'aba' or 'a.a' (*) in the string "ababacababab" using simple lookahead?

In Perl, there is a modifier '/g' to do that, and in Python one could
apply the function 'findall'.

When I had this task, I wrote a small function findall(), see below, but
I would be glad to see a solution with lookahead only.

Regards
Hans Werner

(*) or anything more complex


findall <- function(apat, atxt) {
  stopifnot(length(apat) == 1, length(atxt) == 1)
  pos <- c()  # positions of matches
  i <- 1; n <- nchar(atxt)
  found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
  while (found > 0) {
    pos <- c(pos, i + found - 1)  # convert to a position in the full string
    i <- i + found                # resume one character past the match start
    found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
  }
  return(pos)
}


> On Sun, Dec 20, 2009 at 1:43 AM, Jonathan <jonsleepy at gmail.com> wrote:
>> Last one for you guys:
>>
>> The command:
>>
>> > length(gregexpr('cus','hocus pocus')[[1]])
>> [1] 2
>>
>> returns the number of times the substring 'cus' appears in 'hocus pocus'
>> (which is two).
>>
>> It's returning the number of **disjoint** matches.  So:
>>
>> > length(gregexpr('aa','aaa')[[1]])
>> [1] 1
>>
>> returns 1.
>>
>> **What I want to do:**
>> I'm looking for a way to count all occurrences of the substring, including
>> overlapping sets (so 'aa' would be found in 'aaa' two times, because the
>> middle 'a' gets counted twice).
>>
>> Any ideas would be much appreciated!!
>>
>> Signing off and thanks for all the great assistance,
>> Jonathan
 


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string ?

2009-12-20 Thread Hans W Borchers
Gabor Grothendieck <ggrothendieck at gmail.com> writes:
>
> Try this:
>
> > findall("aba", "ababacababab")
> [1] 1 3 7 9
> > gregexpr("a(?=ba)", "ababacababab", perl = TRUE)
> [[1]]
> [1] 1 3 7 9
> attr(,"match.length")
> [1] 1 1 1 1
>
> > findall("a.a", "ababacababab")
> [1] 1 3 5 7 9
> > gregexpr("a(?=.a)", "ababacababab", perl = TRUE)
> [[1]]
> [1] 1 3 5 7 9
> attr(,"match.length")
> [1] 1 1 1 1 1


Thanks, somehow I did not realize that the expression inside (?=...)
can itself be a regular expression.
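
The lookahead idea can be taken one step further by wrapping the whole pattern in (?=...), so that every match is zero-width and overlapping occurrences are all reported. The following is a minimal sketch, assuming gregexpr() with perl = TRUE; the helper name count_overlapping is hypothetical and does not come from the thread.

```r
# Hypothetical helper (not from the thread): count overlapping matches by
# wrapping the whole pattern in a zero-width lookahead.
count_overlapping <- function(pat, txt) {
  stopifnot(length(pat) == 1, length(txt) == 1)
  m <- gregexpr(paste0("(?=", pat, ")"), txt, perl = TRUE)[[1]]
  # gregexpr() returns -1 when nothing matches, so length(m) alone would be 1
  if (m[1] == -1) 0L else length(m)
}

count_overlapping("aa", "aaa")            # 2
count_overlapping("aba", "ababacababab")  # 4
count_overlapping("x", "aaa")             # 0
```

The explicit check for -1 matters because length(gregexpr(...)[[1]]) is 1 both for a single match and for no match at all.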

My original problem was to find all three-character matches where the
first and the last characters are the same.  With  findall()  it works like:

findall("(.).\\1", "ababacababab")
# [1]  1  2  3  5  7  8  9 10

I am still not able to reproduce this with lookahead. Attempts with

gregexpr("(.)?=.\\1", "ababacababab", perl = TRUE)

do not work as the lookahead expression apparently does not know about
the captured group from before.
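
One possible explanation for the failed attempt: as written, the pattern has no parentheses around ?=..., so the regex engine would read the `?` as an optional quantifier and the `=` as a literal character rather than as a lookahead. In PCRE a backreference may appear inside a lookahead, so a sketch along the following lines should reproduce the findall() result (the variable names txt, m and pos are illustrative only).

```r
txt <- "ababacababab"

# Capture one character, then look ahead for any character followed by
# the same captured character, without consuming the lookahead.
m <- gregexpr("(.)(?=.\\1)", txt, perl = TRUE)[[1]]

pos <- as.integer(m)  # drop the match attributes, keep the start positions
pos
# [1]  1  2  3  5  7  8  9 10
```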

Regards
Hans Werner

Correction: I meant the '\G' metacharacter in Perl, not a modifier.


> On Sun, Dec 20, 2009 at 7:22 AM, Hans W Borchers
> <hwborchers at googlemail.com> wrote:
>> Gabor Grothendieck <ggrothendieck at gmail.com> writes:
>>
>> [Sorry; Gmane forces me to delete more quoted text.]
>>
>>
>>    findall <- function(apat, atxt) {
>>      stopifnot(length(apat) == 1, length(atxt) == 1)
>>      pos <- c()  # positions of matches
>>      i <- 1; n <- nchar(atxt)
>>      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
>>      while (found > 0) {
>>        pos <- c(pos, i + found - 1)
>>        i <- i + found
>>        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
>>      }
>>      return(pos)
>>    }
  
 
