Re: [R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string ?
Gabor Grothendieck ggrothendieck at gmail.com writes:

Use a zero-width lookaround expression. It will not consume its match. See ?regexp

    gregexpr("a(?=a)", "aaa", perl = TRUE)
    [[1]]
    [1] 1 2
    attr(,"match.length")
    [1] 1 1

I wonder how you would count the number of occurrences of, for example, 'aba' or 'a.a' (*) in the string 'ababacababab' using simple lookahead?

In Perl, there is a modifier '/g' to do that, and in Python one could apply the function 'findall'.

When I had this task, I wrote a small function findall(), see below, but I would be glad to see a solution with lookahead only.

Regards
Hans Werner

(*) or anything more complex

    findall <- function(apat, atxt) {
      stopifnot(length(apat) == 1, length(atxt) == 1)
      pos <- c()  # positions of matches
      i <- 1; n <- nchar(atxt)
      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
      while (found > 0) {
        pos <- c(pos, i + found - 1)
        i <- i + found  # restart one character after the last match start
        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
      }
      return(pos)
    }

On Sun, Dec 20, 2009 at 1:43 AM, Jonathan jonsleepy at gmail.com wrote:

Last one for you guys:

The command:

    length(gregexpr('cus','hocus pocus')[[1]])
    [1] 2

returns the number of times the substring 'cus' appears in 'hocus pocus' (which is two).

It's returning the number of **disjoint** matches. So:

    length(gregexpr('aa','aaa')[[1]])
    [1] 1

returns 1.

**What I want to do:** I'm looking for a way to count all occurrences of the substring, including overlapping ones (so 'aa' would be found in 'aaa' two times, because the middle 'a' gets counted twice).

Any ideas would be much appreciated!!

Signing off and thanks for all the great assistance,
Jonathan

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
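
For the original counting question, one possible sketch is to wrap the whole pattern in a zero-width lookahead, so that nothing is consumed and the search resumes one character after each match start. The helper name count_overlapping is hypothetical (it does not appear in the thread), and the sketch assumes the pattern is valid PCRE and that gregexpr() steps past empty matches as described.

```r
# Hypothetical helper, not from the thread: counts overlapping matches
# by turning the whole pattern into a zero-width lookahead.
count_overlapping <- function(pattern, text) {
  stopifnot(length(pattern) == 1, length(text) == 1)
  # "(?=...)" matches the empty string wherever `pattern` could start
  m <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  # gregexpr() signals "no match" with a single -1
  if (m[1] == -1) 0L else length(m)
}

count_overlapping("aa", "aaa")            # 2
count_overlapping("cus", "hocus pocus")   # 2
count_overlapping("aba", "ababacababab")  # 4
count_overlapping("a.a", "ababacababab")  # 5
```

Because the lookahead group is non-capturing, any backreference numbers inside the pattern stay unchanged.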
Re: [R] how to count the total number of (INCLUDING overlapping) occurrences of a substring within a string ?
Gabor Grothendieck ggrothendieck at gmail.com writes:

Try this:

    findall("aba", "ababacababab")
    [1] 1 3 7 9

    gregexpr("a(?=ba)", "ababacababab", perl = TRUE)
    [[1]]
    [1] 1 3 7 9
    attr(,"match.length")
    [1] 1 1 1 1

    findall("a.a", "ababacababab")
    [1] 1 3 5 7 9

    gregexpr("a(?=.a)", "ababacababab", perl = TRUE)
    [[1]]
    [1] 1 3 5 7 9
    attr(,"match.length")
    [1] 1 1 1 1 1

Thanks --- somehow I did not realize that the expression in ?=... can also be a regular expression.

My original problem was to find all three-character matches where the first and the last character are the same. With findall() it works like this:

    findall("(.).\\1", "ababacababab")
    # [1] 1 2 3 5 7 8 9 10

I am still not able to reproduce this with lookahead. Attempts with

    gregexpr("(.)?=.\\1", "ababacababab", perl = TRUE)

do not work, as the lookahead expression apparently does not know about the captured group from before.

Regards
Hans Werner

Correction: I meant the '\G' metacharacter in Perl, not a modifier.

On Sun, Dec 20, 2009 at 7:22 AM, Hans W Borchers hwborchers at googlemail.com wrote:

Gabor Grothendieck ggrothendieck at gmail.com writes:

[Sorry; Gmane forces me to delete more quoted text.]

    findall <- function(apat, atxt) {
      stopifnot(length(apat) == 1, length(atxt) == 1)
      pos <- c()  # positions of matches
      i <- 1; n <- nchar(atxt)
      found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
      while (found > 0) {
        pos <- c(pos, i + found - 1)
        i <- i + found  # restart one character after the last match start
        found <- regexpr(apat, substr(atxt, i, n), perl=TRUE)
      }
      return(pos)
    }
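
On the open backreference question, a tentative sketch: as the attempt appears in this archive, "(.)?=.\\1" has no parentheses around the lookahead, so PCRE would read it as an optional group followed by a literal '=' rather than as a lookahead (the archive may have mangled the original). Two variants that might do what was intended are shown below; the variable names are assumptions for illustration only.

```r
txt <- "ababacababab"

# Variant 1: put the capture group inside the lookahead. The match itself
# is empty, so every start position is tested.
m_inside <- gregexpr("(?=(.).\\1)", txt, perl = TRUE)[[1]]

# Variant 2: consume only the first character and let the lookahead
# refer back to it; the lookahead needs its own parentheses.
m_outside <- gregexpr("(.)(?=.\\1)", txt, perl = TRUE)[[1]]

as.integer(m_inside)   # 1 2 3 5 7 8 9 10
as.integer(m_outside)  # 1 2 3 5 7 8 9 10
```

Both agree with the findall("(.).\\1", ...) result quoted above, which suggests that a lookahead can see a group captured earlier in the same pattern.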