Greg Snow wrote:
>>
>>>> mystr <- paste(rep("1122", 10), collapse="")
>>>> n <- nchar(mystr)
>>>>
>>>> mystr2 <- substr(rep(mystr,n), 1:n, n)
>>>>
>>>> tmp <- regexpr("^11221122", mystr2)
>>>> (tmp + 1:n - 1)[tmp>0]
>>>>
>>> [1] 1 5 9 13 17 21 25 29 33
>>>
>>>> attr(tmp,"match.length")[tmp>0]
>>>>
>>> [1] 8 8 8 8 8 8 8 8 8
>>>
>> while not exactly what i meant, this is an implementation of one of the
>> approaches mentioned below,
>>
>
> Yes
>
>> with care taken not to report duplicate
>> matches:
>>
>
> The use of regexpr rather than gregexpr and the '^' added to the beginning of
> the pattern were included to prevent duplicate matches.  This works for the
> original example and all the cases that I can think of.  If there is some way
> for this strategy to find duplicate matches (without going to extra effort to
> negate the effect of '^') I would be interested in learning about it.
>
yes, the use of '^' was what i had in mind, and i'm not sure if there's
a better solution.

>>>> sequentially perform single matches on successive substrings of the
>>>> input string (which can give you the same match more than once,
>>>> though).
>>>>
>> one issue with your solution is that it allocates n substrings at the
>> same time, which requires O(n^2) space (with n the length of the
>> original string), but it may be faster than a for loop matching one
>> substring at a time.
>>
>
> I claimed it was 'one way' not the best.  In fact I hope that there are
> better ways, but I expect that other methods that are better in one way may
> be worse in others.  If the storage is an issue, do it in a loop rather than
> creating all the substrings up front.  Memory and time could also be saved by
> not creating the substrings that are shorter than the matching pattern, I did
> not do this because I am lazy and figure that the amount of time/memory saved
> in this example is probably less than the time and effort needed to type in
> the correction.
>
> If memory, speed, or other efficiencies are really that big an issue, then it
> may be better to use a different tool (your perl code for example), but then
> it moves beyond the scope of this list.
>
> I think the biggest issue with my solution is that it currently only works on
> single strings and will not scale easily to finding all the matches in a
> vector of strings.  But without additional information on the real problem,
> it is hard to know which of memory, time, scalability, etc. is the biggest
> issue (if an issue at all) to address.
>

i did not say this was a bad solution.  something like the following,
which, as far as i can tell, may avoid the O(n^2) space problem, consumes
approx.
an order of magnitude more time:

    # string - the tested string
    # pattern - the pattern, with '^' as needed
    n = nchar(string)
    sapply(1:n, function(i) regexpr(pattern, substr(string, i, n)))
    # filtering as needed

vQ

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
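
For reference, the one-substring-at-a-time variant discussed in this thread
can be written out in runnable form roughly as below. This is only a sketch
under stated assumptions: `find_overlapping` and its `min_len` argument are
hypothetical names that do not appear in the thread, and `min_len` is assumed
to be a caller-supplied lower bound on the match length, so that suffixes too
short to match are never created (the saving mentioned in the quoted reply).
Only one suffix exists at a time, so the extra space should stay linear in the
string length, at the cost of n separate regexpr calls.

```r
# Hypothetical helper (not from the thread): start positions of all,
# possibly overlapping, matches of `pattern` in a single string.
# `pattern` is expected to begin with '^', as in the thread; the `== 1L`
# test also rejects matches that start later in a suffix if '^' is omitted.
find_overlapping <- function(string, pattern, min_len = 1L) {
  n <- nchar(string)
  last <- n - min_len + 1L            # suffixes starting later are too short
  if (last < 1L) return(integer(0))
  hit <- vapply(seq_len(last), function(i) {
    # as.integer() drops the match.length attribute before comparing
    as.integer(regexpr(pattern, substr(string, i, n))) == 1L
  }, logical(1))
  which(hit)                          # the "filtering as needed" step
}

mystr <- paste(rep("1122", 10), collapse = "")
starts <- find_overlapping(mystr, "^11221122", min_len = 8L)
print(starts)
# [1]  1  5  9 13 17 21 25 29 33
```

The positions agree with the output of the vectorized version quoted at the
top of this message. No timing claim is made for this sketch; whether it is
faster or slower than allocating all substrings up front would have to be
measured on the real data.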