Re: [R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Wacek Kusnierczyk Mon, 15 Dec 2008 10:30:02 -0800

Greg Snow wrote:
>>
>>>> mystr <- paste(rep("1122", 10), collapse="")
>>>> n <- nchar(mystr)
>>>>
>>>> mystr2 <- substr(rep(mystr,n), 1:n, n)
>>>>
>>>> tmp <- regexpr("^11221122", mystr2)
>>>> (tmp + 1:n - 1)[tmp>0]
>>>>
>>>>         
>>> [1]  1  5  9 13 17 21 25 29 33
>>>
>>>       
>>>> attr(tmp,"match.length")[tmp>0]
>>>>
>>>>         
>>> [1] 8 8 8 8 8 8 8 8 8
>>>
>>>
>>>       
>> while not exactly what i meant, this is an implementation of one of the
>> approaches mentioned below,
>>     
>
> Yes
>
>   
>>  ith care taken not to report duplicate
>> matches:
>>     
>
> The use of regexpr rather than gregexpr and the '^' added to the beginning of 
> the pattern were included to prevent duplicate matches.  This works for the 
> original example and all the cases that I can think of.  If there is some way 
> for this strategy to find duplicate matches (without going to extra effort to 
> negate the effect of '^') I would be interested in learning about it.
>
>
>


yes, the use of '^' was what i had in mind, and i'm not sure if there's
a better solution. 


>>>> sequentially perform single matches on successive substrings of the
>>>> input string (which can give you the same match more than once,
>>>> though).
>>>>         
>> one issue with your solution is that it allocates n substrings at the
>> same time, which requires O(n^2) space (with n the length of the
>> original string), but it may be faster than a for loop matching one
>> substring at a time.
>>     
>
> I claimed it was 'one way' not the best.  In fact I hope that there are 
> better ways, but I expect that other methods that are better in one way may 
> be worse in others.  If the storage is an issue, do it in a loop rather than 
> creating all the substrings up front.  Memory and time could also be saved by 
> not creating the substrings that are shorter than the matching pattern, I did 
> not do this because I am lazy and figure that the amount of time/memory saved 
> in this example is probably less than the time and effort needed to type in 
> the correction.
>
> If memory, speed, or other efficiencies are really that big an issue, then it 
> may be better to use a different tool (your perl code for example), but then 
> it moves beyond the scope of this list.
>
> I think the biggest issue with my solution is that it currently only works on 
> single strings and will not scale easily to finding all the matches in a 
> vector of strings.  But without additional information on the real problem, 
> it is hard to know which of memory, time, scalability, etc. is the biggest 
> issue (if an issue at all) to address.
>
>   

did not say this was a bad solution.  something like the following,
which, as far as i get it, may avoid the O(n^2) space problem, consumes
approx. an order of magnitude more time:

# string - the tested string
# pattern - the pattern, with '^' as needed

n = nchar(string)
sapply(1:n, function(i) regexpr(pattern, substr(string, i, n)))
# filtering as needed


vQ

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] [Rd] gregexpr - match overlap mishandled (PR#13391)

Reply via email to