On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:

Thanks for lending a helping hand.
I put together a self-contained example. Basically, it all relies on a couple of functions, where one function simply iterates the application of the other function. I am trying to implement the so-called Lempel-Ziv entropy estimator. The idea is to choose a position i along a string x (standing for a time series) and find the length of the shortest string starting from i which has never occurred before i. Please find below the R snippet which requires an input file (a simple text file) you can download from

http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (less than 2000 entries) and I have not experienced the same problem even with longer lists.

But maybe your loop terminated earlier in them? Somewhere between 11*225 and 11*240 characters the grepping machine gives up:

> eprs <- paste(rep("aaaaaaaaaa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aaaaaaaaaa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression 'aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaaaaaaa#aaaaa
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error:  'Out of memory'
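The warning points at regcomp, i.e. at compiling the pattern as a regular expression. Since the pattern here is a literal string, one thing worth trying is the fixed = TRUE argument of grepl, which asks for a plain substring search. A minimal sketch, assuming the limit really does come from regex compilation and reusing the eprs string from the failing case above:

```r
# Same 240-block string that failed above when treated as a regular expression.
eprs <- paste(rep("aaaaaaaaaa", 240), collapse = "#")

# 240 blocks of 10 characters plus 239 "#" separators.
n_chars <- nchar(eprs)

# fixed = TRUE: the pattern is matched literally, so no regex is compiled.
hit <- grepl(eprs, eprs, fixed = TRUE)
hit
```

Whether the plain regex call fails at this length may depend on the R version and platform, so the threshold seen above should not be read as a fixed constant.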

The complexity of the problem may depend on the distribution of values. You have a very skewed distribution, with the vast majority of entries taking the same value that appears in your error message:

> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem somewhat improbable under a random null hypothesis):

> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] "12653a6"
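A rough back-of-the-envelope check of why such a clump matters, offered as a sketch rather than a diagnosis: inside a run of identical tokens, the longest stretch that has already occurred earlier is about half the run, so the pattern handed to grepl can grow to several hundred tokens. The helper pattern_chars below is hypothetical and only counts characters the way the "#"-joined strings in the posted code are built.

```r
# Characters in a pattern of k copies of a token joined by "#".
pattern_chars <- function(k, token = "12653a6") {
  nchar(paste(rep(token, k), collapse = "#"))
}

run_length <- 1159

# Assumption: near the middle of the run, roughly half the run
# has already been seen and can still be matched.
k_mid <- run_length %/% 2
chars_mid <- pattern_chars(k_mid)
chars_mid
```

That is well over 4,000 characters, above the range between 11*225 and 11*240 characters where the regex call gave up in the test above.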

HTH (although I think it means you need to construct a different implementation strategy).

David.


Many thanks

Lorenzo

######################################


# Total Lempel-Ziv entropy estimate for the sequence x.
total_entropy_lz <- function(x){
  if (length(x)==1){
    print("sequence too short")
    return("error")
  } else{
    n <- length(x)
    prefactor <- 1/(n*log(n)/log(2))   # 1/(n*log2(n))
    n_seq <- seq(n)
    entropy_list <- n_seq
    # match length at every position i
    for (i in n_seq){
      entropy_list[i] <- entropy_lz(x,i)
    }
  }
  total_entropy <- 1/(prefactor*sum(entropy_list))
  return(total_entropy)
}

# Length of the shortest sequence starting at position i
# that does not occur in x before position i.
entropy_lz <- function(x,i){
  past <- x[1:i-1]   # parsed as (1:i)-1, i.e. indices 0..(i-1); index 0 is dropped
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse="#")
  while (go_on>0){
    new_seq <- x[i:(i+count_len)]
    fut_string <- paste(new_seq, collapse="#")
    count_len <- count_len+1
    # fut_string is treated as a regular expression here
    if (grepl(fut_string,past_string)!=1){
      go_on <- -1
    }
  }
  return(count_len)
}

x <- scan("time_series25_.dat", what="")

S <- total_entropy_lz(x)
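One possible variant of the matching step is sketched below. It is not the original poster's code: entropy_lz_fixed is a hypothetical name, and the sketch makes three assumptions. It uses fixed = TRUE so that the pattern is searched literally instead of being compiled as a regular expression, it wraps both strings in "#" so that a token can only match a whole token, and it stops at the end of the series instead of indexing past it (in that case it returns the number of remaining elements, which differs slightly from the function above).

```r
# Hypothetical variant of entropy_lz(): literal matching, whole tokens only.
entropy_lz_fixed <- function(x, i) {
  n <- length(x)
  # Everything strictly before position i, delimited on both sides.
  past_string <- paste0("#", paste(x[seq_len(i - 1)], collapse = "#"), "#")
  count_len <- 0
  repeat {
    # Assumption: stop when the candidate would run past the end of x.
    if (i + count_len > n) break
    new_seq <- x[i:(i + count_len)]
    fut_string <- paste0("#", paste(new_seq, collapse = "#"), "#")
    count_len <- count_len + 1
    # fixed = TRUE: plain substring search, no regex compilation.
    if (!grepl(fut_string, past_string, fixed = TRUE)) break
  }
  count_len
}

# Toy series: at position 3, "a" and "a b" were seen before, "a b c" was not.
toy <- c("a", "b", "a", "b", "c")
entropy_lz_fixed(toy, 3)   # 3
```

Calling this from total_entropy_lz in place of entropy_lz would be the obvious way to try it on the real data, though the string search remains costly for long runs.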






On 10/08/2010 07:30 PM, jim holtman wrote:
More specificity: how long is the string, and what is the pattern you are
matching against?  It sounds like you might have a complex pattern
that, in trying to match the string, might be doing a lot of
backtracking and such.  There is an O'Reilly book, Mastering Regular
Expressions, that might help you understand what might be happening. So
if you can provide a better example than just the error message, it
would be helpful.

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella <lorenzo.ise...@gmail.com> wrote:
Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
 invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call grepl on very long strings to check whether a certain substring is part of a longer string.
Now, the script technically works (it never crashes when I run it on a smaller dataset) and the problem does not seem to be RAM (I have several GB of RAM on my machine and its consumption never shoots up, so my machine never resorts to swap).
So (though I am not an expert) it looks like the problem is some limitation of grepl or of R's memory management.
Any idea how I could tackle this problem, or how I could profile my code to fix it? (It really seems to me that I have to find a way to allow R to process longer strings.)
Any suggestion is appreciated.
Cheers

Lorenzo

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.






David Winsemius, MD
West Hartford, CT
