Re: [R] Memory management in R

2010-10-10 Thread Mike Marchywka








> Date: Sun, 10 Oct 2010 15:27:11 +0200
> From: lorenzo.ise...@gmail.com
> To: dwinsem...@comcast.net
> CC: r-help@r-project.org
> Subject: Re: [R] Memory management in R
>
>
> > I already offered the Biostrings package. It provides more robust
> > methods for string matching than does grepl. Is there a reason that you
> > choose not to?
> >
>
> Indeed, that is the way I should go, and I have installed the package
> after some struggling. Since Biostrings is a fairly complex package and I
> need only a way to check whether a certain string A is a substring of string B,
> do you know which Biostrings functions achieve this?
> I see a lot of methods for biological (DNA, RNA) sequences, and they may
> not apply to my series (which are definitely not from biology).

Generally the differences relate to the alphabet and the "things you may want
to know about them." Unless you are looking for reverse-complement
text strings, there will be a lot of stuff you don't need. Offhand,
I'd be looking at things like computational linguistics packages,
as you are looking for patterns or predictability in human-readable
character sequences. Now, humans can probably write hairpin text (look
at what RNA can do, LOL), but that is probably not what you care about.

However, as I mentioned earlier, I had to write my own regex compiler
(coincidentally for bio apps) to get the required performance. Your application
and understanding may benefit from things like building dictionaries, which
aren't really part of regex and which can easily be done in a few lines of C++
code using STL containers. To get statistically meaningful samples, you will
almost certainly need faster code.
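
The dictionary idea can be sketched in R as well as in C++. The snippet below is only an illustration under stated assumptions: ngram_keys is a hypothetical helper (not from any package), tokens are assumed never to contain the "#" separator, and a plain character vector with %in% stands in for a real hash container.

```r
# Hypothetical helper: all contiguous runs of k tokens, each joined into one key.
ngram_keys <- function(x, k, sep = "#") {
  n <- length(x)
  if (k < 1 || k > n) return(character(0))
  vapply(seq_len(n - k + 1),
         function(s) paste(x[s:(s + k - 1)], collapse = sep),
         character(1))
}

x <- c("c", "a", "b", "c", "a", "b", "e", "z")
past <- x[1:3]

# Dictionary of every pair of adjacent tokens seen in the past.
dict <- unique(ngram_keys(past, 2))

# Membership is a plain lookup; no regular expression is compiled.
seen_ca <- paste(x[4:5], collapse = "#") %in% dict
seen_be <- paste(x[6:7], collapse = "#") %in% dict
```

Here seen_ca is TRUE because the pair "c","a" already occurs in the past, while seen_be is FALSE.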




> Cheers
>
> Lorenzo
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
  


Re: [R] Memory management in R

2010-10-10 Thread Lorenzo Isella



I already offered the Biostrings package. It provides more robust
methods for string matching than does grepl. Is there a reason that you
choose not to?



Indeed, that is the way I should go, and I have installed the package
after some struggling. Since Biostrings is a fairly complex package and I
need only a way to check whether a certain string A is a substring of string B,
do you know which Biostrings functions achieve this?
I see a lot of methods for biological (DNA, RNA) sequences, and they may 
not apply to my series (which are definitely not from biology).
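
For the narrow question of whether one token sequence occurs inside another, a literal match may be enough. The sketch below assumes that tokens never contain the "#" separator; contains_seq is a hypothetical helper name. It relies on the fixed = TRUE argument of grepl(), which treats the pattern as a plain string instead of compiling it as a regular expression, and it wraps both sides in separators so that a token cannot match the tail of a longer token.

```r
# Hypothetical helper: does the token sequence `a` occur contiguously in `b`?
contains_seq <- function(a, b, sep = "#") {
  # Leading and trailing separators force matches to align on token boundaries.
  wrap <- function(v) paste0(sep, paste(v, collapse = sep), sep)
  grepl(wrap(a), wrap(b), fixed = TRUE)
}

hit  <- contains_seq(c("a", "b"), c("c", "a", "b"))
miss <- contains_seq(c("b", "c"), c("c", "a", "b"))

# A long run of one repeated token, similar in size to the clump in the data.
big <- rep("12653a6", 1159)
long_hit <- contains_seq(big[1:1000], big)
```

Whether this is fast enough for a full series is a separate question; it only sidesteps regex compilation.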

Cheers

Lorenzo



Re: [R] Memory management in R

2010-10-09 Thread David Winsemius


On Oct 9, 2010, at 4:23 PM, Lorenzo Isella wrote:




My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.)


Hi,
I am trying to partially implement the Lempel-Ziv compression algorithm.
The point is that compressibility and entropy of a time series are
related, hence my final goal is to evaluate the entropy of a time series.

You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt




The two that have occurred to me are Biostrings, which I have already
mentioned, and rle(), which I have illustrated the use of but not referenced
as an avenue. The Biostrings package is part of Bioconductor (part of the R
universe), although you should be prepared for a coffee break when you
install it if you haven't already got at least biocLite installed.
When I installed it last night, 54 other packages were also downloaded and
installed as dependencies. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life might be
a smart strategy. You have not described the domain you are working in,
but I would guess that the "digest" package might be biological in
primary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me, but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it. He's
a _lot_ smarter than I.) In your example the long "x" vector is
faithfully represented by two aligned vectors, each of length 197.
The long repeated sequence that broke the grepl mechanism is just
one pair of values.
> rle(x)
Run Length Encoding
lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the
overall length (as happened in the "x" case) you could stop, since it
could not have "occurred before".



I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ)
algorithm in a trivial way. As a less convoluted example, consider
the series

x <- c("d","a","b","d","a","b","e","z")

If i=4, so that the i-th element is the second 'd' in the
series, the shortest series starting from i=4 that I do not see in
the past of 'd' is "d","a","b","e", whose length is equal to 4, and that
is the value returned by the function below.
The frustrating thing is that I already have the tools I need; they just
crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that would be a
big help for me.


I already offered the Biostrings package. It provides more robust  
methods for string matching than does grepl. Is there a reason that  
you choose not to?


--
David.

Cheers

Lorenzo

###
entropy_lz <- function(x,i){
  # 1:i-1 parses as (1:i)-1, i.e. 0:(i-1); the zero index is dropped,
  # so this selects the elements strictly before position i.
  past <- x[1:i-1]
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse="#")
  while (go_on>0){
    new_seq <- x[i:(i+count_len)]
    fut_string <- paste(new_seq, collapse="#")
    count_len <- count_len+1
    # Stop at the first candidate that is not found in the past.
    if (grepl(fut_string,past_string)!=1){
      go_on <- -1
    }
  }
  return(count_len)
}

x <- c("c","a","b","c","a","b","e","z")

S <- entropy_lz(x,4)


David Winsemius, MD
West Hartford, CT



Re: [R] Memory management in R

2010-10-09 Thread Lorenzo Isella



My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.)


Hi,
I am trying to partially implement the Lempel Ziv compression algorithm.
The point is that compressibility and entropy of a time series are 
related, hence my final goal is to evaluate the entropy of a time series.

You can find more at

http://bit.ly/93zX4T
http://en.wikipedia.org/wiki/LZ77_and_LZ78
http://bit.ly/9NgIFt




The two that have occurred to me are Biostrings, which I have already
mentioned, and rle(), which I have illustrated the use of but not referenced
as an avenue. The Biostrings package is part of Bioconductor (part of the R
universe), although you should be prepared for a coffee break when you
install it if you haven't already got at least biocLite installed.
When I installed it last night, 54 other packages were also downloaded and
installed as dependencies. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life might be
a smart strategy. You have not described the domain you are working in,
but I would guess that the "digest" package might be biological in
primary application? So forgive me if I am preaching to the choir.

The rle option also occurred to me, but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it. He's
a _lot_ smarter than I.) In your example the long "x" vector is
faithfully represented by two aligned vectors, each of length 197.
The long repeated sequence that broke the grepl mechanism is just
one pair of values.
 > rle(x)
Run Length Encoding
lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the
overall length (as happened in the "x" case) you could stop, since it
could not have "occurred before".



I doubt that rle() can be deployed to replace the Lempel-Ziv (LZ) algorithm
in a trivial way. As a less convoluted example, consider the series

x <- c("d","a","b","d","a","b","e","z")

If i=4, so that the i-th element is the second 'd' in the series,
the shortest series starting from i=4 that I do not see in the past of
'd' is "d","a","b","e", whose length is equal to 4, and that is the value
returned by the function below.
The frustrating thing is that I already have the tools I need; they just
crash for reasons beyond my control on relatively short series.
If anyone can make the function below more robust, that would be a big
help for me.

Cheers

Lorenzo

###
entropy_lz <- function(x,i){
  # 1:i-1 parses as (1:i)-1, i.e. 0:(i-1); the zero index is dropped,
  # so this selects the elements strictly before position i.
  past <- x[1:i-1]
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse="#")
  while (go_on>0){
    new_seq <- x[i:(i+count_len)]
    fut_string <- paste(new_seq, collapse="#")
    count_len <- count_len+1
    # Stop at the first candidate that is not found in the past.
    if (grepl(fut_string,past_string)!=1){
      go_on <- -1
    }
  }
  return(count_len)
}

x <- c("c","a","b","c","a","b","e","z")

S <- entropy_lz(x,4)
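
One possible way to make the search above robust is to compare the vectors directly, so that no regular expression is ever compiled. This is a rough sketch, not a tuned implementation: entropy_lz_vec is a hypothetical name, and it is meant to return the same value as entropy_lz on the small example (4 for i = 4). It does a brute-force scan of the past for every candidate length, so it may be slow on long series.

```r
# Hypothetical rewrite: length of the shortest run starting at i
# that does not occur contiguously in x[1:(i-1)].
entropy_lz_vec <- function(x, i) {
  n <- length(x)
  past <- x[seq_len(i - 1)]
  k <- 0
  repeat {
    k <- k + 1
    # Running past the end of the series counts as "not seen".
    if (i + k - 1 > n) break
    cand <- x[i:(i + k - 1)]
    # Every start position in the past where a run of length k fits.
    starts <- seq_len(max(length(past) - k + 1, 0))
    seen <- any(vapply(starts,
                       function(s) identical(past[s:(s + k - 1)], cand),
                       logical(1)))
    if (!seen) break
  }
  k
}

x <- c("c", "a", "b", "c", "a", "b", "e", "z")
S_vec <- entropy_lz_vec(x, 4)
```

Because elements are compared as whole vector entries, the "#" separator and any partial-token matches drop out of the picture.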



Re: [R] Memory management in R

2010-10-09 Thread David Winsemius


On Oct 9, 2010, at 9:45 AM, Lorenzo Isella wrote:


Hi David,
I am replying to you and to the other people who provided some  
insight into my problems with grepl.

Well, at least we now know that the bug is reproducible.
Indeed, the sequence I am postprocessing is a strange one,
probably pathological to some extent; nevertheless, it has to be
acknowledged that grepl crashes when a long (but not huge) chunk of
repeated data is loaded.
Now, my problem is the following: given a potentially long string
(or before that a sequence, where every element has been generated
via the hash function, algo='crc32' of the digest package), how can
I, starting from an arbitrary position i along the list, calculate
the shortest substring in the future of i (i.e. the interval i:end
of the series) that has not occurred in the past of i (i.e. [1:(i-1)])?


Maybe you should work on a less convoluted explanation of the test? Or  
perhaps a couple of compact examples, preferably in R-copy-paste format?


Efficiency is not the main point here; I need to run this code only
once to get what I need, but it cannot crash on a 2000-entry string.


My suggestion is to explore other alternatives. (I will admit that I
don't yet fully understand the test that you are applying.) The two
that have occurred to me are Biostrings, which I have already mentioned,
and rle(), which I have illustrated the use of but not referenced as an
avenue. The Biostrings package is part of Bioconductor (part of the R
universe), although you should be prepared for a coffee break when you
install it if you haven't already got at least biocLite installed.
When I installed it last night, 54 other packages were also downloaded
and installed as dependencies. It seems to me that taking advantage of the
coding resources in the molecular biology domain that are currently
directed at decoding the information storage mechanism of life might
be a smart strategy. You have not described the domain you are working
in, but I would guess that the "digest" package might be biological in
primary application? So forgive me if I am preaching to the choir.


The rle option also occurred to me, but it might take a smarter coder
than I to fully implement it. (But maybe Holtman would be up to it.
He's a _lot_ smarter than I.) In your example the long "x" vector is
faithfully represented by two aligned vectors, each of length 197.
The long repeated sequence that broke the grepl mechanism is
just one pair of values.

> rle(x)
Run Length Encoding
  lengths: int [1:197] 1 1 2 1 1 4 1 9 1 1 ...
  values : chr [1:197] "5d64d58a" "ac76183b" "202fbcc4" "78087f5e" ...

So maybe as soon as you got to a bundle that was greater than 1/2 the  
overall length (as happened in the "x" case) you could stop, since it  
could not have "occurred before".
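
As a small illustration of the run-length idea (a toy vector, not the real data), the sketch below shows what rle() stores, that inverse.rle() recovers the original vector, and how the "longer than half the series" check could be phrased. The variable names are made up for this example.

```r
# Toy series with one dominant clump.
x <- c("a", rep("b", 5), "c")

r <- rle(x)
r$lengths   # run lengths, one per run
r$values    # the value repeated in each run

# The encoding is lossless: the original vector can be rebuilt.
roundtrip_ok <- identical(inverse.rle(r), x)

# Is the longest run more than half of the whole series?
longest <- max(r$lengths)
dominant <- longest > length(x) / 2
```

In this toy case the run of "b" has length 5 out of 7, so dominant is TRUE.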


--
David.



Cheers

Lorenzo


On 10/09/2010 01:30 AM, David Winsemius wrote:


What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with longer
lists.


But maybe your loop terminated earlier in them? Someplace between
11*225 and 11*240 the grepping machine gives up:

> eprs <- paste(rep("aa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of values.
You have a very skewed distribution, with the vast majority being
the same value that appeared in your error message:

> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat
improbable under a random null hypothesis):

> max(rle(x)$lengths)
[1

Re: [R] Memory management in R

2010-10-09 Thread Lorenzo Isella

Hi David,
I am replying to you and to the other people who provided some insight 
into my problems with grepl.

Well, at least we now know that the bug is reproducible.
Indeed, the sequence I am postprocessing is a strange one, probably
pathological to some extent; nevertheless, it has to be acknowledged that
grepl crashes when a long (but not huge) chunk of repeated data is loaded.
Now, my problem is the following: given a potentially long string (or
before that a sequence, where every element has been generated via the
hash function, algo='crc32' of the digest package), how can I, starting
from an arbitrary position i along the list, calculate the shortest
substring in the future of i (i.e. the interval i:end of the series)
that has not occurred in the past of i (i.e. [1:(i-1)])?
Efficiency is not the main point here; I need to run this code only once
to get what I need, but it cannot crash on a 2000-entry string.

Cheers

Lorenzo


On 10/09/2010 01:30 AM, David Winsemius wrote:


What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with longer
lists.


But maybe your loop terminated earlier in them? Someplace between
11*225 and 11*240 the grepping machine gives up:

 > eprs <- paste(rep("aa", 225), collapse="#")
 > grepl(eprs, eprs)
[1] TRUE

 > eprs <- paste(rep("aa", 240), collapse="#")
 > grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of values.
You have a very skewed distribution, with the vast majority being
the same value that appeared in your error message:

 > table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be somewhat
improbable under a random null hypothesis):

 > max(rle(x)$lengths)
[1] 1159
 > which(rle(x)$lengths == 1159)
[1] 123
 > rle(x)$values[123]
[1] "12653a6"

HTH (although I think it means you need to construct a different
implementation strategy);

David.



Many thanks

Lorenzo






Re: [R] Memory management in R

2010-10-08 Thread David Winsemius


On Oct 8, 2010, at 9:19 PM, Mike Marchywka wrote:



From: dwinsem...@comcast.net
To: lorenzo.ise...@gmail.com
Date: Fri, 8 Oct 2010 19:30:45 -0400
CC: r-help@r-project.org
Subject: Re: [R] Memory management in R


On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:




Please find below the R snippet which requires an input file (a
simple text file) you can download from

http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (less than 2000
entries) and I have not experienced the same problem even with
longer lists.


But maybe your loop terminated earlier in them? Someplace between
11*225 and 11*240 the grepping machine gives up:


eprs <- paste(rep("aa", 225), collapse="#")
grepl(eprs, eprs)

[1] TRUE


eprs <- paste(rep("aa", 240), collapse="#")
grepl(eprs, eprs)

Error in grepl(eprs, eprs) :
invalid regular expression
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a
In addition: Warning message:
In grepl(eprs, eprs) : regcomp error: 'Out of memory'

The complexity of the problem may depend on the distribution of
values. You have a very skewed distribution with the vast majority
being in the same value as appeared in your error message :





HTH (although I think it means you need to construct a different
implementation strategy);


You really need to look at the question posed by your regex and consider
the complexity of what you are asking and what likely implementations
would do with your regex.


The R regex machine (at least on a Mac with R 2.11.1) breaks when the
length of the pattern argument exceeds 2559 characters. There is
no complexity for the regex parser here: no metacharacters were in
the string.
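
A small probe along the following lines could check whether a given R build refuses a long literal pattern, and at what size. regex_compiles is a hypothetical helper and the sizes are arbitrary; the result depends on the R version and regex library in use, so no particular outcome is claimed here, and newer builds may accept all of these patterns.

```r
# Hypothetical helper: TRUE if grepl() accepts `pattern` as a regex,
# FALSE if compiling it fails with an error.
regex_compiles <- function(pattern) {
  tryCatch(
    suppressWarnings({
      grepl(pattern, "")
      TRUE
    }),
    error = function(e) FALSE
  )
}

# Same kind of pattern as in the thread: "aa" repeated k times, joined by "#".
sizes <- c(10, 225, 240, 1000)
ok <- vapply(sizes, function(k) {
  regex_compiles(paste(rep("aa", k), collapse = "#"))
}, logical(1))

# Each pattern has 2*k characters of "aa" plus k-1 separators, i.e. 3*k - 1.
probe <- data.frame(repeats = sizes, pattern_chars = 3 * sizes - 1, compiles = ok)
```

Printing probe gives one row per size with a TRUE/FALSE flag for whether the pattern compiled.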



Something like this probably needs to be implemented
in dedicated code to handle the more general case or you need to determine
if input data is pathological given your regex.


There is a Biostrings package in BioC that may provide more robust  
treatment of long strings.


--
David.



Being able to write something
concisely doesn't mean the execution of that something is simple. Even if
it does manage to return a result, it likely will get very slow. In the
past I have had to write my own simple regex compilers to handle a limited
class of expressions to make the speed reasonable. In this case, depending
on your objectives, dedicated code may even be helpful to you in understanding
the algorithm.



David.



Many thanks

Lorenzo






David Winsemius, MD
West Hartford, CT



Re: [R] Memory management in R

2010-10-08 Thread Mike Marchywka








> From: dwinsem...@comcast.net
> To: lorenzo.ise...@gmail.com
> Date: Fri, 8 Oct 2010 19:30:45 -0400
> CC: r-help@r-project.org
> Subject: Re: [R] Memory management in R
>
>
> On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:
>

> > Please find below the R snippet which requires an input file (a
> > simple text file) you can download from
> >
> > http://dl.dropbox.com/u/5685598/time_series25_.dat
> >
> > What puzzles me is that the list is not really long (less than 2000
> > entries) and I have not experienced the same problem even with
> > longer lists.
>
> But maybe your loop terminated earlier in them? Someplace between
> 11*225 and 11*240 the grepping machine gives up:
>
> > eprs <- paste(rep("aa", 225), collapse="#")
> > grepl(eprs, eprs)
> [1] TRUE
>
> > eprs <- paste(rep("aa", 240), collapse="#")
> > grepl(eprs, eprs)
> Error in grepl(eprs, eprs) :
> invalid regular expression
> 'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a
> In addition: Warning message:
> In grepl(eprs, eprs) : regcomp error: 'Out of memory'
>
> The complexity of the problem may depend on the distribution of
> values. You have a very skewed distribution with the vast majority
> being in the same value as appeared in your error message :
>

>
> HTH (although I think it means you need to construct a different
> implementation strategy);

You really need to look at the question posed by your regex and consider 
the complexity of what you are asking and what likely implementations
would do with your regex. Something like this probably needs to be implemented
in dedicated code to handle the more general case or you need to determine
if input data is pathological given your regex. Being able to write something
concisely doesn't mean the execution of that something is simple. Even if
it does manage to return a result, it likely will get very slow. In the
past I have had to write my own simple regex compilers to handle a limited
class of expressions to make the speed reasonable. In this case, depending
on your objectives, dedicated code may even be helpful to you in understanding
the algorithm. 

>
> David.
>
>
> > Many thanks
> >
> > Lorenzo
> >

  


Re: [R] Memory management in R

2010-10-08 Thread David Winsemius


On Oct 8, 2010, at 6:42 PM, Lorenzo Isella wrote:


Thanks for lending a helping hand.
I put together a self-contained example. Basically, it all relies on  
a couple of functions, where one function simply iterates the  
application of the other function.
I am trying to implement the so-called Lempel-Ziv entropy estimator.  
The idea is to choose a position i along a string x (standing for a  
time series) and find the length of the shortest string starting  
from i which has never occurred before i.
Please find below the R snippet which requires an input file (a  
simple text file) you can download from


http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (less than 2000  
entries) and I have not experienced the same problem even with  
longer lists.


But maybe your loop terminated earlier in them? Someplace between
11*225 and 11*240 the grepping machine gives up:


> eprs <- paste(rep("aa", 225), collapse="#")
> grepl(eprs, eprs)
[1] TRUE

> eprs <- paste(rep("aa", 240), collapse="#")
> grepl(eprs, eprs)
Error in grepl(eprs, eprs) :
  invalid regular expression  
'aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#aa#a

In addition: Warning message:
In grepl(eprs, eprs) : regcomp error:  'Out of memory'

The complexity of the problem may depend on the distribution of
values. You have a very skewed distribution, with the vast majority
being the same value that appeared in your error message:


> table(x)
x
 12653a6 202fbcc4 48bef8c3 4e084ddc 51f342a4 5d64d58a 78087f5e abddf3d1
    1419      299        1        1        1        3        1        1
ac76183b b955be36 c600173a e96f6bbd e9c56275
       1       30        5        1        9

And you have 1159 of them in one clump (which would seem to be
somewhat improbable under a random null hypothesis):


> max(rle(x)$lengths)
[1] 1159
> which(rle(x)$lengths == 1159)
[1] 123
> rle(x)$values[123]
[1] "12653a6"

HTH (although I think it means you need to construct a different  
implementation strategy);


David.



Many thanks

Lorenzo

##


total_entropy_lz <- function(x){
  if (length(x)==1){
    print("sequence too short")
    return("error")
  } else{
    n <- length(x)
    prefactor <- 1/(n*log(n)/log(2))
    n_seq <- seq(n)
    entropy_list <- n_seq
    for (i in n_seq){
      entropy_list[i] <- entropy_lz(x,i)
    }
  }
  total_entropy <- 1/(prefactor*sum(entropy_list))
  return(total_entropy)
}


entropy_lz <- function(x,i){
  past <- x[1:i-1]
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse="#")
  while (go_on>0){
    new_seq <- x[i:(i+count_len)]
    fut_string <- paste(new_seq, collapse="#")
    count_len <- count_len+1
    if (grepl(fut_string,past_string)!=1){
      go_on <- -1
    }
  }
  return(count_len)
}

x <- scan("time_series25_.dat", what="")

S <- total_entropy_lz(x)






On 10/08/2010 07:30 PM, jim holtman wrote:

More specificity: how long is the string, what is the pattern you are
matching against?  It sounds like you might have a complex pattern
that in trying to match the string might be doing a lot of
backtracking and such.  There is an O'Reilly book, Mastering Regular
Expressions, that might help you understand what might be happening.  So
if you can provide a better example than just the error message, it
would be helpful.

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella wrote:

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
 invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#

Re: [R] Memory management in R

2010-10-08 Thread Lorenzo Isella

Thanks for lending a helping hand.
I put together a self-contained example. Basically, it all relies on a 
couple of functions, where one function simply iterates the application 
of the other function.
I am trying to implement the so-called Lempel-Ziv entropy estimator. The 
idea is to choose a position i along a string x (standing for a time 
series) and find the length of the shortest string starting from i which 
has never occurred before i.
Please find below the R snippet which requires an input file (a simple 
text file) you can download from


http://dl.dropbox.com/u/5685598/time_series25_.dat

What puzzles me is that the list is not really long (less than 2000 
entries) and I have not experienced the same problem even with longer lists.

Many thanks

Lorenzo

##


total_entropy_lz <- function(x){
  if (length(x)==1){
    print("sequence too short")
    return("error")
  } else{
    n <- length(x)
    prefactor <- 1/(n*log(n)/log(2))
    n_seq <- seq(n)
    entropy_list <- n_seq
    for (i in n_seq){
      entropy_list[i] <- entropy_lz(x,i)
    }
  }
  total_entropy <- 1/(prefactor*sum(entropy_list))
  return(total_entropy)
}


entropy_lz <- function(x,i){
  past <- x[1:i-1]
  n <- length(x)
  lp <- length(past)
  future <- x[i:n]
  go_on <- 1
  count_len <- 0
  past_string <- paste(past, collapse="#")
  while (go_on>0){
    new_seq <- x[i:(i+count_len)]
    fut_string <- paste(new_seq, collapse="#")
    count_len <- count_len+1
    if (grepl(fut_string,past_string)!=1){
      go_on <- -1
    }
  }
  return(count_len)
}

x <- scan("time_series25_.dat", what="")

S <- total_entropy_lz(x)
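
Read literally, the code above evaluates the expression below. The symbols are notation introduced here and not part of the original post: n is the length of the series, Λ_i is the value returned by entropy_lz(x, i), and S is the value returned by total_entropy_lz(x).

```latex
S \;=\; \frac{1}{\dfrac{1}{n \log_2 n} \sum_{i=1}^{n} \Lambda_i}
  \;=\; \frac{n \log_2 n}{\sum_{i=1}^{n} \Lambda_i}
```

The first form mirrors the code (prefactor times the sum, then inverted); the second is the same quantity simplified.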






On 10/08/2010 07:30 PM, jim holtman wrote:

More specificity: how long is the string, what is the pattern you are
matching against?  It sounds like you might have a complex pattern
that in trying to match the string might be doing a lot of
backtracking and such.  There is an O'Reilly book, Mastering Regular
Expressions, that might help you understand what might be happening.  So
if you can provide a better example than just the error message, it
would be helpful.

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella  wrote:

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
  invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash ->  total_entropy_lz ->  entropy_lz ->  grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call grepl
on very long strings to check whether a certain substring is part of a
longer string.
Now, the script technically works (it never crashes when I run it on a
smaller dataset) and the problem does not seem to be RAM memory (I have
several GB of RAM on my machine and its consumption never shoots up so my
machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some limitation
of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my code
to fix it (though it really seems to me that I have to find a way to allow R
to process longer strings)?
Any suggestion is appreciated.
Cheers

Lorenzo



Re: [R] Memory management in R

2010-10-08 Thread Mike Marchywka







> Date: Fri, 8 Oct 2010 13:30:59 -0400
> From: jholt...@gmail.com
> To: lorenzo.ise...@gmail.com
> CC: r-help@r-project.org
> Subject: Re: [R] Memory management in R
>
> More specificity: how long is the string, what is the pattern you are
> matching against? It sounds like you might have a complex pattern
> that in trying to match the string might be doing a lot of back
> tracking and such. There is an O'Reilly book on Mastering Regular
> Expression that might help you understand what might be happening. So
> if you can provide a better example than just the error message, it
> would be helpful.


This is possibly a stack issue. Error messages are often not literal;
I have seen "out of memory" reported for graphics device objects :)
A failure inside the regex code suggests a stack issue, but that is
only a guess at the mechanism of death. What you probably really want
is a simpler regex :)
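
A minimal sketch of one way to get a simpler match, assuming the goal is only
to test whether one literal string occurs inside another. grepl() accepts
fixed = TRUE, which treats the pattern as a plain string instead of compiling
it as a regular expression. The data below are made up for illustration: the
token is copied from the error message, the repeat counts are arbitrary, and
fut_string / past_string are just the argument names shown in the error.

```r
# Hypothetical data: a repeated token like the one in the error message.
token       <- "12653a6#"
past_string <- paste(rep(token, 5000), collapse = "")
fut_string  <- paste(rep(token, 2000), collapse = "")

# fixed = TRUE: the pattern is matched literally, so no regex is compiled.
found <- grepl(fut_string, past_string, fixed = TRUE)

# regexpr() with fixed = TRUE also reports where the first match starts.
pos <- regexpr(fut_string, past_string, fixed = TRUE)

# A string that cannot occur in past_string (it contains no "z").
absent <- grepl("z12653a6#", past_string, fixed = TRUE)
```

If the real patterns ever need regex features (alternation, wildcards), this
shortcut no longer applies and the pattern itself would have to be shortened.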




>
> On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella  wrote:
> > Dear All,
> > I am experiencing some problems with a script of mine.
> > It crashes with this message
> >
> > Error in grepl(fut_string, past_string) :
> >  invalid regular expression
> > '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
> > Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
> > In addition: Warning message:
> > In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
> > Execution halted
> >
> > To make a long story short, I use some functions which eventually call grepl
> > on very long strings to check whether a certain substring is part of a
> > longer string.
> > Now, the script technically works (it never crashes when I run it on a
> > smaller dataset) and the problem does not seem to be RAM memory (I have
> > several GB of RAM on my machine and its consumption never shoots up so my
> > machine never resorts to swap memory).
> > So (though I am not an expert) it looks like the problem is some limitation
> > of grepl or R memory management.
> > Any idea about how I could tackle this problem or how I can profile my code
> > to fix it (though it really seems to me that I have to find a way to allow R
> > to process longer strings).
> > Any suggestion is appreciated.
> > Cheers
> >
> > Lorenzo
> >
> > __
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
  


Re: [R] Memory management in R

2010-10-08 Thread jim holtman
More specificity: how long is the string, and what is the pattern you are
matching against?  It sounds like you might have a complex pattern
that, in trying to match the string, might be doing a lot of
backtracking and such.  There is an O'Reilly book, "Mastering Regular
Expressions", that might help you understand what might be happening.  So
if you can provide a better example than just the error message, it
would be helpful.
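
A sketch of the kind of self-contained example being asked for, using made-up
data. It reports the length of both strings and captures whatever error
grepl() raises instead of halting the script. The helper name try_match and
the sizes are assumptions for illustration; this does not claim to reproduce
the regcomp error at these sizes.

```r
# Hypothetical helper: run grepl() and return either its result or the
# error message, together with the sizes of both inputs.
try_match <- function(pattern, text, ...) {
  outcome <- tryCatch(
    list(ok = TRUE, value = grepl(pattern, text, ...)),
    error = function(e) list(ok = FALSE, value = conditionMessage(e))
  )
  c(list(pattern_chars = nchar(pattern), text_chars = nchar(text)), outcome)
}

# Made-up inputs built from the token seen in the error message.
token   <- "12653a6#"
text    <- paste(rep(token, 200), collapse = "")
pattern <- paste(rep(token, 100), collapse = "")

res <- try_match(pattern, text)
str(res)
```

Increasing the repeat counts step by step and posting the first sizes at which
ok becomes FALSE would answer the "how long is the string" question directly.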

On Fri, Oct 8, 2010 at 1:11 PM, Lorenzo Isella  wrote:
> Dear All,
> I am experiencing some problems with a script of mine.
> It crashes with this message
>
> Error in grepl(fut_string, past_string) :
>  invalid regular expression
> '12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
> Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
> In addition: Warning message:
> In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
> Execution halted
>
> To make a long story short, I use some functions which eventually call grepl
> on very long strings to check whether a certain substring is part of a
> longer string.
> Now, the script technically works (it never crashes when I run it on a
> smaller dataset) and the problem does not seem to be RAM memory (I have
> several GB of RAM on my machine and its consumption never shoots up so my
> machine never resorts to swap memory).
> So (though I am not an expert) it looks like the problem is some limitation
> of grepl or R memory management.
> Any idea about how I could tackle this problem or how I can profile my code
> to fix it (though it really seems to me that I have to find a way to allow R
> to process longer strings).
> Any suggestion is appreciated.
> Cheers
>
> Lorenzo
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Memory management in R

2010-10-08 Thread Doran, Harold
These questions are OS-specific. Please provide sessionInfo() or other details 
as needed

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Lorenzo Isella
Sent: Friday, October 08, 2010 1:12 PM
To: r-help
Subject: [R] Memory management in R

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
   invalid regular expression 
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call 
grepl on very long strings to check whether a certain substring is part 
of a longer string.
Now, the script technically works (it never crashes when I run it on a 
smaller dataset) and the problem does not seem to be RAM memory (I have 
several GB of RAM on my machine and its consumption never shoots up so 
my machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some 
limitation of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my 
code to fix it (though it really seems to me that I have to find a way 
to allow R to process longer strings).
Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] Memory management in R

2010-10-08 Thread Lorenzo Isella

On 10/08/2010 07:25 PM, Doran, Harold wrote:
> These questions are OS-specific. Please provide sessionInfo() or other
> details as needed

I see. I am running R on a 64 bit machine running Ubuntu 10.04

> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-pc-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base


and in case it matters, this is the output of my top command

$ top

top - 19:28:21 up  8:04,  8 users,  load average: 0.60, 0.72, 1.33
Tasks: 220 total,   1 running, 219 sleeping,   0 stopped,   0 zombie
Cpu(s): 10.3%us,  0.6%sy,  0.0%ni, 87.2%id,  1.9%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   6110484k total,  3847008k used,  2263476k free,    72748k buffers
Swap:  2929656k total,        0k used,  2929656k free,  2621420k cached

Cheers

Lorenzo


-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
Sent: Friday, October 08, 2010 1:12 PM
To: r-help
Subject: [R] Memory management in R

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
invalid regular expression
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12
Calls: entropy_estimate_hash ->  total_entropy_lz ->  entropy_lz ->  grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call
grepl on very long strings to check whether a certain substring is part
of a longer string.
Now, the script technically works (it never crashes when I run it on a
smaller dataset) and the problem does not seem to be RAM memory (I have
several GB of RAM on my machine and its consumption never shoots up so
my machine never resorts to swap memory).
So (though I am not an expert) it looks like the problem is some
limitation of grepl or R memory management.
Any idea about how I could tackle this problem or how I can profile my
code to fix it (though it really seems to me that I have to find a way
to allow R to process longer strings).
Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




[R] Memory management in R

2010-10-08 Thread Lorenzo Isella

Dear All,
I am experiencing some problems with a script of mine.
It crashes with this message

Error in grepl(fut_string, past_string) :
  invalid regular expression 
'12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12653a6#12

Calls: entropy_estimate_hash -> total_entropy_lz -> entropy_lz -> grepl
In addition: Warning message:
In grepl(fut_string, past_string) : regcomp error:  'Out of memory'
Execution halted

To make a long story short, I use some functions which eventually call
grepl on very long strings to check whether a certain substring is part
of a longer string.
Now, the script technically works (it never crashes when I run it on a
smaller dataset) and the problem does not seem to be RAM (I have
several GB of RAM on my machine and its consumption never shoots up, so
my machine never resorts to swap).
So (though I am not an expert) it looks like the problem is some
limitation of grepl or of R's memory management.
Any idea how I could tackle this problem, or how I could profile my
code to fix it? It really seems to me that I have to find a way
to allow R to process longer strings.

Any suggestion is appreciated.
Cheers

Lorenzo

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] memory management in R

2010-06-16 Thread Jens Oehlschlägel
You might want to mention/talk about packages that enhance R's ability to work 
with less RAM / more data, such as package SOAR (transparently moving objects 
between RAM and disk) and ff (which allows vectors and dataframes larger than 
RAM and which supports dense datatypes like true boolean, short integers etc.). 

Jens Oehlschlägel



-----Original Message-----
From: john
Sent: Jun 16, 2010 12:20:17 PM
To: r-help@r-project.org
Subject: [R] memory management in R

>
>
>I have volunteered to give a short talk on "memory management in R" 
>   to my local R user group, mainly to motivate myself to learn about it. 
>
>The focus will be on what a typical R coder might want to know  ( e.g. how
>objects are created, call by value, basics of garbage collection ) but I
>want to go a little deeper just in case there are some advanced users in the
>crowd. 
>
>Here are the resources I am using right now
>  Chambers book "Software for Data Analysis" 
>  Manuals such as "R Internals" and "Writing R Extensions" 
>
>Any suggestions on other sources of information? 
>
>There are still some things that are not clear to me, such as
>  - how to make sense of the output from various memory diagnostics such as 
>memory.profile ... are these counts? 
>How to get the amount of memory used: gc() and memory.size() seem to
>differ
> -  what gets allocated on the heap versus stack
> - why the name "cons cells" for the stack allocation 
>
>Any help with these would be greatly appreciated. 
>
>Thanks greatly, 
>
>John Muller
>
>__
>R-help@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.



[R] memory management in R

2010-06-16 Thread john


I have volunteered to give a short talk on "memory management in R" 
   to my local R user group, mainly to motivate myself to learn about it. 

The focus will be on what a typical R coder might want to know  ( e.g. how
objects are created, call by value, basics of garbage collection ) but I
want to go a little deeper just in case there are some advanced users in the
crowd. 
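
To make the "call by value" item concrete, here is a minimal sketch (the
function name add_one is made up for illustration): modifying an argument
inside a function leaves the caller's object untouched.

```r
# Hypothetical function: modifies its argument and returns the modified copy.
add_one <- function(v) {
  v[1] <- v[1] + 1   # changes only the function's local binding
  v
}

x <- c(10, 20, 30)
y <- add_one(x)

x   # still 10 20 30: the caller's vector is unchanged
y   # 11 20 30
```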

Here are the resources I am using right now
  Chambers book "Software for Data Analysis" 
  Manuals such as "R Internals" and "Writing R Extensions" 

Any suggestions on other sources of information? 

There are still some things that are not clear to me, such as
  - how to make sense of the output from various memory diagnostics such as
    memory.profile ... are these counts?
  - how to get the amount of memory used: gc() and memory.size() seem to
    differ
  - what gets allocated on the heap versus stack
  - why the name "cons cells" for the stack allocation
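
A short sketch of the base R diagnostics named in the list above, for anyone
who wants to try them side by side. gc() returns a matrix with rows Ncells and
Vcells, memory.profile() returns a named vector of counts per internal type,
and object.size() estimates the size of a single object. The vector x is made
up for illustration; memory.size() is left out because it is Windows-only.

```r
x <- numeric(1e6)            # one million doubles, roughly 8 MB

g <- gc()                    # runs a collection and returns a usage matrix
rownames(g)                  # "Ncells" "Vcells"
g[, "used"]                  # number of cells in use, not bytes

mp <- memory.profile()       # named vector: counts per internal (SEXP) type
head(mp)

sz <- object.size(x)         # estimated bytes used by x
print(sz, units = "Mb")
```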

Any help with these would be greatly appreciated. 

Thanks greatly, 

John Muller

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.