Re: [R] difftimes; histogram; memory problems
Just one further point. If you do run out of memory using #2, then try this. It is the same as #2 but adds a dbname argument to force the computation to be done on disk rather than in memory:

sqldf("select d1.x - d2.x, count(*) from d1, d2 group by d1.x - d2.x", dbname = tempfile())

On Mon, Feb 15, 2010 at 10:45 PM, Gabor Grothendieck wrote:
> Here are two approaches to try:
>
>> # test data
>> d1 <- data.frame(x = Sys.Date() + 1:3)
>> d2 <- data.frame(x = Sys.Date() - 1:3)
>
>> # 1. you might not have enough memory for this but it's short
>> table(outer(1:3, -(1:3), "-"))
>
> 2 3 4 5 6
> 1 2 3 2 1
>
>> # 2. this one performs all the operations outside of R, only getting
>> # the result back in, so it won't need as much memory
>>
>> library(sqldf)
>> sqldf("select d1.x - d2.x, count(*) from d1, d2 group by d1.x - d2.x")
>   d1.x - d2.x count(*)
> 1           2        1
> 2           3        2
> 3           4        3
> 4           5        2
> 5           6        1

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] difftimes; histogram; memory problems
Hi Jonathan,

If minDay = min(Condition1) - max(Condition2) and maxDay = max(Condition1) - min(Condition2), then all your differences will lie between minDay and maxDay, and hopefully this is not a very big range (unless you are going many thousands of years into the past or the future). So basically, for each number of days in this range, you need to count the number of times it appears.

To speed up the calculations you can do this with just one loop (and one vectorized operation inside it). I cannot see how to do it without at least one loop if we want to limit the memory use.

Let me know if you need the actual code.

Regards,
Moshe.

--- On Tue, 16/2/10, Jonathan wrote:

> From: Jonathan
> Subject: Re: [R] difftimes; histogram; memory problems
> To: "r-help"
> Received: Tuesday, 16 February, 2010, 1:17 PM
>
> Since it seems these matrices are too large, I'm wondering whether
> there's a better way to call a hist command without actually building
> the said matrix..
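The one-loop idea described in this message could look roughly like the sketch below. It is not code from the thread: the names d1, d2 and counts are made up for illustration, and it assumes the date strings from the original post are first converted with as.Date(). Memory use stays at one vector of length maxDay - minDay + 1 plus one temporary vector the length of Condition2.

```r
# Example data from the original post
Condition1 <- data.frame(dates = rep(c("2001-02-10", "1998-03-14"), 6000))
Condition2 <- data.frame(dates = rep(c("2003-07-06", "2007-03-11"), 8000))

# Dates as whole days since 1970-01-01 (as.character() copes with factors)
d1 <- as.integer(as.Date(as.character(Condition1$dates)))
d2 <- as.integer(as.Date(as.character(Condition2$dates)))

# Smallest and largest possible difference d1 - d2
minDay <- min(d1) - max(d2)
maxDay <- max(d1) - min(d2)

# One bin per possible difference; numeric so large counts cannot overflow
counts <- numeric(maxDay - minDay + 1)

for (a in d1) {
  # Differences of one Condition1 date against all Condition2 dates,
  # shifted so that minDay falls into bin 1
  counts <- counts + tabulate(a - d2 - minDay + 1L, nbins = length(counts))
}
names(counts) <- minDay:maxDay
```

Something like plot(minDay:maxDay, counts, type = "h") should then draw the histogram without the 12,000 by 16,000 matrix ever being built.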
Re: [R] difftimes; histogram; memory problems
Here are two approaches to try:

> # test data
> d1 <- data.frame(x = Sys.Date() + 1:3)
> d2 <- data.frame(x = Sys.Date() - 1:3)

> # 1. you might not have enough memory for this but it's short
> table(outer(1:3, -(1:3), "-"))

2 3 4 5 6
1 2 3 2 1

> # 2. this one performs all the operations outside of R, only getting
> # the result back in, so it won't need as much memory
>
> library(sqldf)
> sqldf("select d1.x - d2.x, count(*) from d1, d2 group by d1.x - d2.x")
  d1.x - d2.x count(*)
1           2        1
2           3        2
3           4        3
4           5        2
5           6        1

On Mon, Feb 15, 2010 at 9:17 PM, Jonathan wrote:
> Since it seems these matrices are too large, I'm wondering whether
> there's a better way to call a hist command without actually building
> the said matrix..
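Either approach returns the distinct differences together with their counts, which is already everything a histogram needs. The sketch below is an illustration only, not part of the original reply: it applies approach 1 to the test dates themselves (converted to whole days with as.integer(), an assumption made here to keep outer() on plain numbers) and draws the counts directly.

```r
# Small test data, as in the message above
d1 <- data.frame(x = Sys.Date() + 1:3)
d2 <- data.frame(x = Sys.Date() - 1:3)

# Day difference for every pair of dates, then tabulated
tab <- table(outer(as.integer(d1$x), as.integer(d2$x), "-"))

# The table holds one count per distinct difference, so it can be
# drawn as bars without keeping the full matrix of differences around
barplot(tab, xlab = "difference in days", ylab = "number of pairs")
```

For the full-size data this still builds the complete matrix inside outer(), so it only helps with plotting; the memory saving has to come from approach 2 or from a loop.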
Re: [R] difftimes; histogram; memory problems
Let me fix a couple of typos in that email:

Hi All:

Let's say I have two dataframes (Condition1 and Condition2), on the order of 12,000 and 16,000 rows respectively; 1 column each. The entries contain dates.

I'd like to calculate, for each possible pair of dates (that is: Condition1[1:12,000] and Condition2[1:16,000]), the number of days difference between the dates in the pair. The result should be a matrix 12,000 by 16,000, which I'll call M. The purpose of building such a matrix M is to create a histogram of all the values contained within it.

Example:

Condition1 <- data.frame('dates' = rep(c('2001-02-10','1998-03-14'),6000))
Condition2 <- data.frame('dates' = rep(c('2003-07-06','2007-03-11'),8000))

First, my instinct is to try and vectorize the operation. I tried this by expanding each vector into a matrix of repeated vectors (I'd then just subtract the two resultant matrices to get matrix M). I got the following error:

> expandedCondition1 <- matrix(rep(Condition1[[1]], nrow(Condition2)),
                               byrow=TRUE, ncol=nrow(Condition1))
Error: cannot allocate vector of size 732.4 Mb
> expandedCondition2 <- matrix(rep(Condition2[[1]], nrow(Condition1)),
                               byrow=FALSE, nrow=nrow(Condition2))
Error: cannot allocate vector of size 732.4 Mb

Since it seems these matrices are too large, I'm wondering whether there's a better way to call a hist command without actually building the said matrix.

I'd greatly appreciate any ideas!

Best,
Jonathan

On Mon, Feb 15, 2010 at 8:19 PM, Jonathan wrote:
> I'd like to calculate, for each possible pair of dates (that is:
> Condition1[1:10,000] and Condition2[1:10,000], the number of days
> difference between the dates in the pair.
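As a rough check on the error message, the reported 732.4 Mb matches a 12,000 by 16,000 matrix of 4-byte values. That would fit the date strings being stored as a factor (integer codes), which was the data.frame default at the time, though this is an inference and not something stated in the thread. The variable names below are only for illustration.

```r
n1 <- 12000                      # rows in Condition1
n2 <- 16000                      # rows in Condition2
cells <- n1 * n2                 # 192,000,000 entries in the full matrix

mb_int <- cells * 4 / 1024^2     # 4-byte integers (factor codes)
mb_dbl <- cells * 8 / 1024^2     # 8-byte doubles (e.g. Date or difftime values)

round(c(integer = mb_int, double = mb_dbl), 1)
```

A matrix of actual day differences would be stored as doubles and so need about twice as much again, which is why counting the differences without materialising M is the more promising route.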
[R] difftimes; histogram; memory problems
Hi All:

Let's say I have two dataframes (Condition1 and Condition2); each being on the order of 12,000 and 16,000 rows; 1 column. The entries contain dates.

I'd like to calculate, for each possible pair of dates (that is: Condition1[1:10,000] and Condition2[1:10,000], the number of days difference between the dates in the pair. The result should be a matrix 12,000 by 16,000. Really, what I need is a histogram of all the values in this matrix.

Ex):

Condition1 <- data.frame('dates' = rep(c('2001-02-10','1998-03-14'),6000))
Condition2 <- data.frame('dates' = rep(c('2003-07-06','2007-03-11'),8000))

First, my instinct is to try and vectorize the operation. I tried this by expanding each vector into a matrix of repeated vectors (I'd then just subtract the two). I got the following error:

> expandedCondition1 <- matrix(rep(Condition1[[1]], nrow(Condition2)),
                               byrow=TRUE, ncol=nrow(Condition1))
Error: cannot allocate vector of size 732.4 Mb
> expandedCondition2 <- matrix(rep(Condition2[[1]], nrow(Condition1)),
                               byrow=FALSE, nrow=nrow(Condition2))
Error: cannot allocate vector of size 732.4 Mb

Since it seems these matrices are too large, I'm wondering whether there's a better way to call a hist command without actually building the said matrix..

I'd greatly appreciate any ideas!

Best,
Jonathan