Re: [R] handling big data set in R
On 3/3/08, shu zhang <[EMAIL PROTECTED]> wrote:
> Hello R users,
>
> I'm wondering whether it is possible to manage a big data set in R? [...]

This [1] recent thread might be of interest.

Liviu

[1] http://www.nabble.com/How-to-read-HUGE-data-sets--tt15729830.html
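One of the usual suggestions for reading a file that size is to tell read.csv the column classes up front, so it does not have to guess types over roughly 3 million rows. A minimal sketch, assuming the three-numeric-column layout and the file name from the original post:

datamatrix <- read.csv("datas.csv", header = FALSE, sep = ",",
                       col.names = c("X", "Y", "Z"),
                       colClasses = c("numeric", "numeric", "numeric"))
# colClasses skips the type-guessing pass, which saves time and memory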
Re: [R] handling big data set in R
Dear Shu,

Why not store your dataset in a database? Then you can start each loop by reading only the submatrix you need for the analysis. This will require much less memory. Loops from the apply family will work better than the for loop.

HTH,
Thierry

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
[EMAIL PROTECTED]
www.inbo.be

Do not put your faith in what statistics say until you have carefully considered what they do not say. ~ William W. Watt
A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. ~ M.J. Moroney

-----Original message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On behalf of shu zhang
Sent: Monday, 3 March 2008 6:35
To: r-help@r-project.org
Subject: [R] handling big data set in R

> Hello R users,
>
> I'm wondering whether it is possible to manage a big data set in R? [...]
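A minimal sketch of that approach, assuming an SQLite database accessed through the DBI and RSQLite packages; the database file, the table name and the assumption of integer group ids are made up for illustration:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "datas.db")   # hypothetical database file

## one-off import of the csv into a table (could also be done outside R):
## dbWriteTable(con, "datamatrix",
##              read.csv("datas.csv", header = FALSE,
##                       col.names = c("X", "Y", "Z")))

groups <- dbGetQuery(con, "SELECT DISTINCT X FROM datamatrix")$X

## sapply instead of an explicit for loop
s2 <- sapply(groups, function(g) {
  sub <- dbGetQuery(con,
                    sprintf("SELECT Y, Z FROM datamatrix WHERE X = %d", g))
  coef(lm(Y ~ Z, data = sub))[2]   # slope of Y on Z for this group
})

dbDisconnect(con)

Only one group's rows are ever in memory at a time, which is the point of reading the submatrix inside the loop.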
[R] handling big data set in R
Hello R users,

I'm wondering whether it is possible to manage a big data set in R? I have a data set with 3 million rows and 3 columns (X, Y, Z), where X is the group id. For each X, I need to run 2 regressions on the submatrix. I used the function "split":

datamatrix <- read.csv("datas.csv", header = FALSE, sep = ",")
dim(datamatrix)
# [1] 2980523 3
names(datamatrix) <- c("X", "Y", "Z")
attach(datamatrix)

subX <- split(X, X)
subY <- split(Y, X)
subZ <- split(Z, X)

n <- length(subX)          ### number of groups
s1 <- s2 <- rep(NA, n)     ### vectors to store the regression slopes

for (i in 1:n) {
  a <- table(subY[[i]])
  table.x <- as.numeric(names(a))
  table.y <- as.numeric(a)
  fit1 <- lm(table.y ~ table.x)       # find the slope of the histogram of y
  s1[i] <- fit1$coefficients[2]
  fit2 <- lm(subY[[i]] ~ subZ[[i]])   ### regress y on z
  s2[i] <- fit2$coefficients[2]
}

But my R died before completing the loop... (I've thought about doing it in SAS, but I don't know how to write a loop combined with a PROC REG...)

One thing that might be helpful is that my data set has already been sorted by X; I don't know whether that is of any use for managing the dataset.

Any suggestion would be appreciated! Thanks!

-Shu
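For reference, the same two slopes can be computed without attach() and with a single split of row indices, which keeps fewer copies of the columns around; a sketch reusing the names from the message above (it restates the loop, it does not by itself solve the memory problem):

idx <- split(seq_len(nrow(datamatrix)), datamatrix$X)   # row numbers per group

slopes <- sapply(idx, function(i) {
  d <- datamatrix[i, ]
  a <- table(d$Y)
  fit1 <- lm(as.numeric(a) ~ as.numeric(names(a)))   # slope of the histogram of Y
  fit2 <- lm(Y ~ Z, data = d)                        # slope of Y on Z
  c(s1 = unname(coef(fit1)[2]), s2 = unname(coef(fit2)[2]))
})
# slopes is a 2 x n matrix; slopes["s1", ] and slopes["s2", ] hold the two sets of slopes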