Re: [R] handling big data set in R

2008-03-03 Thread Liviu Andronic
On 3/3/08, shu zhang <[EMAIL PROTECTED]> wrote:
> Hello R users,
>
>  I'm wondering whether it is possible to manage big data set in R? I

This [1] recent thread might be of interest.
Liviu

[1] http://www.nabble.com/How-to-read-HUGE-data-sets--tt15729830.html

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] handling big data set in R

2008-03-03 Thread ONKELINX, Thierry
Dear Shu,

Why not store your dataset in a database? Then you can start each loop
by reading the submatrix you need for the analysis. This will require
much less memory. Loops from the apply family will work better than the
for loop.
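
As a rough illustration of the apply-family idea (a minimal sketch in base R, not a tested solution for the full 3-million-row file), one could split only the row indices by X and compute both slopes per group with vapply. The data frame dat and the helper group_slopes are made up for this example; only the column names X, Y, Z come from the original post.

```r
# Hypothetical stand-in for the poster's data: group id X, variables Y and Z.
set.seed(1)
dat <- data.frame(X = rep(1:4, each = 50),
                  Y = rpois(200, lambda = 5),
                  Z = rnorm(200))

# Hypothetical helper: both slopes for the rows of a single group.
group_slopes <- function(d) {
  counts <- table(d$Y)                  # frequency table of Y
  hist_x <- as.numeric(names(counts))   # distinct Y values
  hist_y <- as.numeric(counts)          # their counts
  c(s1 = unname(coef(lm(hist_y ~ hist_x))[2]),     # slope of the histogram of Y
    s2 = unname(coef(lm(Y ~ Z, data = d))[2]))     # slope of Y regressed on Z
}

# Split the row indices (not three separate columns) by group,
# then apply the helper to one group at a time.
idx <- split(seq_len(nrow(dat)), dat$X)
slopes <- t(vapply(idx, function(i) group_slopes(dat[i, ]),
                   FUN.VALUE = c(s1 = 0, s2 = 0)))
slopes
```

The result is a matrix with one row per group and columns s1 and s2. Whether this is enough to avoid running out of memory on the real data depends on the machine; reading one group at a time from a database, as suggested above, would reduce memory use further.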

HTH,

Thierry



ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
[EMAIL PROTECTED] 
www.inbo.be 

Do not put your faith in what statistics say until you have carefully
considered what they do not say.  ~William W. Watt
A statistical analysis, properly conducted, is a delicate dissection of
uncertainties, a surgery of suppositions. ~M.J.Moroney

-Original message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On behalf of shu zhang
Sent: Monday, 3 March 2008 6:35
To: r-help@r-project.org
Subject: [R] handling big data set in R

Hello R users,

I'm wondering whether it is possible to manage a big data set in R. I
have a data set with about 3 million rows and 3 columns (X, Y, Z), where X is
the group id. For each X, I need to run 2 regressions on the submatrix.
I used the function "split":

datamatrix<-read.csv("datas.csv", header=F, sep=",")
dim(datamatrix)
# [1] 2980523  3
names(datamatrix)<-c("X","Y","Z")

attach(datamatrix)

subX<-split(X, X)
subY<-split(Y,X)
subZ<-split(Z,X)
n<-length(subY)  ### number of groups
s1<-s2<-rep(NA, n)  ### vectors to store the regression slopes

for (i in 1:n){
  a<-table(subY[[i]])  ### frequency table of y within group i
  table.x<-as.numeric(names(a))
  table.y<-as.numeric(a)
  fit1<-lm(table.y~table.x)  ### find the slope of the histogram of y
  s1[i]<-fit1$coefficients[2]

  fit2<-lm(subY[[i]]~subZ[[i]])  ### regress y on z
  s2[i]<-fit2$coefficients[2]
}


But my R died before completing the loop... (I've thought about doing
it in SAS, but I don't know how to write a loop combined with a PROC
REG...)

One thing that might be helpful is that my data set has already been
sorted by X. I don't know whether this is of any help in
managing the data set.

Any suggestion would be appreciated!


Thanks!
-Shu



[R] handling big data set in R

2008-03-02 Thread shu zhang
Hello R users,

I'm wondering whether it is possible to manage a big data set in R. I
have a data set with about 3 million rows and 3 columns (X, Y, Z), where X is
the group id. For each X, I need to run 2 regressions on the submatrix.
I used the function "split":

datamatrix<-read.csv("datas.csv", header=F, sep=",")
dim(datamatrix)
# [1] 2980523  3
names(datamatrix)<-c("X","Y","Z")

attach(datamatrix)

subX<-split(X, X)
subY<-split(Y,X)
subZ<-split(Z,X)
n<-length(subY)  ### number of groups
s1<-s2<-rep(NA, n)  ### vectors to store the regression slopes

for (i in 1:n){
  a<-table(subY[[i]])  ### frequency table of y within group i
  table.x<-as.numeric(names(a))
  table.y<-as.numeric(a)
  fit1<-lm(table.y~table.x)  ### find the slope of the histogram of y
  s1[i]<-fit1$coefficients[2]

  fit2<-lm(subY[[i]]~subZ[[i]])  ### regress y on z
  s2[i]<-fit2$coefficients[2]
}


But my R died before completing the loop... (I've thought about doing
it in SAS, but I don't know how to write a loop combined with a PROC
REG...)

One thing that might be helpful is that my data set has already been
sorted by X. I don't know whether this is of any help in
managing the data set.
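
For what it is worth, sorted data means each group occupies one contiguous block of rows, so the block boundaries can be found with rle() instead of split(). The following is only a sketch under that assumption; the data frame dat and its group ids are invented for illustration, and only the Y-on-Z slope is computed.

```r
# Hypothetical sorted data: all rows of a group X are adjacent.
set.seed(2)
dat <- data.frame(X = rep(c(10, 20, 30), times = c(40, 60, 50)),
                  Y = rpois(150, lambda = 5),
                  Z = rnorm(150))

# Run lengths of X give the first and last row of each group's block.
runs   <- rle(dat$X)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

# Regress Y on Z within each block and keep the slope.
s2 <- numeric(length(ends))
for (g in seq_along(ends)) {
  rows  <- starts[g]:ends[g]
  s2[g] <- coef(lm(dat$Y[rows] ~ dat$Z[rows]))[2]
}
names(s2) <- runs$values
s2
```

In principle the same start and end positions could be passed to read.csv() through its skip and nrows arguments, so that only one group is held in memory at a time, though that has not been tried on the file described here.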

Any suggestion would be appreciated!


Thanks!
-Shu


