Re: [R] big data?
Correcting a typo (400 MB, not GB; thanks to David Winsemius for reporting it). Spencer

###

Thanks to all who replied. For the record, I will summarize here what I tried and what I learned:

Mike Harwood suggested the ff package. David Winsemius suggested data.table and colbycol. Peter Langfelder suggested sqldf.

sqldf::read.csv.sql allowed me to create an SQL command to read a column or a subset of the rows of a 400 MB tab-delimited file in roughly a minute on a 2.3 GHz dual-core machine running Windows 7 with 8 GB RAM. It also read a column of a 1.3 GB file in 4 minutes. The documentation was sufficient to let me get what I wanted with a minimum of effort.

If I needed to work with these data regularly, I might experiment with colbycol and ff: the documentation suggested to me that these packages might allow quicker answers to routine tasks after some preprocessing. Of course, I could also do the preprocessing manually with sqldf.

Thanks, again.
Spencer
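A minimal sketch of the kind of read.csv.sql call described above (the file name, column names, and row filter are placeholders, not anything from the original thread):

library(sqldf)

## Pull one column, restricted to a subset of rows, out of a large
## tab-delimited file; in read.csv.sql the file is referred to as
## "file" inside the SQL statement.  "score" and "site" are
## hypothetical column names.
scores <- read.csv.sql("big_file.txt",
                       sql = "select score from file where site = 'A'",
                       header = TRUE, sep = "\t")

Behind the scenes, sqldf loads the file into a temporary SQLite database and runs the query there, so only the selected rows and columns ever occupy R's memory.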
Re: [R] big data?
Thanks to all who replied. For the record, I will summarize here what I tried and what I learned:

Mike Harwood suggested the ff package. David Winsemius suggested data.table and colbycol. Peter Langfelder suggested sqldf.

sqldf::read.csv.sql allowed me to create an SQL command to read a column or a subset of the rows of a 400 MB tab-delimited file in roughly a minute on a 2.3 GHz dual-core machine running Windows 7 with 8 GB RAM. It also read a column of a 1.3 GB file in 4 minutes. The documentation was sufficient to let me get what I wanted with a minimum of effort.

If I needed to work with these data regularly, I might experiment with colbycol and ff: the documentation suggested to me that these packages might allow quicker answers to routine tasks after some preprocessing. Of course, I could also do the preprocessing manually with sqldf.

Thanks, again.
Spencer

--
Spencer Graves, PE, PhD
President and Chief Technology Officer
Structure Inspection and Monitoring, Inc.
751 Emerson Ct.
San José, CA 95126
ph: 408-655-4567
web: www.structuremonitoring.com
Re: [R] big data?
The read.table.ffdf function in the ff package can read in delimited files and store them to disk as individual columns. The ffbase package provides additional data management and analytic functionality. I have used these packages on 15 GB files of 18 million rows and 250 columns.
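A hedged sketch of that workflow (file and column names are invented; first.rows and next.rows control how much is held in RAM during the import):

library(ff)
library(ffbase)

## Read a large tab-delimited file chunk by chunk into an on-disk ffdf.
dat <- read.table.ffdf(file = "big_file.txt", header = TRUE, sep = "\t",
                       first.rows = 10000, next.rows = 100000)

dim(dat)              # dimensions are available without loading the data
score <- dat$score[]  # materialize a single (hypothetical) column in RAM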
Re: [R] big data?
On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote:

> What tools do you like for working with tab-delimited text files up to
> 1.5 GB (under Windows 7 with 8 GB RAM)?

?data.table::fread

> Standard tools for smaller data sometimes grab all the available RAM,
> after which CPU usage drops to 3% ;-)
>
> The "bigmemory" project won the 2010 John Chambers Award but "is not
> available (for R version 3.1.0)".
>
> findFn("big data", 999) downloaded 961 links in 437 packages. That
> contains tools for data in PostgreSQL and other formats, but I couldn't
> find anything for large tab-delimited text files.
>
> Absent a better idea, I plan to write a function getField to extract a
> specific field from the data, then use that to split the data into 4
> smaller files, which I think should be small enough that I can do what
> I want.

There is the colbycol package, with which I have no experience, but I understand it is designed to partition data into column-sized objects.

#--- from its help file ---
cbc.get.col {colbycol}    R Documentation
Reads a single column from the original file into memory

Description

Function cbc.read.table reads a file, stores it column by column in disk files, and creates a colbycol object. Function cbc.get.col queries this object and returns a single column.

> Thanks,
> Spencer

David Winsemius
Alameda, CA, USA
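A minimal sketch of the fread suggestion (the select argument, available in newer versions of data.table, reads only the named columns; the file and column names are placeholders):

library(data.table)

## fread autodetects the tab separator and reads far faster than
## read.table; 'select' keeps the other columns out of RAM entirely.
DT <- fread("big_file.txt", select = c("id", "score"))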
Re: [R] big data?
Have you tried read.csv.sql from package sqldf?

Peter

On Tue, Aug 5, 2014 at 10:20 AM, Spencer Graves wrote:
> What tools do you like for working with tab-delimited text files up to
> 1.5 GB (under Windows 7 with 8 GB RAM)? ...
Re: [R] Big Data reading subsample csv
The read.csv.sql function in the sqldf package may make this approach quite simple.

On Thu, Aug 16, 2012 at 10:12 AM, jim holtman wrote:
> Why not put this into a database? Then you can easily extract the
> records you want by specifying the record numbers. You pay the one-time
> expense of creating the database, but then have much faster access to
> the data as you make subsequent runs.

--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
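A hedged sketch of what that could look like with read.csv.sql: sample observation ids first, then pull only the matching rows from the big csv. The column name obs_id and the file name are invented, and a very long IN list may need to be split into batches:

library(sqldf)

## Draw a subsample of ids, then read only the complete observations
## carrying those ids; everything else stays on disk.
ids <- sample(1:6000000, 1000)
qry <- sprintf("select * from file where obs_id in (%s)",
               paste(ids, collapse = ","))
sub <- read.csv.sql("huge.csv", sql = qry, header = TRUE)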
Re: [R] Big Data reading subsample csv
Why not put this into a database? Then you can easily extract the records you want by specifying the record numbers. You pay the one-time expense of creating the database, but then have much faster access to the data as you make subsequent runs.

On Thu, Aug 16, 2012 at 9:44 AM, Tudor Medallion wrote:
> Hello,
>
> I'm most grateful for your time to read this.
>
> I have an uber-size 30 GB file of 6 million records and 3000 (mostly
> categorical data) columns in csv format. I want to bootstrap subsamples
> for multinomial regression, but even with the 64 GB of RAM in my machine
> and twice that in swap it's proving difficult; the process becomes super
> slow and halts.
>
> I'm thinking about generating subsample indices in R and feeding them
> into a system command using sed or awk, but don't know how to do this.
> If someone knew of a clean way to do this using just R commands, I would
> be really grateful.
>
> One problem is that I need to pick complete observations of subsamples;
> that is, I need to have all the rows of a particular multinomial
> observation - they are not the same length from observation to
> observation. I plan to use glmnet and then some fancy transforms to get
> an approximation to the multinomial case. One other point is that I
> don't know how to choose a sample size to fit around memory limits.
>
> Appreciate your thoughts greatly.
>
> > R.version
> platform       x86_64-pc-linux-gnu
> arch           x86_64
> os             linux-gnu
> system         x86_64, linux-gnu
> status
> major          2
> minor          15.1
> year           2012
> month          06
> day            22
> svn rev        59600
> language       R
> version.string R version 2.15.1 (2012-06-22)
> nickname       Roasted Marshmallows
>
> tags: read.csv(), system(), awk, sed, sample(), glmnet, multinomial, MASS.
>
> Yoda

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
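A hedged sketch of that one-time load with RSQLite (the file and table names are made up, and dbWriteTable's direct file-import interface may differ across RSQLite versions):

library(RSQLite)

con <- dbConnect(SQLite(), dbname = "huge.db")

## One-time cost: import the csv into a table.  For files too large to
## read at once, the same table can also be filled in chunks with
## dbWriteTable(..., append = TRUE).
dbWriteTable(con, "obs", "huge.csv", header = TRUE, sep = ",")

## Subsequent runs: pull just the records you want.  SQLite's implicit
## rowid serves as the record number.
sub <- dbGetQuery(con, "select * from obs where rowid between 5000 and 6000")
dbDisconnect(con)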
Re: [R] Big data and column correspondence problem
Daniel, thanks for the help. I finally made it, doing the merging separately.

Daniel Malter wrote:
> On a different note: how are you matching if AA has multiple matches in
> BB?

About that, all I have to do is check whether, for any of the rows of BB that match AA, the indicator equals 1. If not, the dummy variable assumes the value 0 in the original data.

Again, thank you for the patience.
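A hedged sketch of that last step, building on the merge-based indicator suggested further down this thread (row_id is an invented helper column for tracking the original rows of A):

## Tag each original row of A, merge against B, compute the match
## indicator, then mark a row with 1 if any of its BB matches checks out.
Adf <- data.frame(row_id = seq_len(nrow(A)), A)
newdata <- merge(Adf, B, by.x = "AA", by.y = "BB", all.x = TRUE)
ok <- with(newdata, (A1 == B1 | A1 == B2 | A1 == B3) &
                    (A2 == B1 | A2 == B2 | A2 == B3))
ok[is.na(ok)] <- FALSE
dum <- as.integer(tapply(ok, newdata$row_id, any))  # one 0/1 per row of A

Since row_id is numeric, tapply returns the groups in the original row order of A.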
Re: [R] Big data and column correspondence problem
If A has more columns than in your example, you could always try to merge only those columns of A with B that are relevant for the merging. You could then cbind the result of the merging back together with the rest of A, as long as the merged data preserve the same order as in A. Alternatively, you can always use chunks of A and do the merging separately, e.g., for blocks of 15,000 observations or so:

x <- sample(1:150, 15000, replace = TRUE)
y <- sample(1:150, 15000, replace = TRUE)
a <- rnorm(15000)
b <- rnorm(15000)

A <- cbind(x, a)
B <- cbind(y, b)

system.time(newdata <- merge(A, B, by.x = "x", by.y = "y",
                             all.x = TRUE, all.y = FALSE))

On a MacBook Pro with 4 GB of RAM and a 2.4 GHz Core 2 Duo processor, it would take you about 40 minutes if you do chunks of 15,000 observations. I am not sure whether the loop would be slower than that.

On a different note: how are you matching if AA has multiple matches in BB?

Daniel

murilofm wrote:
> Thanks Daniel, that helped me. Based on your suggestions I built this
> final code ...
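A hedged sketch of the block-wise variant, reusing the toy A, B, x, and y from above:

## Merge A against B in blocks of 15000 rows and stack the results;
## each merge then works on a small piece, keeping memory use bounded.
block  <- 15000
starts <- seq(1, nrow(A), by = block)
pieces <- lapply(starts, function(i) {
  chunk <- A[i:min(i + block - 1, nrow(A)), , drop = FALSE]
  merge(chunk, B, by.x = "x", by.y = "y", all.x = TRUE, all.y = FALSE)
})
newdata <- do.call(rbind, pieces)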
Re: [R] Big data and column correspondence problem
Thanks Daniel, that helped me. Based on your suggestions I built this final code:

library(foreign)
library(gdata)

AA = c(4,4,4,2,2,6,8,9)
A1 = c(3,3,11,5,5,7,11,12)
A2 = c(3,3,7,3,5,7,11,12)
A = cbind(AA, A1, A2)

BB = c(2,2,4,6,6)
B1 = c(5,11,7,13,NA)
B2 = c(4,12,11,NA,NA)
B3 = c(12,13,NA,NA,NA)

A = cbind(AA, A1, A2, 0)
B = cbind(BB, B1, B2, B3)

newdata <- merge(A, B, by.x = "AA", by.y = "BB", all.x = TRUE, all.y = FALSE)
newdata$dum <- rowSums(newdata[, matchcols(newdata, with = c("B"))] == newdata$A1,
                       na.rm = FALSE, dims = 1) *
               rowSums(newdata[, matchcols(newdata, with = c("B"))] == newdata$A2,
                       na.rm = FALSE, dims = 1)

colnames(A)[4] <- "dum"
newdata$dum1 <- newdata$dum
A_final <- merge(A, newdata,
                 by.x = c("AA", "A1", "A2", "dum"),
                 by.y = c("AA", "A1", "A2", "dum"),
                 all.x = TRUE, all.y = FALSE)

This gives me the same result as the "loop" version. Unfortunately, I can't replicate it on the original data, since I can't make the merge work: I get the error message "Reached total allocation of 4090Mb". So I'm stuck again.

If anyone could shed some light on this problem, I would really appreciate it.
Re: [R] Big data and column correspondence problem
This is much clearer. So here is what I think you want to do, in theory and in practice.

Theory: check if AA[i] is in BB. If AA[i] is in BB, then take the rows where BB[j] == AA[i] and check whether A1 and A2 are in B1 to B3. Is that right? Only if both are do you want the indicator to take 1.

Here is how you do this:

newdata <- merge(A, B, by.x = "AA", by.y = "BB", all.x = FALSE, all.y = FALSE)
A1.check <- with(newdata, A1 == B1 | A1 == B2 | A1 == B3)
A2.check <- with(newdata, A2 == B1 | A2 == B2 | A2 == B3)
A1.check <- replace(A1.check, which(is.na(A1.check)), 0)
A2.check <- replace(A2.check, which(is.na(A2.check)), 0)
newdata <- data.frame(newdata, A1.check, A2.check)
newdata$index <- with(newdata, ifelse(A1.check + A2.check == 2, 1, 0))

HTH,
Daniel

murilofm wrote:
> > > I can not see A1[1]=20 in your example data.
>
> Sorry about the typos: A1[1] = 3.
>
> > > Why B[3,]?
>
> Because AA[1] = BB[3] = 4.
>
> I will reformulate the example with the code I'm running:
>
> AA = c(4,4,4,2,2,6,8,9)
> A1 = c(3,3,11,5,5,7,11,12)
> A2 = c(3,3,7,3,5,7,11,12)
> A = cbind(AA, A1, A2)
>
> BB = c(2,2,4,6,6)
> B1 = c(5,11,7,13,NA)
> B2 = c(4,12,11,NA,NA)
> B3 = c(12,13,NA,NA,NA)
>
> A = cbind(AA, A1, A2, 0)
> B = cbind(BB, B1, B2, B3)
>
> for (i in 1:dim(A)[1]) {
>   if (!is.na(sum(match(A[i, 2:3], B[B[, 1] == A[i, 1], 2:dim(B)[2]]))))
>     A[i, 4] <- 1
> }
>
> Thanks
Re: [R] Big data and column correspondence problem
Hi

> Re: [R] Big data and column correspondence problem
>
> Daniel, thanks for the answer.
> I will try to make myself a little bit clearer. Doing it step by step I
> would have (using a loop through the rows of 'A'):

I am not sure you have succeeded in clarifying it.

> 1. AA[1] is 4. As such, I would have to compare A1[1] = 20 and A2[1] = 3
> with

I can not see A1[1] = 20 in your example data.

> A[1,]
>  AA A1 A2
>   4  3  3
>
> gives me this.
>
>            B1 B2 B3
> B[3, 2:4]   7 11 NA

Why B[3,]?

> because BB[3] = 4. Since there is no match, this would give me a zero.
> The same would happen with AA[2]. For AA[3] I have
>
>      AA A1 A2
> [3,]  4 11  7
>
> Since both A1[3] = 20 and A2[3] = 3 match with B[3, 2:4], this would
> give me 1.

In what sense do those two lines match?

A[3,]
 AA A1 A2
  4  5  5

B[3,]
 BB B1 B2 B3
  4  7 11 NA

I must say I am completely lost. Maybe you could present code with your toy data which gives the desired result but is too slow with the original data.

Regards
Petr

> 2. For AA[4:5] I would have to compare each line with B[1:2, 2:4]. That
> is, for AA[4] = 2 I have a match with BB[1] and BB[2]. Then I have to
> compare
>
>      A1 A2
> [4,]  5  5
>
> with
>
>            B1 B2 B3
> B[1, 2:4]   5  3 12
>
> and
>
>            B1 B2 B3
> B[2, 2:4]  11 12 13
>
> Again, for A1[4] and A2[4] I would have no match. But A1[5] and A2[5]
> match with B2[1] and B1[1].
>
> 3. And so on for the other lines of A.
>
> The problem is that if I perform that as a loop it really takes too
> long. Hope I could make it clearer.
Re: [R] Big data and column correspondence problem
Daniel, thanks for the answer. I will try to make myself a little bit clearer. Doing it step by step, I would have (using a loop through the rows of 'A'):

1. AA[1] is 4. As such, I would have to compare A1[1] = 20 and A2[1] = 3 with

           B1 B2 B3
B[3, 2:4]   7 11 NA

because BB[3] = 4. Since there is no match, this would give me a zero. The same would happen with AA[2]. For AA[3] I have

     AA A1 A2
[3,]  4 11  7

Since both A1[3] = 20 and A2[3] = 3 match with B[3, 2:4], this would give me 1.

2. For AA[4:5] I would have to compare each line with B[1:2, 2:4]. That is, for AA[4] = 2 I have a match with BB[1] and BB[2]. Then I have to compare

     A1 A2
[4,]  5  5

with

           B1 B2 B3
B[1, 2:4]   5  3 12

and

           B1 B2 B3
B[2, 2:4]  11 12 13

Again, for A1[4] and A2[4] I would have no match. But A1[5] and A2[5] match with B2[1] and B1[1].

3. And so on for the other lines of A.

The problem is that if I perform this as a loop it really takes too long. Hope I could make it clearer.
Re: [R] Big data and column correspondence problem
For question (a), do:

which(AA %in% BB)

Question (b) is very ambiguous to me. It makes little sense for your example because all values of BB are in AA. Therefore I am wondering whether you meant in question (a) that you want to find all values in BB that are in AA. That's not the same thing.

I am also not sure what exactly you mean by "within the lines of B that correspond to the values of AA". If you mean all the lines of B for which AA is in BB, then you get that by:

B[which(AA %in% BB), ]

However, this gives an error because AA has more values in BB than the number of rows in B. This leads me to believe that you might want

which(BB %in% AA)

for question (a). In this case you would get the lines of B by

B[which(BB %in% AA), ]

which in this example are all rows of B.

Again, part (b) is very opaque to me. It would help if you described step by step what it should do and what the outcome of every step along the way should be. Just from the final result it should produce and your description, I cannot make sense of it. But maybe another helper can.

Daniel

murilofm wrote:
> Greetings,
>
> I've been struggling for some time with a problem concerning a big
> database that I have to deal with. I'll try to exemplify my problem,
> since the database is really big. Suppose I have the following data:
>
> AA = c(4,4,4,2,2,6,8,9)
> A1 = c(3,3,5,5,5,7,11,12)
> A2 = c(3,3,5,5,5,7,11,12)
> A = cbind(AA, A1, A2)
>
> BB = c(2,2,4,6,6)
> B1 = c(5,11,7,13,NA)
> B2 = c(3,12,11,NA,NA)
> B3 = c(12,13,NA,NA,NA)
>
> B = cbind(BB, B1, B2, B3)
>
> I have to do the following:
>
> 1. Create a dummy (binary) variable in a new column of A that indicates,
> for each row, whether:
> a) the value from column AA can be found in BB, and
> b) within the lines of B that correspond to the value of AA, both A1
> and A2 can be found in B1, B2 or B3.
>
> In this example I would have
> [0,0,1,1,1,0,0,0]
>
> I've been able to do it with some loops; the problem is that since in
> the original data A has 2,936,044 lines and B has 14,965, it's taking
> forever to finish (probably because I'm doing it the wrong way).
>
> I would really appreciate any help or advice on how to deal with this.
> Thanks!
Re: [R] Big data (over 2GB) and lmer
On Thu, Oct 21, 2010 at 2:00 PM, Ben Bolker wrote:
> You might get more informed answers on r-sig-mixed-mod...@r-project.org ...

lmer already stores the model matrices and factors related to the random effects as sparse matrices. Depending on the complexity of the model (in particular, whether random effects are defined with respect to more than one grouping factor and, if so, whether those factors are nested), storing the Cholesky factor of the random-effects model matrix will be the limiting factor. This object has many slots, but only two very large ones, in the sense that they are long vectors. At present, vectors accessed or created by R are limited to 2^31 elements because they are indexed by 32-bit integers.

So the short answer is, "it depends". Simple models may be possible. Complex models will need to wait upon decisions about using wider integer representations in indices.
Re: [R] big data and lmer
Though bigmemory, ff, and other big-data solutions (databases, etc.) can easily manage massive data, their data objects are not natively compatible with all the advanced functionality of R. Exceptions include lm and glm (both ff and bigmemory support this via Lumley's biglm package), kmeans, and perhaps a few other things. In many cases, it's just a matter of someone deciding to port a tool or analysis of interest to one of these different object types -- we welcome collaborators and would be happy to offer advice if you want to adapt something for bigmemory structures!

Jay

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
Re: [R] Big data (over 2GB) and lmer
Michal Figurski <...@mail.med.upenn.edu> writes:

> I have a data set of roughly 10 million records, 7 columns. It is only
> about 500 MB as a csv, so it fits in memory. It's painfully slow to do
> anything with it, but it's possible. I also have another dataset of
> covariates that I would like to explore, with about 4 GB of data...
>
> I would like to merge the two datasets and use lmer to build a mixed
> effects model. Is there a way, for example using 'bigmemory' or 'ff',
> or any other trick, to enable lmer to work on this data set?

I don't think this will be easy.

Do you really need mixed effects for this task? That is, are your numbers per group sufficiently small that you will benefit from the shrinkage etc. afforded by mixed models? If you have (say) 10,000 individuals per group and 1000 groups, then I would expect you'd get very accurate estimates of the group coefficients; you could then calculate variances etc. among these estimates.

You might get more informed answers on r-sig-mixed-mod...@r-project.org ...

Ben Bolker
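A hedged sketch of that alternative (the data frame dat and the variables y, x, and grouping factor g are placeholders):

## With large groups, fit each group separately and then summarize the
## spread of the per-group estimates instead of fitting a mixed model.
fits  <- lapply(split(dat, dat$g), function(d) coef(lm(y ~ x, data = d)))
coefs <- do.call(rbind, fits)
colMeans(coefs)        # average coefficient across groups
apply(coefs, 2, var)   # between-group variability of the estimates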
Re: [R] big data
In addition to Dirk's advice about the biglm package, you may also want to look at the RSQLite and SQLiteDF packages, which may make dealing with the large dataset faster and easier, especially for passing the chunks to bigglm.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
> On Behalf Of André de Boer
> Sent: Wednesday, September 08, 2010 5:27 AM
> To: r-help@r-project.org
> Subject: [R] big data
>
> Hello,
>
> I searched the internet but didn't find the answer to the following
> problem: I want to do a glm on a csv file consisting of 25 columns and
> 4 million rows. Not all the columns are relevant. My problem is to read
> the data into R, manipulate the data, and then do the glm.
>
> I've tried:
>
> dd <- scan("myfile.csv", colClasses = classes)
> dat <- as.data.frame(dd)
>
> My question is: what is the right way to do this?
> Can someone give me a hint?
>
> Thanks,
> Arend
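A hedged sketch of passing chunks from an SQLite table to bigglm (the table and column names are invented; bigglm's data argument accepts a function that returns the next chunk of rows, or NULL when done, and is called with reset = TRUE to start a new pass):

library(biglm)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "huge.db")

## Chunk feeder for bigglm: reset = TRUE restarts the query; otherwise
## each call hands back the next block of rows (NULL at the end).
chunker <- function(con, chunksize = 100000) {
  res <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(res)) dbClearResult(res)
      res <<- dbSendQuery(con, "select y, x1, x2 from obs")
      return(NULL)
    }
    chunk <- dbFetch(res, n = chunksize)
    if (nrow(chunk) == 0) NULL else chunk
  }
}

fit <- bigglm(y ~ x1 + x2, data = chunker(con), family = binomial())
dbDisconnect(con)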
Re: [R] big data
On 8 September 2010 at 13:26, André de Boer wrote:
| I searched the internet but didn't find the answer to the following
| problem: I want to do a glm on a csv file consisting of 25 columns and
| 4 million rows. Not all the columns are relevant. My problem is to read
| the data into R, manipulate the data, and then do the glm.
|
| I've tried:
|
| dd <- scan("myfile.csv", colClasses = classes)
| dat <- as.data.frame(dd)
|
| My question is: what is the right way to do this?
| Can someone give me a hint?

Look at the biglm package by Thomas Lumley, which will allow you to fit glm models in "chunks".

Dirk

--
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
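For linear models, biglm itself can be updated chunk by chunk. A hedged sketch reading "myfile.csv" through one open connection (the formula and column names are placeholders; for glm families, bigglm takes a chunk-returning function instead, as sketched earlier in this digest):

library(biglm)

con <- file("myfile.csv", open = "r")
nm  <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line

fit <- NULL
repeat {
  ## read.csv on an open connection continues where the previous call
  ## stopped; it errors at end of file, which we treat as "done".
  chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 200000,
                             col.names = nm),
                    error = function(e) NULL)
  if (is.null(chunk) || nrow(chunk) == 0) break
  if (is.null(fit)) {
    fit <- biglm(y ~ x1 + x2, data = chunk)  # first chunk
  } else {
    fit <- update(fit, chunk)                # fold in the next chunk
  }
}
close(con)
summary(fit)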
Re: [R] big data file versus ram memory
On Dec 18, 2008, at 3:07 PM, Stephan Kolassa wrote:

> > 3) If I have a big enough RAM, would I be able to process whatever
> > data set? What constrains the practical limits of my data sets?
>
> From what I understand - little to nothing, beyond the time needed for
> computations.

Er, ... it depends. At a minimum, a person considering this should have read the FAQs. If this is a question about Windows, then see R-Win FAQ 2.9:

http://cran.r-project.org/bin/windows/base/rw-FAQ.html#There-seems-to-be-a-limit-on-the-memory-it-uses_0021

There has been quite a bit about this on the list over the last couple of years. Search the archives:

http://search.r-project.org/

--
David Winsemius
Re: [R] big data file versus ram memory
Hi Mauricio,

Mauricio Calvao schrieb:
> 1) I would like very much to use R for processing some big data files
> (around 1.7 or more GB) for spatial analysis, wavelets, and power
> spectra estimation; is this possible with R? Within IDL, such a big
> data set seems to be tractable...

There are some packages to handle large datasets, e.g., bigmemoRy. There were a couple of presentations on various ways to work with large datasets at the last useR! conference - take a look at the presentations at http://www.statistik.uni-dortmund.de/useR-2008/ You'll probably be most interested in the "High Performance" streams.

> 2) I have heard/read that R "puts all its data in RAM"? Does this
> really mean my data file cannot be bigger than my RAM?

The philosophy is basically to use RAM. Anything working outside RAM is not exactly heretical to R, but it does require some additional effort.

> 3) If I have a big enough RAM, would I be able to process whatever
> data set? What constrains the practical limits of my data sets?

From what I understand - little to nothing, beyond the time needed for computations.

HTH,
Stephan