Re: [R] big data?

2014-08-07 Thread Spencer Graves
Correcting a typo (400 MB, not GB; thanks to David Winsemius for 
reporting it).  Spencer



###


  Thanks to all who replied.  For the record, I will summarize here 
what I tried and what I learned:



  Mike Harwood suggested the ff package.  David Winsemius suggested 
data.table and colbycol.  Peter Langfelder suggested sqldf.



  sqldf::read.csv.sql allowed me to create an SQL command to read a 
column or a subset of the rows of a 400 MB tab-delimited file in roughly 
a minute on a 2.3 GHz dual core machine running Windows 7 with 8 GB RAM. 
 It also read a column of a 1.3 GB file in 4 minutes.  The 
documentation was sufficient to allow me to easily get what I wanted 
with a minimum of effort.
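
For reference, a minimal sketch of that kind of call (untested here; 
"myfile.tsv" and the column name "x" are placeholders, and read.csv.sql 
exposes the file to SQL as a table named "file"):

library(sqldf)
## pull a single column from a tab-delimited file
x <- read.csv.sql("myfile.tsv", sql = "select x from file",
                  header = TRUE, sep = "\t")
## pull a subset of the rows
sub <- read.csv.sql("myfile.tsv",
                    sql = "select * from file where x > 0",
                    header = TRUE, sep = "\t")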



  If I needed to work with these data regularly, I might experiment 
with colbycol and ff:  The documentation suggested to me that these 
packages might allow me to get quicker answers from routine tasks after 
some preprocessing.  Of course, I could also do the preprocessing 
manually with sqldf.



  Thanks, again.
  Spencer


On 8/6/2014 9:39 AM, Mike Harwood wrote:

The read.table.ffdf function in the ff package can read in delimited files
and store them to disk as individual columns.  The ffbase package provides
additional data management and analytic functionality.  I have used these
packages on 15 GB files of 18 million rows and 250 columns.


On Tuesday, August 5, 2014 1:39:03 PM UTC-5, David Winsemius wrote:


On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote:


  What tools do you like for working with tab delimited text files up

to 1.5 GB (under Windows 7 with 8 GB RAM)?

?data.table::fread


  Standard tools for smaller data sometimes grab all the available

RAM, after which CPU usage drops to 3% ;-)


  The "bigmemory" project won the 2010 John Chambers Award but "is

not available (for R version 3.1.0)".


  findFn("big data", 999) downloaded 961 links in 437 packages. That

contains tools for data in PostgreSQL and other formats, but I couldn't find
anything for large tab delimited text files.


  Absent a better idea, I plan to write a function getField to

extract a specific field from the data, then use that to split the data
into 4 smaller files, which I think should be small enough that I can do
what I want.

There is the colbycol package with which I have no experience, but I
understand it is designed to partition data into column sized objects.
#--- from its help file-
cbc.get.col {colbycol}    R Documentation
Reads a single column from the original file into memory

Description

Function cbc.read.table reads a file, stores it column by column in disk
file and creates a colbycol object. Function cbc.get.col queries this object
and returns a single column.


  Thanks,
  Spencer

__
r-h...@r-project.org  mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide

http://www.R-project.org/posting-guide.html

and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

__
r-h...@r-project.org  mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




Re: [R] big data?

2014-08-06 Thread Mike Harwood
The read.table.ffdf function in the ff package can read in delimited files 
and store them to disk as individual columns.  The ffbase package provides 
additional data management and analytic functionality.  I have used these 
packages on 15 GB files of 18 million rows and 250 columns.
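
A minimal sketch of that workflow (untested here; the file name, separator, 
chunk sizes, and the column name "x" are placeholders):

library(ff)
library(ffbase)
## read a delimited file into an on-disk ffdf, chunk by chunk
dat <- read.table.ffdf(file = "big.tsv", header = TRUE, sep = "\t",
                       first.rows = 1e5, next.rows = 5e5)
dim(dat)       # dimensions without pulling the data into RAM
x <- dat$x[]   # materialize a single column in RAM only when needed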


On Tuesday, August 5, 2014 1:39:03 PM UTC-5, David Winsemius wrote:
>
>
> On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote: 
>
> >  What tools do you like for working with tab delimited text files up 
> to 1.5 GB (under Windows 7 with 8 GB RAM)? 
>
> ?data.table::fread 
>
> >  Standard tools for smaller data sometimes grab all the available 
> RAM, after which CPU usage drops to 3% ;-) 
> > 
> > 
> >  The "bigmemory" project won the 2010 John Chambers Award but "is 
> not available (for R version 3.1.0)". 
> > 
> > 
> >  findFn("big data", 999) downloaded 961 links in 437 packages. That 
> contains tools for data in PostgreSQL and other formats, but I couldn't find 
> anything for large tab delimited text files. 
> > 
> > 
> >  Absent a better idea, I plan to write a function getField to 
> extract a specific field from the data, then use that to split the data 
> into 4 smaller files, which I think should be small enough that I can do 
> what I want. 
>
> There is the colbycol package with which I have no experience, but I 
> understand it is designed to partition data into column sized objects. 
> #--- from its help file- 
> cbc.get.col {colbycol}    R Documentation
> Reads a single column from the original file into memory 
>
> Description 
>
> Function cbc.read.table reads a file, stores it column by column in disk 
> file and creates a colbycol object. Function cbc.get.col queries this object
> and returns a single column. 
>
> >  Thanks, 
> >  Spencer 
> > 
> > __ 
> > r-h...@r-project.org  mailing list 
> > https://stat.ethz.ch/mailman/listinfo/r-help 
> > PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html 
> > and provide commented, minimal, self-contained, reproducible code. 
>
> David Winsemius 
> Alameda, CA, USA 
>
> __ 
> r-h...@r-project.org  mailing list 
> https://stat.ethz.ch/mailman/listinfo/r-help 
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html 
> and provide commented, minimal, self-contained, reproducible code. 
>
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data?

2014-08-05 Thread David Winsemius

On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote:

>  What tools do you like for working with tab delimited text files up to 
> 1.5 GB (under Windows 7 with 8 GB RAM)?

?data.table::fread
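
For instance (an untested sketch; "myfile.tsv" and the column name "x" are 
placeholders):

library(data.table)
DT <- fread("myfile.tsv", sep = "\t", select = "x")  # read only the column(s) you need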

>  Standard tools for smaller data sometimes grab all the available RAM, 
> after which CPU usage drops to 3% ;-)
> 
> 
>  The "bigmemory" project won the 2010 John Chambers Award but "is not 
> available (for R version 3.1.0)".
> 
> 
>  findFn("big data", 999) downloaded 961 links in 437 packages. That 
> contains tools for data in PostgreSQL and other formats, but I couldn't find 
> anything for large tab delimited text files.
> 
> 
>  Absent a better idea, I plan to write a function getField to extract a 
> specific field from the data, then use that to split the data into 4 smaller 
> files, which I think should be small enough that I can do what I want.

There is the colbycol package with which I have no experience, but I understand 
it is designed to partition data into column sized objects.
#--- from its help file-
cbc.get.col {colbycol}  R Documentation
Reads a single column from the original file into memory

Description

Function cbc.read.table reads a file, stores it column by column in disk file 
and creates a colbycol object. Function cbc.get.col queries this object and 
returns a single column.
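
A rough sketch of that usage (untested; "big.tsv" and the column name "x" 
are placeholders, and it assumes cbc.read.table forwards header/sep to 
read.table and that cbc.get.col accepts a column name):

library(colbycol)
cbc <- cbc.read.table("big.tsv", header = TRUE, sep = "\t")  # parse once, store columns on disk
x <- cbc.get.col(cbc, "x")                                   # bring one column back into memory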

>  Thanks,
>  Spencer
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data?

2014-08-05 Thread Peter Langfelder
Have you tried read.csv.sql from package sqldf?

Peter

On Tue, Aug 5, 2014 at 10:20 AM, Spencer Graves
 wrote:
>   What tools do you like for working with tab delimited text files up to
> 1.5 GB (under Windows 7 with 8 GB RAM)?
>
>
>   Standard tools for smaller data sometimes grab all the available RAM,
> after which CPU usage drops to 3% ;-)
>
>
>   The "bigmemory" project won the 2010 John Chambers Award but "is not
> available (for R version 3.1.0)".
>
>
>   findFn("big data", 999) downloaded 961 links in 437 packages. That
> contains tools for data in PostgreSQL and other formats, but I couldn't find
> anything for large tab delimited text files.
>
>
>   Absent a better idea, I plan to write a function getField to extract a
> specific field from the data, then use that to split the data into 4 smaller
> files, which I think should be small enough that I can do what I want.
>
>
>   Thanks,
>   Spencer
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big Data reading subsample csv

2012-08-16 Thread Greg Snow
The read.csv.sql function in the sqldf package may make this approach
quite simple.
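
Something along these lines, for example (an untested sketch; "big.csv" is a
placeholder, and the WHERE clause leans on SQLite's random() to keep roughly
1% of the rows):

library(sqldf)
samp <- read.csv.sql("big.csv",
                     sql = "select * from file where abs(random()) % 100 = 0",
                     header = TRUE, sep = ",")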

On Thu, Aug 16, 2012 at 10:12 AM, jim holtman  wrote:
> Why not put this into a database, and then you can easily extract the
> records you want by specifying the record numbers.  You pay the one-time
> expense of creating the database, but then have much faster access to
> the data as you make subsequent runs.
>
> On Thu, Aug 16, 2012 at 9:44 AM, Tudor Medallion
>  wrote:
>> Hello,
>>
>> I'm most grateful for your time to read this.
>>
>> I have an uber-size 30GB file of 6 million records and 3000 (mostly
>> categorical data) columns in csv format. I want to bootstrap subsamples for
>> multinomial regression, but it's proving difficult even with the 64GB RAM
>> in my machine and twice that in swap; the process becomes super slow
>> and halts.
>>
>> I'm thinking about generating subsample indices in R and feeding them into
>> a system command using sed or awk, but don't know how to do this. If
>> someone knew of a clean way to do this using just R commands, I would be
>> really grateful.
>>
>> One problem is that I need to pick complete observations of subsamples,
>> that is I need to have all the rows of a particular multinomial observation
>> - they are not the same length from observation to observation. I plan to
>> use glmnet and then some fancy transforms to get an approximation to the
>> multinomial case. One other point is that I don't know how to choose sample
>> size to fit around memory limits.
>>
>> Appreciate your thoughts greatly.
>>
>>
>>> R.version
>>
>> platform   x86_64-pc-linux-gnu
>> arch   x86_64
>> os linux-gnu
>> system x86_64, linux-gnu
>> status
>> major  2
>> minor  15.1
>> year   2012
>> month  06
>> day22
>> svn rev59600
>> language   R
>> version.string R version 2.15.1 (2012-06-22)
>> nickname   Roasted Marshmallows
>>
>>
>> tags: read.csv(), system(), awk, sed, sample(), glmnet, multinomial, MASS.
>>
>> Yoda
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big Data reading subsample csv

2012-08-16 Thread jim holtman
Why not put this into a database, and then you can easily extract the
records you want by specifying the record numbers.  You pay the one-time
expense of creating the database, but then have much faster access to
the data as you make subsequent runs.
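
A rough sketch of that approach with RSQLite (untested; "big.csv", the table
name, and the chunk size are placeholders):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), "big.db")

## one-time load, in chunks, so the csv never has to fit in RAM
infile <- file("big.csv", open = "r")
hdr <- strsplit(readLines(infile, n = 1), ",")[[1]]   # column names from the header row
repeat {
  chunk <- tryCatch(read.csv(infile, header = FALSE, nrows = 100000,
                             col.names = hdr),
                    error = function(e) NULL)         # NULL once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  dbWriteTable(con, "big", chunk, append = TRUE)
}
close(infile)

## subsequent runs: pull only the records you need, e.g. by rowid
sub <- dbGetQuery(con, "select * from big where rowid between 1 and 50000")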

On Thu, Aug 16, 2012 at 9:44 AM, Tudor Medallion
 wrote:
> Hello,
>
> I'm most grateful for your time to read this.
>
> I have an uber-size 30GB file of 6 million records and 3000 (mostly
> categorical data) columns in csv format. I want to bootstrap subsamples for
> multinomial regression, but it's proving difficult even with the 64GB RAM
> in my machine and twice that in swap; the process becomes super slow
> and halts.
>
> I'm thinking about generating subsample indices in R and feeding them into
> a system command using sed or awk, but don't know how to do this. If
> someone knew of a clean way to do this using just R commands, I would be
> really grateful.
>
> One problem is that I need to pick complete observations of subsamples,
> that is I need to have all the rows of a particular multinomial observation
> - they are not the same length from observation to observation. I plan to
> use glmnet and then some fancy transforms to get an approximation to the
> multinomial case. One other point is that I don't know how to choose sample
> size to fit around memory limits.
>
> Appreciate your thoughts greatly.
>
>
>> R.version
>
> platform   x86_64-pc-linux-gnu
> arch   x86_64
> os linux-gnu
> system x86_64, linux-gnu
> status
> major  2
> minor  15.1
> year   2012
> month  06
> day22
> svn rev59600
> language   R
> version.string R version 2.15.1 (2012-06-22)
> nickname   Roasted Marshmallows
>
>
> tags: read.csv(), system(), awk, sed, sample(), glmnet, multinomial, MASS.
>
> Yoda
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-27 Thread murilofm
Daniel, thanks for the help. I finally made it, doing the merging separately.


Daniel Malter wrote:
> 
> On a different note: how are you matching if AA has multiple matches in
> BB?
> 

About that, all I have to do is check whether, for any of the BB which
matches with AA, the indicator equals 1. If not, the dummy variable assumes
the value 0 in the original data.

Again, thank you for your patience.

--
View this message in context: 
http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3699988.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-26 Thread Daniel Malter
If A has more columns than in your example, you could always try to only
merge those columns of A with B that are relevant for the merging. You could
then cbind the result of the merging back together with the rest of A as
long as the merged data preserved the same order as in A.

Alternatively, you can always use chunks of A and do the merging separately,
e.g., for blocks of 10000 observations or so.

x<-sample(1:150,15000,replace=T)
y<-sample(1:150,15000,replace=T)
a<-rnorm(15000)
b<-rnorm(15000)
A<-cbind(x,a)
B<-cbind(y,b)
system.time(newdata<-merge(A,B,by.x='x',by.y='y',all.x=T,all.y=F))

On a MacBook Pro with 4 GB of RAM and a 2.4 GHz Duo Core processor it would
take you about 40 minutes if you do chunks of 15000 observations. I am not
sure whether the loop would be slower than that.
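
For instance, the chunked version might look roughly like this (untested
sketch; it reuses A and B from the example above and a placeholder chunk
size):

chunk.size <- 15000
starts <- seq(1, nrow(A), by = chunk.size)
pieces <- lapply(starts, function(s) {
  rows <- s:min(s + chunk.size - 1, nrow(A))       # one block of A at a time
  merge(A[rows, , drop = FALSE], B, by.x = "x", by.y = "y",
        all.x = TRUE, all.y = FALSE)
})
newdata <- do.call(rbind, pieces)                  # same columns as the one-shot merge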

On a different note: how are you matching if AA has multiple matches in BB?

Daniel


murilofm wrote:
> 
> Thanks Daniel, that helped me. Based on your suggestions I built this
> final code:
> 
> library(foreign)
> library(gdata)
> 
> AA = c(4,4,4,2,2,6,8,9) 
> A1 = c(3,3,11,5,5,7,11,12) 
> A2 = c(3,3,7,3,5,7,11,12) 
> A = cbind(AA, A1, A2) 
> 
> BB = c(2,2,4,6,6) 
> B1 =c(5,11,7,13,NA) 
> B2 =c(4,12,11,NA,NA) 
> B3 =c(12,13,NA,NA,NA) 
> 
> A = cbind(AA, A1, A2,0) 
> B=cbind(BB,B1,B2,B3) 
> 
> newdata<-merge(A,B,by.x='AA',by.y='BB',all.x=T,all.y=F)
> newdata$dum <- rowSums (newdata[,matchcols(newdata,
> with=c("B"))]==newdata$A1, na.rm = FALSE, dims = 1)*
> rowSums (newdata[,matchcols(newdata, with=c("B"))]==newdata$A2, na.rm =
> FALSE, dims = 1)
> 
> colnames(A)[4]<-"dum"
> newdata$dum1<-newdata$dum
> A_final<-merge(A,newdata,by.x=c("AA","A1","A2","dum"),by.y=c("AA","A1","A2","dum"),all.x=T,all.y=F)
> 
> Which gives me the same result as the "loop" version. Unfortunately, I
> can't replicate it on the original data since I can't make the merge work:
> I get an error message "Reached total allocation of 4090Mb". So, I'm stuck
> again.
> 
> If anyone could shed some light on this problem, I would really
> appreciate it.
> 

--
View this message in context: 
http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3697709.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-26 Thread murilofm
Thanks Daniel, that helped me. Based on your suggestions I built this final
code:

library(foreign)
library(gdata)

AA = c(4,4,4,2,2,6,8,9) 
A1 = c(3,3,11,5,5,7,11,12) 
A2 = c(3,3,7,3,5,7,11,12) 
A = cbind(AA, A1, A2) 

BB = c(2,2,4,6,6) 
B1 =c(5,11,7,13,NA) 
B2 =c(4,12,11,NA,NA) 
B3 =c(12,13,NA,NA,NA) 

A = cbind(AA, A1, A2,0) 
B=cbind(BB,B1,B2,B3) 

newdata<-merge(A,B,by.x='AA',by.y='BB',all.x=T,all.y=F)
newdata$dum <- rowSums (newdata[,matchcols(newdata,
with=c("B"))]==newdata$A1, na.rm = FALSE, dims = 1)*
rowSums (newdata[,matchcols(newdata, with=c("B"))]==newdata$A2, na.rm =
FALSE, dims = 1)

colnames(A)[4]<-"dum"
newdata$dum1<-newdata$dum
A_final<-merge(A,newdata,by.x=c("AA","A1","A2","dum"),by.y=c("AA","A1","A2","dum"),all.x=T,all.y=F)

Which gives me the same result as the "loop" version. Unfortunately, I can't
replicate it on the original data since I can't make the merge work: I get
an error message "Reached total allocation of 4090Mb". So, I'm stuck again.

If anyone could shed some light on this problem, I would really appreciate it.

--
View this message in context: 
http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3697557.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-26 Thread Daniel Malter
This is much clearer. So here is what I think you want to do. In theory and
practice:

Theory: 

Check if AA[i] is in BB

If AA[i] is in BB, then take the row where BB[j] == AA[i] and check whether
A1 and A2 are in B1 to B3. Is that right? Only if both are do you want the
indicator to take 1.

Here is how you do this:

newdata<-merge(A,B,by.x='AA',by.y='BB',all.x=F,all.y=F)

A1.check<-with(newdata,A1==B1|A1==B2|A1==B3)
B1.check<-with(newdata,A2==B1|A2==B2|A2==B3)

A1.check<-replace(A1.check,which(is.na(A1.check)),0)
B1.check<-replace(B1.check,which(is.na(B1.check)),0)

newdata<-data.frame(newdata,A1.check,B1.check)

newdata$index<-with(newdata,ifelse(A1.check+B1.check==2,1,0))

HTH,
Daniel


murilofm wrote:
> 
>>>I can not see A1[1]=20 in your example data. 
> 
> Sorry about the typos A1[1]=3.
> 
>>> Why B[3,]?
> 
> Because AA[1]=BB[3]=4.
> 
> I will reformulate the example with the code I'm running:
> 
> AA = c(4,4,4,2,2,6,8,9) 
> A1 = c(3,3,11,5,5,7,11,12) 
> A2 = c(3,3,7,3,5,7,11,12) 
> A = cbind(AA, A1, A2)
> 
> BB = c(2,2,4,6,6) 
> B1 =c(5,11,7,13,NA) 
> B2 =c(4,12,11,NA,NA) 
> B3 =c(12,13,NA,NA,NA) 
> 
> A = cbind(AA, A1, A2,0) 
> B=cbind(BB,B1,B2,B3)
> 
> for(i in 1:dim(A)[1]){
> if (!is.na(sum(match(A[i,2:3],B[B[,1]==A[i,1],2:dim(B)[2]])))) A[i,4]<-1
> }
> 
> Thanks
> 

--
View this message in context: 
http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3697067.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-26 Thread Petr PIKAL
Hi

> Re: [R] Big data and column correspondence problem
> 
> Daniel, thanks for the answer.
> I will try to make myself a little bit clearer. Doing step by step I would
> have (using a loop through the lines of 'A'):

I am not sure if you are successful in your clarifying.

> 
> 1. AA[1] is 4. As such, I would have to compare A1[1] = 20 and A2[1] = 3 with

I can not see A1[1]=20 in your example data.

> A[1,]
AA A1 A2 
 4  3  3 

gives me this.


> 
>B1 B2 B3
> B[3,2:4] 7 11 NA

Why B[3,]? 

> 
> because BB[3]=4. Since there is no match, this would return a zero.
> The same would happen with AA[2]. For AA[3] I have 
> 
>  AA A1 A2
> [3,]  4 11  7
> 
> Since both A1[3] = 20 and A2[3] = 3 match with B[3,2:4] this would return 1.

In what sense those two lines match?
A[3,]
AA A1 A2 
 4  5  5 
B[3,]
BB B1 B2 B3 
 4  7 11 NA 

I must say I am completely lost.

Maybe you could try to present code with your toy data which gives the
desired result but is too slow with the original data.

Regards
Petr


> 
> 2. For AA[4:5] I would have to compare each line with B[1:2,2:4]. That is,
> for AA[4]=2 I have a match with BB[1] and BB[2]. Then I have to compare
> 
>   A1 A2
> [4,]  5  5
> 
> with 
> 
>B1 B2 B3
> B[1,2:4] 5  3  12
> 
> and
> 
> B1 B2 B3
> B[2,2:4] 11 12 13
> 
> Again, for A1[4] and A2[4] I would have no match. But A1[5] and A2[5]
> match with B2[1] and B1[1].
> 
> 3. And so on for the other lines of A.
> 
> The problem is that if I perform that as a loop it really takes too long.
> Hope I could make it clearer.
> 
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3695795.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-26 Thread murilofm
Daniel, thanks for the answer.
I will try to make myself a little bit clearer. Doing step by step I would
have (using a loop through the lines of 'A'):

1. AA[1] is 4. As such, I would have to compare A1[1] = 20 and A2[1] = 3 with

   B1 B2 B3
B[3,2:4] 7 11 NA

because BB[3]=4. Since there is no match, this would return a zero.
The same would happen with AA[2]. For AA[3] I have 

 AA A1 A2
[3,]  4 11  7

Since both A1[3] = 20 and A2[3] = 3 match with B[3,2:4] this would return 1.

2. For AA[4:5] I would have to compare each line with B[1:2,2:4]. That is,
for AA[4]=2 I have a match with BB[1] and BB[2]. Then I have to compare

  A1 A2
[4,]  5  5

with 

   B1 B2 B3
B[1,2:4] 5  3  12

and

B1 B2 B3
B[2,2:4] 11 12 13

Again, for A1[4] and A2[4] I would have no match. But A1[5] and A2[5]
match with B2[1] and B1[1].

3. And so on for the other lines of A.

The problem is that if I perform that as a loop it really takes too long.
Hope I could make it clearer.

--
View this message in context: 
http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3695795.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data and column correspondence problem

2011-07-26 Thread Daniel Malter
For question (a), do:

which(AA%in%BB)

Question (b) is very ambiguous to me. It makes little sense for your example
because all values of BB are in AA. Therefore I am wondering whether you
meant in question (a) that you want to find all values in BB that are in AA.
That's not the same thing. I am also not sure what exactly you mean by
"within the lines of B that correspond to the values of AA." If you mean "all
the lines of B for which AA is in BB," then you get that by:

B[which(AA%in%BB) , ]

However, this gives an error because AA has more values in BB than the
number of rows in B. This leads me to believe that you might want 

which(BB%in%AA) 

for question (a). In this case you would get the lines of B by

B[which(BB%in%AA) , ]

which in this example are all rows of B.


Again, part (b) is very opaque to me. It would help if you described it step
by step as to what it should and what the outcome of every step along the
way should be. Just from the final result that it should produce and your
description, I cannot make sense of it. But maybe another helper can.

Daniel



murilofm wrote:
> 
> Greetings,
> 
> I've been struggling for some time with a problem concerning a big
> database that I have to deal with.
> I'll try to exemplify my problem since the database is really big.
> Suppose I have the following data:
> 
> AA = c(4,4,4,2,2,6,8,9)
> A1 = c(3,3,5,5,5,7,11,12)
> A2 = c(3,3,5,5,5,7,11,12)
> A = cbind(AA, A1, A2)
> 
> BB = c(2,2,4,6,6)
> B1 =c(5,11,7,13,NA)
> B2 =c(3,12,11,NA,NA)
> B3 =c(12,13,NA,NA,NA)
> 
> B=cbind(BB,B1,B2,B3)
> 
> I have to do the following:
> 
> 1. Create a dummy (binary) variable in a new column of A that indicates
> if, for each row:
> a) the value from the column AA can be found in BB
> b) within the lines of B that correspond to the value of AA, I can find
> both A1 and A2 in B1, B2 or B3.
> 
> In this example i would have
> [0,0,1,1,1,0,0,0]
> 
> I have been able to do it with some loops; the problem is that since in the
> original data A has 2.936.044 lines and B has 14.965 it's taking forever
> to finish (probably because I might be doing it the wrong way).
> 
> I would really appreciate any help or advice on how to deal with this.
> Thanks!
> 

--
View this message in context: 
http://r.789695.n4.nabble.com/Big-data-and-column-correspondence-problem-tp3694912p3695065.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data (over 2GB) and lmer

2010-10-23 Thread Douglas Bates
On Thu, Oct 21, 2010 at 2:00 PM, Ben Bolker  wrote:
> Michal Figurski  mail.med.upenn.edu> writes:
>
>> I have a data set of roughly 10 million records, 7 columns. It has only
>> about 500MB as a csv, so it fits in the memory. It's painfully slow to
>> do anything with it, but it's possible. I also have another dataset of
>> covariates that I would like to explore - with about 4GB of data...
>>
>> I would like to merge the two datasets and use lmer to build a mixed
>> effects model. Is there a way, for example using 'bigmemory' or 'ff', or
>> any other trick, to enable lmer to work on this data set?
>
>   I don't think this will be easy.
>
>   Do you really need mixed effects for this task?  i.e., are
> your numbers per group sufficiently small that you will benefit
> from the shrinkage etc. afforded by mixed models?  If you have
> (say) 10000 individuals per group, 1000 groups, then I would
> expect you'd get very accurate estimates of the group coefficients,
> you could then calculate variances etc. among these estimates.
>
>   You might get more informed answers on r-sig-mixed-mod...@r-project.org ...

lmer already stores the model matrices and factors related to the
random effects as sparse matrices.  Depending on the complexity of the
model - in particular, if random effects are defined with respect to
more than one grouping factor and, if so, if those factors are nested
or not - storing the Cholesky factor of the random effects model
matrix will be the limiting factor.  This object has many slots but
only two very large ones in the sense that they are long vectors.  At
present vectors accessed or created by R are limited to 2^31 elements
because they are indexed by 32-bit integers.

So the short answer is, "it depends".  Simple models may be possible.
Complex models will need to wait upon decisions about using wider
integer representations in indices.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data and lmer

2010-10-22 Thread Jay Emerson
Though bigmemory, ff, and other big data solutions (databases, etc...)
can help easily manage massive data, their data objects are not
natively compatible with all the advanced functionality of R.
Exceptions include lm and glm (both ff and bigmemory support this via
Lumley's biglm package), kmeans, and perhaps a few other things.  In
many cases, it's just a matter of someone deciding to port a
tool/analysis of interest to one of these different object types -- we
welcome collaborators and would be happy to offer advice if you want
to adapt something for bigmemory structures!

Jay

-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Big data (over 2GB) and lmer

2010-10-21 Thread Ben Bolker
Michal Figurski  mail.med.upenn.edu> writes:

> I have a data set of roughly 10 million records, 7 columns. It has only 
> about 500MB as a csv, so it fits in the memory. It's painfully slow to 
> do anything with it, but it's possible. I also have another dataset of 
> covariates that I would like to explore - with about 4GB of data...
> 
> I would like to merge the two datasets and use lmer to build a mixed 
> effects model. Is there a way, for example using 'bigmemory' or 'ff', or 
> any other trick, to enable lmer to work on this data set?

   I don't think this will be easy.

   Do you really need mixed effects for this task?  i.e., are
your numbers per group sufficiently small that you will benefit
from the shrinkage etc. afforded by mixed models?  If you have
(say) 10000 individuals per group, 1000 groups, then I would
expect you'd get very accurate estimates of the group coefficients,
you could then calculate variances etc. among these estimates.

   You might get more informed answers on r-sig-mixed-mod...@r-project.org ...

  Ben Bolker

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data

2010-09-08 Thread Greg Snow
In addition to Dirk's advice about the biglm package, you may also want to look 
at the RSQLite and SQLiteDF packages which may make dealing with the large 
dataset faster and easier, especially for passing the chunks to bigglm.
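
A rough sketch of that chunked fit (untested; the database file, table name,
and the columns y, x1, x2 are placeholders, and it uses bigglm's documented
option of passing a data-supplying function):

library(DBI)
library(RSQLite)
library(biglm)

con <- dbConnect(SQLite(), "mydata.db")
chunk.size <- 100000
offset <- 0

next.chunk <- function(reset = FALSE) {
  ## bigglm calls this repeatedly; return NULL when no rows are left
  if (reset) { offset <<- 0; return(NULL) }
  dat <- dbGetQuery(con, sprintf(
    "select y, x1, x2 from mydata limit %d offset %d", chunk.size, offset))
  if (nrow(dat) == 0) return(NULL)
  offset <<- offset + chunk.size
  dat
}

fit <- bigglm(y ~ x1 + x2, data = next.chunk, family = binomial())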

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111


> -Original Message-
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-
> project.org] On Behalf Of André de Boer
> Sent: Wednesday, September 08, 2010 5:27 AM
> To: r-help@r-project.org
> Subject: [R] big data
> 
> Hello,
> 
> I searched the internet but I didn't find the answer to the following
> problem:
> I want to do a glm on a csv file consisting of 25 columns and 4 mln
> rows.
> Not all the columns are relevant. My problem is to read the data into
> R, manipulate the data, and then do a glm.
> 
> I've tried with:
> 
> dd<-scan("myfile.csv",colClasses=classes)
> dat<-as.data.frame(dd)
> 
> My question is: what is the right way to do this?
> Can someone give me a hint?
> 
> Thanks,
> Arend
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data

2010-09-08 Thread Dirk Eddelbuettel

On 8 September 2010 at 13:26, André de Boer wrote:
| I searched the internet but I didn't find the answer to the following problem:
| I want to do a glm on a csv file consisting of 25 columns and 4 mln rows.
| Not all the columns are relevant. My problem is to read the data into R,
| manipulate the data, and then do a glm.
| 
| I've tried with:
| 
| dd<-scan("myfile.csv",colClasses=classes)
| dat<-as.data.frame(dd)
| 
| My question is: what is the right way to do this?
| Can someone give me a hint?

Look at the biglm package by Thomas Lumley which will allow you to fit glm
models in "chunks".  

Dirk

-- 
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data file versus ram memory

2008-12-18 Thread David Winsemius


On Dec 18, 2008, at 3:07 PM, Stephan Kolassa wrote:


Hi Mauricio,

Mauricio Calvao schrieb:
1) I would like very much to use R for processing some big data  
files (around 1.7 or more GB) for spatial analysis, wavelets, and  
power spectra estimation; is this possible with R? Within IDL, such  
a big data set seems to be tractable...


There are some packages to handle large datasets, e.g., bigmemoRy.  
There were a couple of presentations on various ways to work with  
large datasets at the last useR conference - take a look at the  
presentations at

http://www.statistik.uni-dortmund.de/useR-2008/
You'll probably be most interested in the "High Performance" streams.

2) I have heard/read that R "puts all its data on ram"? Does this  
really mean my data file cannot be bigger than my ram memory?


The philosophy is basically to use RAM. Anything working outside RAM  
is not exactly heretical to R, but it does require some additional  
effort.


3) If I have a big enough ram, would I be able to process whatever  
data set?? What constrains the practical limits of my data sets??


From what I understand - little to nothing, beyond the time needed  
for computations.


Er, ... it depends. At a minimum a person considering this should have  
read the FAQs. If this is a question about Windows, then R-Win FAQ 2.9:


http://cran.r-project.org/bin/windows/base/rw-FAQ.html#There-seems-to-be-a-limit-on-the-memory-it-uses_0021

There has been quite a bit about this in the list over the last couple  
of years. Search the archives:

http://search.r-project.org/

--
David Winsemius




HTH,
Stephan

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] big data file versus ram memory

2008-12-18 Thread Stephan Kolassa

Hi Mauricio,

Mauricio Calvao schrieb:
1) I would like very much to use R for processing some big data files 
(around 1.7 or more GB) for spatial analysis, wavelets, and power 
spectra estimation; is this possible with R? Within IDL, such a big data 
set seems to be tractable...


There are some packages to handle large datasets, e.g., bigmemoRy. There 
were a couple of presentations on various ways to work with large 
datasets at the last useR conference - take a look at the presentations at

http://www.statistik.uni-dortmund.de/useR-2008/
You'll probably be most interested in the "High Performance" streams.

2) I have heard/read that R "puts all its data on ram"? Does this really 
mean my data file cannot be bigger than my ram memory?


The philosophy is basically to use RAM. Anything working outside RAM is 
not exactly heretical to R, but it does require some additional effort.


3) If I have a big enough ram, would I be able to process whatever data 
set?? What constrains the practical limits of my data sets??


From what I understand - little to nothing, beyond the time needed for 
computations.


HTH,
Stephan

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.