On Fri, Sep 19, 2014 at 10:07 AM, Stephen HK Wong <hon...@stanford.edu> wrote:
> Thanks Henrik. It seems to fit my needs. One question I have is about the
> argument length.out=0.10*n: is it "randomly" taking out 10%? I found it
> basically takes every 10th row if I put length.out=0.1*n, and every 100th
> row if I put length.out=0.01*n, up to the end. I couldn't find this
> information in the documentation.
If you look at the call, argument 'rows' is just an integer (index)
vector that specifies which rows to read. I used seq(from=1, to=n,
length.out=0.10*n) as an illustration; see ?seq for how that works. If
you want a random sample, I recommend using sample() to generate that
index vector. If you're going to read the same data file many times, I
also recommend looking into what Greg suggested, particularly 'sqldf',
which does not take much effort to learn.

/Henrik

> Stephen HK Wong
> Stanford, California 94305-5324
>
> ----- Original Message -----
> From: Henrik Bengtsson <h...@biostat.ucsf.edu>
> To: Stephen HK Wong <hon...@stanford.edu>
> Cc: r-help@r-project.org
> Sent: Thu, 18 Sep 2014 18:33:15 -0700 (PDT)
> Subject: Re: [R] read.table() 1Gb text dataframe
>
> As a start, make sure you specify the 'colClasses' argument. BTW, using
> that you can even go to the extreme and read one column at a time, if it
> comes down to that.
>
> To read a 10% subset of the rows, you can use R.filesets as:
>
> library(R.filesets)
> db <- TabularTextFile(pathname)
> n <- nbrOfRows(db)
> data <- readDataFrame(db, rows=seq(from=1, to=n, length.out=0.10*n))
>
> It is also useful to specify 'colClasses' here. In addition to
> specifying them ordered by column, as for read.table(), you can also
> specify them by column names (or regular expressions of the column
> names), e.g.
>
> data <- readDataFrame(db, colClasses=c("*"="NULL", "(x|y)"="integer",
>   outcome="numeric", "id"="character"), rows=seq(from=1, to=n,
>   length.out=0.10*n))
>
> That 'colClasses' specifies that the default is to drop all columns,
> that columns 'x' and 'y' are read as integers, and so on.
>
> BTW, if you know 'n' upfront you can skip the setup of TabularTextFile
> and just do:
>
> data <- readDataFrame(pathname, rows=seq(from=1, to=n, length.out=0.10*n))
>
> Hope this helps,
>
> Henrik
>
> On Thu, Sep 18, 2014 at 4:48 PM, Stephen HK Wong <hon...@stanford.edu> wrote:
>> Dear All,
>>
>> I have a tab-delimited table of 4 columns and many millions of rows. I
>> don't have enough memory to read.table() in that 1 Gb file, and in fact
>> I have 12 text files like that. Is there a way that I can just randomly
>> read.table() in 10% of the rows? I was able to do that using the
>> colbycol package, but it is no longer available. Many thanks!!
>>
>> Stephen HK Wong
>> Stanford, California 94305-5324
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
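[Archive note] A minimal base-R sketch contrasting the two index vectors
discussed in this thread -- the evenly spaced one from seq() that explains
the "every 10th row" behaviour, and a genuinely random one from sample().
The value n = 100 is just an assumption for illustration; in practice n
would come from nbrOfRows(db) as above.

```r
## Suppose the file has n rows (n = 100 here for illustration).
n <- 100

## seq() gives evenly spaced indices -- roughly every 10th row,
## which is exactly the behaviour Stephen observed:
even_rows <- seq(from = 1, to = n, length.out = 0.10 * n)
even_rows   # 1 12 23 34 45 56 67 78 89 100

## sample() gives a random 10% subset instead:
set.seed(42)  # only so this example is reproducible
random_rows <- sort(sample(n, size = 0.10 * n))

## Either vector can then be passed as the 'rows' argument,
## e.g. readDataFrame(db, rows = random_rows).
```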