On Mar 8, 2013, at 10:59 AM, David Winsemius wrote:

>
> On Mar 8, 2013, at 9:31 AM, David Winsemius wrote:
>
>>
>> On Mar 8, 2013, at 6:01 AM, Jan van der Laan wrote:
>>
>>>
>>> You could use the fact that scan reads the data row-wise, and the fact
>>> that arrays are stored column-wise:
>>>
>>> # generate a small example dataset
>>> exampl <- array(letters[1:25], dim=c(5,5))
>>> write.table(exampl, file="example.dat", row.names=FALSE, col.names=FALSE,
>>>     sep="\t", quote=FALSE)
>>>
>>
>> This might avoid creation of some of the intermediate copies:
>>
>> MASS::write.matrix( matrix( scan("example.dat", what=character()), 5,5),
>>     file="fil.out")
>>
>> I tested it up to a 5000 x 5000 file:
>>
>>> exampl <- array(letters[1:25], dim=c(5000,5000))
>>> MASS::write.matrix( matrix( scan("example.dat", what=character()),
>>>     5000,5000), file="fil.out")
>> Read 25000000 items
>>
>> Not sure of the exact timing; probably 5-10 minutes. The exampl object
>> takes 200,001,400 bytes and did not noticeably stress my machine. Most of
>> my RAM remains untouched. I'm going out on errands and will run timing on
>> a 10K x 10K test case within a system.time() enclosure. scan did report
>> successfully reading 100000000 items fairly promptly.
>
>> system.time( {MASS::write.matrix( matrix( scan("example.dat",
>>     what=character()), 10000,10000), file="fil.out") } )
> Read 100000000 items
>     user   system  elapsed
>  487.100  912.613 1415.228
>
>> system.time( {MASS::write.matrix( matrix( scan("example.dat",
>>     what=character()), 500,500), file="fil.out") } )
> Read 250000 items
>    user  system elapsed
>   1.184   2.507   3.834
>
> And so it seems to scale linearly:
>
>> 3.834 * 100000000/250000
> [1] 1533.6
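For reference, here is the same pipeline written out as a small helper
function (a sketch only; the function name and arguments are illustrative,
not from the thread). The trick is that scan() returns the file's values in
row order while matrix() fills column-wise, so the result is already the
transpose:

## sketch: nrow_in x ncol_in is the shape of the *input* file
transpose_via_scan <- function(infile, outfile, nrow_in, ncol_in) {
  ## filling an ncol_in x nrow_in matrix column-wise with the scanned
  ## values yields the transpose without an explicit call to t()
  m <- matrix(scan(infile, what = character()),
              nrow = ncol_in, ncol = nrow_in)
  MASS::write.matrix(m, file = outfile, sep = "\t")
}

## e.g., transpose_via_scan("example.dat", "fil.out", 10000, 10000)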
However, another posting today reminds us that this would best be attempted
in a version of R that can handle matrices with more than 2^31-1 elements:

> 10000^2 <= 2^31-1
[1] TRUE
> 60000^2 <= 2^31-1
[1] FALSE

R 3.0 is scheduled for release soon, and you can compile it from sources if
your machine is properly equipped. It supports long vectors (more than
2^31-1 elements), and I _think_ that may extend to such larger matrices.

-- 
David.

>
>> -- 
>> David.
>>
>>> # and read...
>>> d <- scan("example.dat", what=character())
>>> d <- array(d, dim=c(5,5))
>>>
>>> t(exampl) == d
>>>
>>> Although this is probably faster, it doesn't help with the large size.
>>> You could use the n option of scan to read chunks/blocks and feed those
>>> to, for example, an ff array (which you ideally have preallocated).
>>>
>>> HTH,
>>>
>>> Jan
>>>
>>> peter dalgaard <pda...@gmail.com> wrote:
>>>
>>>> On Mar 7, 2013, at 01:18 , Yao He wrote:
>>>>
>>>>> Dear all:
>>>>>
>>>>> I have a big data file of 60000 columns and 60000 rows like that:
>>>>>
>>>>> AA AC AA AA .......AT
>>>>> CC CC CT CT.......TC
>>>>> ..........................
>>>>> .........................
>>>>>
>>>>> I want to transpose it, and the output is a new file like that:
>>>>>
>>>>> AA CC ............
>>>>> AC CC............
>>>>> AA CT.............
>>>>> AA CT.........
>>>>> ....................
>>>>> ....................
>>>>> AT TC.............
>>>>>
>>>>> The key point is I can't read it into R by read.table() because the
>>>>> data is too large, so I tried that:
>>>>>
>>>>> c <- file("silygenotype.txt", "r")
>>>>> geno_t <- list()
>>>>> repeat{
>>>>>   line <- readLines(c, n=1)
>>>>>   if (length(line)==0) break  # end of file
>>>>>   line <- unlist(strsplit(line, "\t"))
>>>>>   geno_t <- cbind(geno_t, line)
>>>>> }
>>>>> write.table(geno_t, "xxx.txt")
>>>>>
>>>>> It works but it is too slow. How can I optimize it?
>>>>
>>>> As others have pointed out, that's a lot of data!
>>>>
>>>> You seem to have the right idea: if you read the file line by line and
>>>> store each line as a column, there is nothing left to transpose. A
>>>> couple of points, though:
>>>>
>>>> - The cbind() is a potential performance hit since it copies the list
>>>>   every time around. Use geno_t <- vector("list", 60000) and then
>>>>   geno_t[[i]] <- <etc>
>>>>
>>>> - You might use scan() instead of readLines + strsplit.
>>>>
>>>> - Perhaps consider the data type, as you seem to be reading strings
>>>>   with 16 possible values. (I suspect that R already optimizes string
>>>>   storage to make this point moot, though.)
>>>>
>>>> -- 
>>>> Peter Dalgaard, Professor
>>>> Center for Statistics, Copenhagen Business School
>>>> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
>>>> Phone: (+45)38153501
>>>> Email: pd....@cbs.dk  Priv: pda...@gmail.com
>>>>
>>>> ______________________________________________

> snipped

> David Winsemius
> Alameda, CA, USA

David Winsemius
Alameda, CA, USA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
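For reference, rough sketches of the two optimizations proposed in the
thread. First, Peter's preallocation point applied to the original loop (a
sketch: growing a list with cbind() copies it on every iteration, while a
preallocated list is filled in place):

geno_t <- vector("list", 60000)        # preallocate once
con <- file("silygenotype.txt", "r")
i <- 1
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break         # end of file
  geno_t[[i]] <- strsplit(line, "\t")[[1]]
  i <- i + 1
}
close(con)

Second, Jan's suggestion of chunked reading with scan(n=) feeding a
preallocated ff array. This is a sketch under stated assumptions: it needs
the 'ff' package, the block size is illustrative, and geno_levels is a
hypothetical coding of the 16 genotype strings as small integers so they fit
a "byte" ff matrix; it has not been tested at full scale.

library(ff)

nr <- 60000                            # rows in the input file (assumed)
nc <- 60000                            # columns in the input file (assumed)
geno_levels <- c("AA","AC","AG","AT","CA","CC","CG","CT",
                 "GA","GC","GG","GT","TA","TC","TG","TT")

## preallocate a file-backed nc x nr byte matrix to hold the transpose
out <- ff(vmode = "byte", dim = c(nc, nr))

con <- file("silygenotype.txt", "r")
rows_per_block <- 100                  # tune to available RAM
i <- 0
repeat {
  ## scan(n=) reads a fixed number of items, so only one block of
  ## input rows is ever held in memory
  block <- scan(con, what = character(), n = rows_per_block * nc,
                quiet = TRUE)
  if (length(block) == 0) break
  k <- length(block) / nc              # rows actually read in this block
  ## input rows arrive in row order; the column-wise fill of matrix()
  ## turns each input row into an output column
  out[, (i + 1):(i + k)] <- matrix(match(block, geno_levels),
                                   nrow = nc, ncol = k)
  i <- i + k
}
close(con)

A final pass over out in row blocks, mapping the integer codes back through
geno_levels and appending to a text file with write.table(append=TRUE),
would then produce the transposed file without ever holding it all in RAM.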