Here is how long it took to read a file with 10 lines and 700,000 comma-separated columns per line:
> system.time(input <- scan("/tempxx.txt", what=0, sep=','))
Read 7000000 items
   user  system elapsed
  15.62    0.22   15.84
> object.size(input)
56000024 bytes

scan() should be sufficient, and it will not take another 10 minutes in awk.

On Fri, Sep 25, 2009 at 1:17 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
> On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:
>
>> Thanks, Ben.
>>
>> The matrix is a pure numeric matrix (6x700000, 31 MB).
>> I tried colClasses='numeric' as well as nrows=7 (one of these is the
>> header line) on the matrix.
>> I also tested it without setting the two options in read.delim().
>
> A couple of things come to mind.
>
> First, I have not read the internals of scan, but I suspect that parsing
> a really long line may be slowing things down.
>
> Since you are attempting to read in a numeric matrix, you can simply do a
> global replacement of your delimiter with a newline and use scan on the
> result. On unix-like systems, something like
>
>   tmp <- scan( pipe( 'tr "\t" "\n" < test_data.txt' ) )
>
> ought to help.
>
> Second, the memory occupied by each line - once it has been processed -
> is spread over the full 32 MB (or 3.2 GB for the 600 by 700000 version)
> region of memory. I am guessing that this is making your cache work hard
> to put it in place.
>
> If you really want the result to be a 600 by 700000 matrix, you might try
> to read it in smaller blocks, using scan( pipe( "cut ... " ) ) to feed
> selected blocks of columns of your text file to R.
>
> HTH,
>
> Chuck
>
>> Here is the time spent reading the matrix in each test.
>>
>>> system.time( tmp <- read.delim("test_data.txt"))
>>
>>      user    system   elapsed
>> 50985.421    27.665 51013.384
>>
>>> system.time(tmp <-
>>> read.delim("test_data.txt",colClasses="numeric",nrows=7,comment.char=""))
>>
>>      user    system   elapsed
>> 51301.563    60.491 51362.208
>>
>> It seems setting the options does not speed up the reading at all.
>> Is it because of the header line? I will test it.
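[Editor's note: both of Chuck's shell tricks above can be sketched on a toy file. The 3x4 wide.txt below is a made-up stand-in for the real test_data.txt; file names and column ranges are illustrative only.]

```shell
# Made-up stand-in for the real file: 3 rows x 4 tab-separated columns
printf '1\t2\t3\t4\n5\t6\t7\t8\n9\t10\t11\t12\n' > wide.txt

# Trick 1: turn every tab into a newline so each value sits on its own
# short line; R would then read it with scan(pipe('tr "\t" "\n" < wide.txt'))
# and never has to parse one enormous 700,000-field line.
tr '\t' '\n' < wide.txt > long.txt
wc -l < long.txt        # 12 lines, one value per line

# Trick 2: feed R only a block of columns at a time; R would read this
# slice with something like scan(pipe("cut -f1-2 wide.txt")), repeating
# for each block of columns.
cut -f1-2 wide.txt      # just columns 1 and 2 of each row
```

Values arrive from scan() in row-major order, so reshaping back into a matrix with matrix(tmp, ncol=..., byrow=TRUE) would be the natural follow-up.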
>> Did I misunderstand something?
>>
>> One additional and interesting observation:
>> the run with the options set saves a lot of memory. It took ~150 MB,
>> while the other took ~4 GB to read the matrix.
>>
>> I will try scan() and see if it helps.
>>
>> Thanks!
>> Mike
>>
>> -----Original Message-----
>> From: Benilton Carvalho [mailto:bcarv...@jhsph.edu]
>> Sent: Wednesday, September 23, 2009 4:56 PM
>> To: Ping-Hsun Hsieh
>> Cc: r-help@r-project.org
>> Subject: Re: [R] read.delim very slow in reading files with lots of columns
>>
>> Use the 'colClasses' argument, and you can also set 'nrows'.
>>
>> b
>>
>> On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:
>>
>>> Hi,
>>>
>>> I am trying to read a tab-delimited file into R (version 2.8). The
>>> machine I am using is 64-bit Linux with 16 GB of memory.
>>>
>>> The file is basically a matrix (~600x700000) and is as large as 3 GB.
>>>
>>> read.delim() ran extremely slowly (hours), even on a subset of the
>>> file (31 MB, 6x700000).
>>>
>>> I monitored the memory usage and found it constantly used less than
>>> 1% of the 16 GB of memory.
>>>
>>> Does read.delim() have difficulty reading files with lots of columns?
>>> Any suggestions?
>>>
>>> Thanks,
>>> Mike
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?