On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:

Thanks, Ben.

The matrix is a pure numeric matrix (6x700000, 31 MB).
I tried colClasses='numeric' as well as nrows=7 (one of those rows is the
header line) on the matrix.
I also tested it without setting either option in read.delim().


A couple of things come to mind.

First, I have not read the internals of scan, but suspect that parsing a really long line may be slowing things down.

Since you are attempting to read in a numeric matrix, you can simply do a global replacement of your delimiter with a newline and use scan on the result. On unix-like systems, something like

        tmp <- scan( pipe( 'tr "\t" "\n"  < test_data.txt' ) )

ought to help.
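
Note that scan() returns one long numeric vector, so the result still has to be
reshaped into a matrix. A minimal sketch, assuming the header line is skipped
with tail first (that detail, and the 6-row shape, are my assumptions about
your file):

        # skip the header line, turn each tab into a newline, and read the
        # values as one long numeric vector (row-major order)
        tmp <- scan( pipe( 'tail -n +2 test_data.txt | tr "\t" "\n"' ) )
        # rebuild the 6 x 700000 matrix; byrow = TRUE because the values
        # arrive one original row at a time
        m <- matrix( tmp, nrow = 6, byrow = TRUE )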

Second, once each line has been processed, the memory it occupies is spread over the full 32MB (or 3.2 GB for the 600 by 700000 version) region of the result. I am guessing that this is causing your cache to work hard to put it in place.

If you really want the result to be a 600 by 700000 matrix, you might try to read it in smaller blocks using scan( pipe( "cut ... " ) ) to feed selected blocks of columns of your text file to R.
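
For instance, a rough sketch of that block-wise approach (the 100000-column
block width, the nrow = 6 shape, and using skip = 1 to drop the header line are
all assumptions on my part, not anything from your post):

        ## read 100000 columns at a time with cut, then bind the blocks together
        block.starts <- seq( 1, 700000, by = 100000 )
        blocks <- lapply( block.starts, function( start ) {
            end <- min( start + 100000 - 1, 700000 )
            cmd <- sprintf( "cut -f%d-%d test_data.txt", start, end )
            ## skip = 1 drops the header line; byrow = TRUE restores row order
            matrix( scan( pipe( cmd ), skip = 1 ), nrow = 6, byrow = TRUE )
        } )
        tmp <- do.call( cbind, blocks )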

HTH,

Chuck



Here is the time spent on reading the matrix for each test.

system.time( tmp <- read.delim("test_data.txt"))
    user    system   elapsed
50985.421    27.665 51013.384

system.time(tmp <- read.delim("test_data.txt", colClasses="numeric",
                              nrows=7, comment.char=""))
    user    system   elapsed
51301.563    60.491 51362.208

It seems setting the options does not speed up the reading at all.
Is it because of the header line? I will test it.
Did I misunderstand something?

One additional and interesting observation:
The call with the options does save a lot of memory, though: it took ~150 MB,
while the other took ~4 GB to read the matrix.

I will try scan() and see if it helps.

Thanks!
Mike


-----Original Message-----
From: Benilton Carvalho [mailto:bcarv...@jhsph.edu]
Sent: Wednesday, September 23, 2009 4:56 PM
To: Ping-Hsun Hsieh
Cc: r-help@r-project.org
Subject: Re: [R] read.delim very slow in reading files with lots of columns

Use the 'colClasses' argument; you can also set 'nrows'.
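
For instance (the explicit column count, nrows value, and comment.char setting
below are only illustrative guesses about the file):

        ## declare every column numeric up front and cap the number of data rows read
        tmp <- read.delim( "test_data.txt",
                           colClasses = rep( "numeric", 700000 ),
                           nrows = 6, comment.char = "" )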

b

On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:

Hi,



I am trying to read a tab-delimited file into R (version 2.8). The
machine I am using is a 64-bit Linux box with 16 GB of RAM.

The file is basically a matrix (~600x700000) and is about 3 GB in size.



read.delim() ran extremely slowly (hours) even on a subset of
the file (31 MB, 6x700000).

I monitored memory usage and found it consistently used less
than 1% of the 16 GB.

Does read.delim() have difficulty reading files with lots of columns?

Any suggestions?



Thanks,

Mike







Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
