or that! :-D thanks jim.
b

On Sep 25, 2009, at 3:57 PM, jim holtman wrote:

Here is how long it took to read a file with 10 lines and 700,000
comma-separated columns per line:

system.time(input <- scan("/tempxx.txt", what=0, sep=','))
Read 7000000 items
  user  system elapsed
 15.62    0.22   15.84
object.size(input)
56000024 bytes


'scan' should be sufficient and it will not take another 10 minutes in awk.
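scan() returns one long vector rather than a matrix, so a reshaping step is still needed afterwards. A minimal sketch, assuming the number of lines is known in advance; the temporary file, its contents and the variable names are made up for illustration and are not from the original post:

```r
# Small stand-in for the real file: 3 lines x 5 comma-separated columns.
path <- tempfile(fileext = ".txt")
writeLines(c("1,2,3,4,5", "6,7,8,9,10", "11,12,13,14,15"), path)

n_rows <- 3  # assumed to be known (10 in the timing above)

# scan() reads the values line by line into one numeric vector.
values <- scan(path, what = 0, sep = ",", quiet = TRUE)

# byrow = TRUE restores the on-disk layout: one file line per matrix row.
m <- matrix(values, nrow = n_rows, byrow = TRUE)
dim(m)
```

byrow = TRUE matters here: without it R would fill the matrix column by column and scramble the rows.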

On Fri, Sep 25, 2009 at 1:17 PM, Charles C. Berry <cbe...@tajo.ucsd.edu > wrote:
On Fri, 25 Sep 2009, Ping-Hsun Hsieh wrote:

Thanks, Ben.

The matrix is purely numeric (6x700000, 31 MB).
I tried colClasses='numeric' together with nrows=7 (one of the seven lines
is the header) on the matrix.
I also tested read.delim() without setting either option.


A couple of things come to mind.

First, I have not read the internals of scan, but suspect that parsing a
really long line may be slowing things down.

Since you are attempting to read in a numeric matrix, you can simply do a global replacement of your delimiter with a newline and use scan on the
result. On unix-like systems, something like

      tmp <- scan( pipe( 'tr "\t" "\n"  < test_data.txt' ) )

ought to help.
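If the file has a header line, as mentioned elsewhere in this thread, the header has to be dropped before the values reach scan(), and the vector has to be reshaped afterwards. A rough sketch under those assumptions, for Unix-like systems only; the temporary file and the use of tail are illustrative additions, not part of the suggestion above:

```r
# Hypothetical stand-in for test_data.txt: a header line plus 2 numeric rows.
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), path)

# Read only the header line to get the column names and the column count.
header <- scan(path, what = "", sep = "\t", nlines = 1, quiet = TRUE)

# Drop the header with tail, then turn every tab into a newline so that
# scan() sees one value per short line.
cmd <- sprintf("tail -n +2 %s | tr '\\t' '\\n'", shQuote(path))
con <- pipe(cmd)
values <- scan(con, quiet = TRUE)
close(con)

# Rebuild the matrix: one file line per matrix row.
m <- matrix(values, ncol = length(header), byrow = TRUE,
            dimnames = list(NULL, header))
```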

Second, the memory occupied by each line - once it has been processed - is spread over the full 32MB (or 3.2 GB for the 600 by 700000 version) region of memory. I am guessing that this is causing your cache to work hard to put
it in place.

If you really want the result to be a 600 by 700000 matrix, you might try to read it in smaller blocks using scan( pipe( "cut ... " ) ) to feed selected
blocks of columns of your text file to R.
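A sketch of what the block-wise reading could look like, assuming a tab-delimited file with one header line and a Unix-like system with tail and cut available. The helper read_block(), the block size and the toy file are hypothetical and only illustrate the idea:

```r
# Toy stand-in for the real file: header plus 2 rows x 5 columns.
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\td\te",
             "1\t2\t3\t4\t5",
             "6\t7\t8\t9\t10"), path)

# Read columns first..last (1-based) of the data lines as a numeric matrix.
read_block <- function(path, first, last, n_rows) {
  cmd <- sprintf("tail -n +2 %s | cut -f%d-%d", shQuote(path), first, last)
  con <- pipe(cmd)
  on.exit(close(con))
  values <- scan(con, what = 0, sep = "\t", quiet = TRUE)
  matrix(values, nrow = n_rows, byrow = TRUE)
}

n_rows <- 2
n_cols <- 5
block_size <- 2  # columns per block; far larger for a real file

# Read each block of columns separately, then glue the blocks together.
starts <- seq(1, n_cols, by = block_size)
blocks <- lapply(starts, function(s) {
  read_block(path, s, min(s + block_size - 1, n_cols), n_rows)
})
m <- do.call(cbind, blocks)
```

Each call rereads the text file from the start, so the block size trades the number of passes over the file against the width of the lines scan() has to parse.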

HTH,

Chuck



Here is the time spent on reading the matrix for each test.

system.time( tmp <- read.delim("test_data.txt"))

  user    system   elapsed
50985.421    27.665 51013.384

system.time(tmp <- read.delim("test_data.txt", colClasses="numeric", nrows=7, comment.char=""))

  user    system   elapsed
51301.563    60.491 51362.208

It seems setting the options does not speed up the reading at all.
Is it because of the header line? I will test it.
Did I misunderstand something?

One additional and interesting observation:
The run with the options set saves a lot of memory. It took ~150 MB, while
the other took ~4 GB to read the matrix.

I will try the scan() and see if it helps.

Thanks!
Mike


-----Original Message-----
From: Benilton Carvalho [mailto:bcarv...@jhsph.edu]
Sent: Wednesday, September 23, 2009 4:56 PM
To: Ping-Hsun Hsieh
Cc: r-help@r-project.org
Subject: Re: [R] read.delim very slow in reading files with lots of
columns

use the 'colClasses' argument and you can also set 'nrows'.
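For reference, a small sketch of what such a call could look like. The temporary file and its contents are invented for illustration; note that nrows counts data rows, so the header line is not included in it:

```r
# Toy tab-delimited file: one header line plus 2 numeric rows.
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc", "1\t2\t3", "4\t5\t6"), path)

# colClasses = "numeric" is recycled over all columns, so read.delim() does
# not have to guess each column's type; nrows gives the number of data rows
# to read; comment.char = "" switches off comment handling.
tmp <- read.delim(path, colClasses = "numeric", nrows = 2, comment.char = "")
str(tmp)
```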

b

On Sep 23, 2009, at 8:24 PM, Ping-Hsun Hsieh wrote:

Hi,



I am trying to read a tab-delimited file into R (Ver. 2.8). The
machine I am using is 64bit Linux with 16 GB.

The file is basically a matrix (~600x700000), about 3 GB in size.



read.delim() ran extremely slowly (hours), even with a subset of
the file (31 MB, 6x700000).

I monitored the memory usage and found that it never went above
1% of the 16 GB of memory.

Does read.delim() have difficulty reading files with lots of columns?

Any suggestions?



Thanks,

Mike





______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901





--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
