Re: [R] Slow reading multiple tick data files into list of dataframes

2010-10-11 Thread Gabor Grothendieck
On Mon, Oct 11, 2010 at 5:39 PM, rivercode aqua...@gmail.com wrote:

 Hi,

 I am trying to find the best way to read 85 tick data files of format:

 head(nbbo)
 1 bid  CON  09:30:00.722    09:30:00.722  32.71   98
 2 ask  CON  09:30:00.782    09:30:00.810  33.14  300
 3 ask  CON  09:30:00.809    09:30:00.810  33.14  414
 4 bid  CON  09:30:00.783    09:30:00.810  33.06  200

 Each file has between 100,000 and 300,300 rows.

 Currently doing   nbbo.list <- lapply(filePath, read.csv)    to create a list
 with 85 data.frame objects...but it is taking minutes to read in the data
 and afterwards I get the following message on the console when taking
 further actions (though it does then stop):

    The R Engine is busy. Please wait, and try your command again later.

 filePath in the above example is a vector of filenames:
 head(filePath)
 [1] C:/work/A/A_2010-10-07_nbbo.csv
 [2] C:/work/AAPL/AAPL_2010-10-07_nbbo.csv
 [3] C:/work/ADBE/ADBE_2010-10-07_nbbo.csv
 [4] C:/work/ADI/ADI_2010-10-07_nbbo.csv

 Is there a better/quicker or more R way of doing this ?


You could try (possibly with suitable additional arguments):

library(sqldf)
lapply(filePath, read.csv.sql)
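
As a rough illustration of what "suitable additional arguments" might look like, here is a minimal sketch. The helper name read_ticks_sql and the defaults header = FALSE and sep = "," are assumptions about the files, not something stated in the thread. read.csv.sql loads each file into a temporary SQLite database and returns the query result as a data frame.

```r
# Hypothetical helper (not from the thread): read each file via sqldf.
# header = FALSE and sep = "," are assumptions about the file layout.
read_ticks_sql <- function(paths, header = FALSE, sep = ",") {
  if (!requireNamespace("sqldf", quietly = TRUE)) {
    stop("package 'sqldf' is required for this sketch")
  }
  out <- lapply(paths, function(p) {
    # 'file' in the SQL statement refers to the file being read
    sqldf::read.csv.sql(p, sql = "select * from file",
                        header = header, sep = sep)
  })
  names(out) <- basename(paths)  # label each data frame by its file name
  out
}

# Usage, with filePath as in the original post:
# nbbo.list <- read_ticks_sql(filePath)
```

Whether this is faster than read.csv will depend on the machine and the files, so it is worth timing both on one file first.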

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Slow reading multiple tick data files into list of dataframes

2010-10-11 Thread Mike Marchywka
 Date: Mon, 11 Oct 2010 14:39:54 -0700
 From: aqua...@gmail.com
 To: r-help@r-project.org
 Subject: [R] Slow reading multiple tick data files into list of dataframes
[...]
 Is there a better/quicker or more R way of doing this ?

While there may be an obvious R-specific answer, it usually helps to
determine where the bottleneck is in terms of resources on your
platform. Often on older machines you run out of physical memory, and
then most of the time is spent paging data out to virtual memory on
disk and back. Can you tell from Task Manager whether you are CPU
limited or memory limited?

It could in fact be that the best solution involves not trying
to hold your entire data set in memory at once; it is hard to know
without knowing your platform, etc. In the past, I've found that
sorting the data first, a slow process in itself, can speed things up
a lot because the later analysis thrashes the memory hierarchy less.
I doubt that helps your immediate problem, but it does point to one
possible non-obvious optimization, depending on what is slowing you
down.
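
To make the "not everything in memory at once" idea concrete, here is a minimal base-R sketch. The helper name summarise_ticks and the header = FALSE default are assumptions for illustration. It reads one file at a time and keeps only a one-row summary, so each file's data frame can be garbage-collected before the next one is read. The reported size also gives a rough idea of how much memory a full list of data frames would need.

```r
# Hypothetical helper: read files one by one and keep only a small summary.
# header = FALSE is an assumption about the file layout.
summarise_ticks <- function(paths, header = FALSE) {
  rows <- lapply(paths, function(p) {
    d <- read.csv(p, header = header, stringsAsFactors = FALSE)
    # Only this one-row summary survives; 'd' is dropped when we return.
    data.frame(file    = basename(p),
               n_rows  = nrow(d),
               size_mb = as.numeric(object.size(d)) / 1e6,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage, with filePath as in the original post:
# summarise_ticks(filePath)
```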



 Thanks,
 Chris



Re: [R] Slow reading multiple tick data files into list of dataframes

2010-10-11 Thread jim holtman
For 100,000 rows, it took about 2 seconds to read it in on my system:

> system.time(x <- read.table('/recv/test.txt', as.is=TRUE))
   user  system elapsed
   1.92    0.08    2.08
> str(x)
'data.frame':   196588 obs. of  7 variables:
 $ V1: int  1 2 3 4 1 2 3 1 2 3 ...
 $ V2: chr  "bid" "ask" "ask" "bid" ...
 $ V3: chr  "CON" "CON" "CON" "CON" ...
 $ V4: chr  "09:30:00.722" "09:30:00.782" "09:30:00.809" "09:30:00.783" ...
 $ V5: chr  "09:30:00.722" "09:30:00.810" "09:30:00.810" "09:30:00.810" ...
 $ V6: num  32.7 33.1 33.1 33.1 32.7 ...
 $ V7: int  98 300 414 200 98 300 414 98 300 414 ...
> object.size(x)
6291928 bytes
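
A common way to trim read time further is to declare the column types up front, so read.table does not have to guess them. This is a minimal sketch, assuming the seven-column layout shown by str(x) above, comma-separated files, and no header row. The names tick_classes and read_ticks are made up for illustration.

```r
# Column types matching the str(x) output: int, 4 x chr, num, int.
tick_classes <- c("integer", "character", "character",
                  "character", "character", "numeric", "integer")

# Hypothetical helper: read one tick file with declared column types.
# sep = "," and header = FALSE are assumptions about the files.
read_ticks <- function(path, sep = ",") {
  read.table(path, sep = sep, header = FALSE,
             colClasses = tick_classes,  # skip type guessing
             comment.char = "",          # do not scan for comments
             quote = "\"")
}

# Usage: time a single file before committing to all 85.
# system.time(x <- read_ticks(filePath[1]))
```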


Given that you have about 85 files, I would guess that you would need
about 800MB if all were 300K lines long.  You might be getting memory
fragmentation.  You might try calling gc() every so often in the loop.
What are you going to do with the data?  Are you going to make one big
file?  In that case you might want a 64-bit version, since you will
have a single instance of about 800MB and will probably need 2-3X that
much memory if copies are being made during processing.  Objects might
be larger in 64-bit.
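
The estimate can be reproduced from the numbers in the transcript above, and the gc() suggestion can be sketched as a plain loop. This is only an illustration: the helper name read_with_gc and the every = 10 interval are assumptions, and whether periodic gc() calls help at all will depend on the system.

```r
# Scale the measured object size up to the worst case in the post.
bytes_per_row <- 6291928 / 196588         # object.size(x) / rows in x
est_bytes <- bytes_per_row * 300000 * 85  # 85 files of 300K rows each
est_mb <- est_bytes / 1e6                 # in the region of 800 MB

# Hypothetical helper: read files in a loop, collecting garbage
# every 'every' files. Extra arguments are passed on to read.csv.
read_with_gc <- function(paths, every = 10, ...) {
  out <- vector("list", length(paths))
  for (i in seq_along(paths)) {
    out[[i]] <- read.csv(paths[i], ...)
    if (i %% every == 0) invisible(gc())
  }
  names(out) <- basename(paths)
  out
}

# Usage, with filePath as in the original post:
# nbbo.list <- read_with_gc(filePath, header = FALSE)
```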

Maybe you need to follow Gabor's advice and read it into a database
and then process it from there.

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
