[Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread victor jimenez
Sometimes I have hundreds of CSV files scattered across a directory tree,
produced by experiment runs. For instance, to give an example from my
field, I may want to collect the performance of a processor for several
design parameters such as cache size (possible values: 2, 4, 8 and 16) and
cache associativity (possible values: direct-mapped, 4-way,
fully-associative). The results of all these experiments are stored in a
directory tree like:

results
  |-- direct-mapped
  |   |-- 2 -- data.csv
  |   |-- 4 -- data.csv
  |   |-- 8 -- data.csv
  |   |-- 16 -- data.csv
  |-- 4-way
  |   |-- 2 -- data.csv
  |   |-- 4 -- data.csv
...
  |-- fully-associative
  |   |-- 2 -- data.csv
  |   |-- 4 -- data.csv
...

I am developing a package that lets me gather all those CSV files into
a single data frame. Currently, I just need to execute the following
statement:

dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")

and this command returns a data frame containing the columns ASSOC, SIZE
and all the remaining columns inside the CSV files (in my case the
processor performance), effectively loading all the CSV files into a single
data frame. So, I would get something like:

ASSOC,  SIZE, PERF
direct-mapped,   2, 1.4
direct-mapped,   4, 1.6
direct-mapped,   8, 1.7
direct-mapped, 16, 1.7
4-way,   2, 1.4
4-way,   4, 1.5
...
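
For illustration only, here is a minimal sketch of how a gather()-like helper could be written in base R. It is not the package code mentioned above. It assumes that every @NAME@ placeholder occupies exactly one path component, that paths use "/" as the separator, and that all matched CSV files share the same columns.

```r
# Hypothetical gather()-like helper: a sketch, not the package discussed above.
# Assumes each @NAME@ placeholder spans exactly one path component and that
# the template uses "/" as the separator.
gather <- function(template) {
  parts     <- strsplit(template, "/", fixed = TRUE)[[1]]
  is_var    <- grepl("^@.+@$", parts)
  var_names <- gsub("@", "", parts[is_var], fixed = TRUE)

  # Turn every placeholder into "*" to obtain a glob pattern.
  glob  <- paste(ifelse(is_var, "*", parts), collapse = "/")
  paths <- Sys.glob(glob)

  frames <- lapply(paths, function(path) {
    comps <- strsplit(path, "/", fixed = TRUE)[[1]]
    keys  <- as.list(comps[is_var])
    names(keys) <- var_names
    # The one-row key frame is recycled to the number of rows in the CSV.
    cbind(as.data.frame(keys, stringsAsFactors = FALSE), read.csv(path))
  })

  # Returns NULL when no file matches the template.
  do.call(rbind, frames)
}
```

With this sketch the key columns come back as character vectors, so SIZE would still need as.integer() if numeric ordering matters.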

I would like to ask whether there is any similar functionality already
implemented in R. If so, there is no need to reinvent the wheel :)
If it is not implemented and the R community believes that this feature
would be useful, I would be glad to contribute my code.

Thank you,
Victor

P.S: I was not sure whether to submit this question to R-devel or R-help,
but since it may lead to some programming discussion I decided to post it
to R-devel. Please, let me know if it is better to move it to the other
list.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread Gabor Grothendieck
If your CSV files all have the same columns and represent time series,
then read.zoo in the zoo package can read multiple CSV files at once
with a single read.zoo command, producing a single zoo object.

library(zoo)
?read.zoo
vignette("zoo-read")

Also see the other zoo vignettes and help files.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com


Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread victor jimenez
First of all, thank you for the answers. I did not know about zoo. However,
it seems that none of the approaches does exactly what I want (please correct
me if I am wrong).

Perhaps my original question was not clear. The CSV files contain only
the performance values. The other two columns (ASSOC and SIZE) are
derived from the directory names in the tree. So, in my opinion,
none of the proposed solutions would work unless every single data.csv
file contained all three columns (ASSOC, SIZE and PERF).

In my case, my experimentation framework basically outputs a CSV with some
values read from the processor's performance counters (PMCs). For each
cache size and associativity I conduct an experiment, creating a CSV file
and placing that file into its own directory. I could modify the
experimentation framework so that it also outputs the cache size and
associativity, but that may not be ideal in some circumstances. I also
have a significant amount of old results that I want to keep using without
manually fixing the CSV files.

Has anyone else faced such a situation? Any good solutions?

Thank you,
Victor


Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread Cook, Malcolm
Victor,

I understand you as follows:

The first two columns of the desired combined data frame are the last two
directory levels of the pathname to the CSV file.

The columns in all the data.csv files are the same; namely, there is only
one column, and it is named PERF.

If so, the following should work (on Unix):

do.call(rbind, lapply(Sys.glob('results/*/*/data.csv'), function(path) {
  within(read.csv(path), {
    SIZE <- basename(dirname(path))
    ASSOC <- basename(dirname(dirname(path)))
  })
}))
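
The same idea can be unrolled into named steps, which makes the path handling easier to follow. This is only a sketch. read_tagged() and gather_results() are hypothetical names, SIZE is converted with as.integer() on the assumption that the size directories have purely numeric names, and the explicit sort is there because the order returned by Sys.glob is system-specific and usually collates "16" before "2".

```r
# Sketch only: read_tagged() and gather_results() are hypothetical names.
read_tagged <- function(path) {
  size_dir  <- basename(dirname(path))           # parent directory, e.g. "16"
  assoc_dir <- basename(dirname(dirname(path)))  # grandparent, e.g. "4-way"
  data.frame(ASSOC = assoc_dir,
             SIZE  = as.integer(size_dir),       # assumes numeric directory names
             read.csv(path),
             stringsAsFactors = FALSE)
}

gather_results <- function(root = "results") {
  paths <- Sys.glob(file.path(root, "*", "*", "data.csv"))
  if (length(paths) == 0) return(NULL)
  result <- do.call(rbind, lapply(paths, read_tagged))
  # Sort numerically by SIZE within each ASSOC instead of relying on glob order.
  result <- result[order(result$ASSOC, result$SIZE), ]
  rownames(result) <- NULL
  result
}
```

Building the data frame explicitly also fixes the column order to ASSOC, SIZE, then the CSV columns, matching the layout in the original question.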


Re: [Rd] loading multiple CSV files into a single data frame

2012-05-03 Thread Simon Urbanek

You don't need to touch the CSV files, simply add values at load time - this is
all easily doable in one line ;)

> do.call(rbind, lapply(Sys.glob("*/*/data.csv"), function(d)
+   cbind(read.csv(d), as.data.frame(t(strsplit(d, "/")[[1]])))))
  A B V1 V2   V3
1 1 2  1  a data.csv
2 3 4  1  a data.csv
3 1 2  1  b data.csv
4 3 4  1  b data.csv
5 1 2  2  a data.csv
6 3 4  2  a data.csv
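
To try the one-liner, a fixture along the following lines should give a result of the same shape. This is a sketch: the directory names 1/a, 1/b and 2/a and the A/B values are inferred from the output above, the temporary location is arbitrary, and LEVEL1 and LEVEL2 are hypothetical column names chosen for illustration.

```r
# Fixture inferred from the output above: directories 1/a, 1/b and 2/a, each
# holding a data.csv with columns A and B. The temporary location is arbitrary.
root <- tempfile("fixture")
for (d in c("1/a", "1/b", "2/a")) {
  dir.create(file.path(root, d), recursive = TRUE)
  write.csv(data.frame(A = c(1, 3), B = c(2, 4)),
            file.path(root, d, "data.csv"), row.names = FALSE)
}

# Run the one-liner from inside the fixture so the glob is relative.
old <- setwd(root)
combined <- do.call(rbind, lapply(Sys.glob("*/*/data.csv"), function(d)
  cbind(read.csv(d), as.data.frame(t(strsplit(d, "/")[[1]])))))
setwd(old)

# Drop the constant file-name column and give the path columns real names.
combined$V3 <- NULL
names(combined)[names(combined) == "V1"] <- "LEVEL1"
names(combined)[names(combined) == "V2"] <- "LEVEL2"
```

Renaming after the fact keeps the one-liner unchanged. For the cache experiments the two path columns would be named ASSOC and SIZE instead.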

