[Rd] loading multiple CSV files into a single data frame
Sometimes I have hundreds of CSV files scattered in a directory tree, resulting from experiments' executions. For instance, giving an example from my field, I may want to collect the performance of a processor for several design parameters such as cache size (possible values: 2, 4, 8 and 16) and cache associativity (possible values: direct-mapped, 4-way, fully-associative). The results of all these experiments will be stored in a directory tree like:

results
|-- direct-mapped
|   |-- 2 -- data.csv
|   |-- 4 -- data.csv
|   |-- 8 -- data.csv
|   |-- 16 -- data.csv
|-- 4-way
|   |-- 2 -- data.csv
|   |-- 4 -- data.csv
...
|-- fully-associative
|   |-- 2 -- data.csv
|   |-- 4 -- data.csv
...

I am developing a package that would allow me to gather all those CSV files into a single data frame. Currently, I just need to execute the following statement:

    dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")

and this command returns a data frame containing the columns ASSOC, SIZE and all the remaining columns inside the CSV files (in my case the processor performance), effectively loading all the CSV files into a single data frame. So, I would get something like:

    ASSOC, SIZE, PERF
    direct-mapped, 2, 1.4
    direct-mapped, 4, 1.6
    direct-mapped, 8, 1.7
    direct-mapped, 16, 1.7
    4-way, 2, 1.4
    4-way, 4, 1.5
    ...

I would like to ask whether there is any similar functionality already implemented in R. If so, there is no need to reinvent the wheel :)

If it is not implemented and the R community believes that this feature would be useful, I would be glad to contribute my code.

Thank you,
Victor

P.S.: I was not sure whether to submit this question to R-devel or R-help, but since it may lead to some programming discussion I decided to post it to R-devel. Please let me know if it is better to move it to the other list.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
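For readers who want to try the idea, here is a minimal sketch of how a pattern such as results/@ASSOC@/@SIZE@/data.csv could be expanded. It is not the code of the package described above, which is not shown in this thread. The function name gather_sketch is hypothetical, and the sketch assumes that each @NAME@ placeholder fills one whole path component and that paths use "/" as the separator.

```r
# Hypothetical helper, NOT the package's actual gather() implementation.
# Assumption: every @NAME@ placeholder occupies one whole path component.
gather_sketch <- function(pattern) {
  parts  <- strsplit(pattern, "/", fixed = TRUE)[[1]]
  is_key <- grepl("^@.+@$", parts)
  keys   <- gsub("@", "", parts[is_key], fixed = TRUE)

  # Turn each placeholder into "*" and let Sys.glob() find the files
  files <- Sys.glob(paste(ifelse(is_key, "*", parts), collapse = "/"))

  frames <- lapply(files, function(path) {
    dirs <- strsplit(path, "/", fixed = TRUE)[[1]]
    # Path components that sit where the placeholders were
    values <- as.list(dirs[is_key])
    names(values) <- keys
    # The one-row label frame is recycled to the number of rows in the CSV
    cbind(as.data.frame(values, stringsAsFactors = FALSE), read.csv(path))
  })
  do.call(rbind, frames)
}
```

Called as gather_sketch("results/@ASSOC@/@SIZE@/data.csv") from the directory that contains results, this would return the ASSOC and SIZE columns followed by whatever columns the CSV files hold. The label columns come back as character vectors, so SIZE would still need as.integer() if numeric values are wanted.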
Re: [Rd] loading multiple CSV files into a single data frame
On Thu, May 3, 2012 at 2:07 PM, victor jimenez betaband...@gmail.com wrote:
> [...]
> I would like to ask whether there is any similar functionality already
> implemented in R. If so, there is no need to reinvent the wheel :)

If your CSV files all have the same columns and represent time series, then read.zoo in the zoo package can read multiple CSV files at once, using a single read.zoo command and producing a single zoo object.

    library(zoo)
    ?read.zoo
    vignette("zoo-read")

Also see the other zoo vignettes and help files.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [Rd] loading multiple CSV files into a single data frame
First of all, thank you for the answers. I did not know about zoo. However, it seems that none of the proposed approaches does exactly what I want (please correct me if I am wrong). Probably my original question was not clear.

The CSV files only contain the performance values. The other two columns (ASSOC and SIZE) are taken from the directory names in the tree. So, in my opinion, none of the proposed solutions would work unless every single data.csv file contained all three columns (ASSOC, SIZE and PERF).

In my case, my experimentation framework basically outputs a CSV with some values read from the processor's performance counters (PMCs). For each cache size and associativity I conduct an experiment, creating a CSV file and placing that file into its own directory. I could modify the experimentation framework so that it also outputs the cache size and associativity, but that may not be ideal in some circumstances. I also have a significant amount of old results, and I want to keep using them without manually fixing the CSV files.

Has anyone else faced such a situation? Any good solutions?

Thank you,
Victor

On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck ggrothendi...@gmail.com wrote:
> If your csv files all have the same columns and represent time series then
> read.zoo in the zoo package can read multiple csv files in at once [...]
Re: [Rd] loading multiple CSV files into a single data frame
Victor,

I understand you as follows. The first two columns of the desired combined data frame are the last two levels of the pathname to the CSV file. The columns in all the data.csv files are the same; namely, there is only one column, and it is named PERF.

If so, the following should work (on Unix):

    do.call(rbind, lapply(Sys.glob('results/*/*/data.csv'), function(path) {
      within(read.csv(path), {
        # innermost directory is the cache size, its parent the associativity
        SIZE  <- basename(dirname(path))
        ASSOC <- basename(dirname(dirname(path)))
      })
    }))

On 5/3/12 4:40 PM, victor jimenez betaband...@gmail.com wrote:
> The CSV files only contain the performance values. The other two columns
> (ASSOC and SIZE) are obtained from the existing values in the directory tree.
> [...]
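To try the suggestion above without real measurements, the following sketch first builds a throwaway copy of the directory layout and then applies the same do.call/lapply/within pattern. The PERF values are placeholders, not experimental results, and the temporary directory is an assumption made only for the demo. The sketch also converts SIZE to an integer and selects the columns explicitly rather than relying on the order in which within() adds them.

```r
# Throwaway directory tree; PERF = 1 is a placeholder, not a measurement
root <- tempfile("results-demo")
dir.create(root)
old_wd <- setwd(root)

for (assoc in c("direct-mapped", "4-way")) {
  for (size in c(2, 4, 8)) {
    d <- file.path("results", assoc, size)
    dir.create(d, recursive = TRUE)
    write.csv(data.frame(PERF = 1), file.path(d, "data.csv"), row.names = FALSE)
  }
}

# Same pattern as in the reply: label each file's rows from its path
combined <- do.call(rbind, lapply(Sys.glob("results/*/*/data.csv"), function(path) {
  within(read.csv(path), {
    SIZE  <- as.integer(basename(dirname(path)))
    ASSOC <- basename(dirname(dirname(path)))
  })
}))

# Fix the column order and sort numerically by SIZE within each ASSOC
combined <- combined[order(combined$ASSOC, combined$SIZE),
                     c("ASSOC", "SIZE", "PERF")]

setwd(old_wd)
print(combined)
```

Converting SIZE before sorting matters because, as character strings, "16" would sort before "2".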
Re: [Rd] loading multiple CSV files into a single data frame
On May 3, 2012, at 5:40 PM, victor jimenez wrote:

> [...]
> I could modify the experimentation framework, so that it also outputs the
> cache size and associativity, but that may not be ideal in some circumstances
> and I also have a significant amount of old results and I want to keep using
> them without manually fixing the CSV files.

You don't need to touch the CSV files; simply add the values at load time. This is all easily doable in one line ;)

    do.call(rbind, lapply(Sys.glob("*/*/data.csv"), function(d)
      cbind(read.csv(d), as.data.frame(t(strsplit(d, "/")[[1]])))))

      A B V1 V2       V3
    1 1 2  1  a data.csv
    2 3 4  1  a data.csv
    3 1 2  1  b data.csv
    4 3 4  1  b data.csv
    5 1 2  2  a data.csv
    6 3 4  2  a data.csv

> Has anyone else faced such a situation? Any good solutions?
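The one-liner above leaves the path components in generic columns V1, V2, V3. As a variation, the sketch below gives them meaningful names and uses list.files() instead of Sys.glob(), which may be more convenient when the tree is not under the working directory. The function read_tree and its parameters root, col_names and file_name are hypothetical names introduced for illustration; the sketch assumes every matching file sits exactly length(col_names) directories below root.

```r
# Hypothetical helper; `root`, `col_names` and `file_name` are illustrative.
# Assumption: each file lies exactly length(col_names) directories below root.
read_tree <- function(root, col_names, file_name = "data.csv") {
  # Paths relative to root, using "/" as the separator
  rel <- list.files(root, recursive = TRUE)
  rel <- rel[basename(rel) == file_name]

  frames <- lapply(rel, function(p) {
    # Directory components of the relative path, e.g. c("1", "a")
    dirs <- strsplit(dirname(p), "/", fixed = TRUE)[[1]]
    stopifnot(length(dirs) == length(col_names))
    labels <- as.data.frame(t(dirs), stringsAsFactors = FALSE)
    names(labels) <- col_names
    # The one-row label frame is recycled over the rows of the CSV
    cbind(read.csv(file.path(root, p)), labels)
  })
  do.call(rbind, frames)
}
```

For the layout in the original question, a call such as read_tree("results", c("ASSOC", "SIZE")) would append ASSOC and SIZE columns after the columns read from each file.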