It's easier than that.  I forgot I can do it entirely within R:

setwd("/temp/csv")
files <- Sys.glob("daily*csv")
output <- file('Rcombined.csv', 'w')
for (i in files){
    cat(i, '\n')  # show which file is being processed
    input <- readLines(i)
    input <- input[-1L]  # delete header
    writeLines(input, output)
}
close(output)
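
One follow-up worth noting: because the loop deletes each header before writing, Rcombined.csv has no header row, so you have to supply column names when reading it back in.  A minimal sketch (the column names below are placeholders, not the actual columns of the daily*.csv files):

```r
# Rcombined.csv was written without a header row, so header = FALSE
# and col.names are supplied explicitly (these names are made up)
combined <- read.csv("Rcombined.csv", header = FALSE,
                     col.names = c("date", "open", "high", "low", "close"))
str(combined)
```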



On Sat, Nov 3, 2012 at 4:56 PM, jim holtman <jholt...@gmail.com> wrote:
> These are not commands, but programs you can use.  Here is a file copy
> program in "perl" (I spelt it wrong in the email);  This will copy all
> the files that have "daily" in their names.  It also skips the first
> line of each file assuming that it is the header.
>
> perl  can be found on most systems.  www.activestate.com  has a
> version that runs under Windows and that is what I am using.
>
>
> chdir "/temp/csv";  # my directory with files
> @files = glob "daily*csv";  # get files to copy (daily......csv)
> open OUTPUT, ">combined.csv"; # output file
> # loop for each file
> foreach $file (@files) {
>     print $file, "\n";  # print file being processed
>     open INPUT, "<" . $file;
>     # assume that the first line is a header, so skip it
>     $header = <INPUT>;
>     @all = <INPUT>;  # read rest of the file
>     close INPUT;
>     print OUTPUT @all;  # append to the output
> }
> close OUTPUT;
>
> Here is what was printed on the console:
>
>
> C:\Users\Owner>perl copyFiles.pl
> daily.BO.csv
> daily.C.csv
> daily.CL.csv
> daily.CT.csv
> daily.GC.csv
> daily.HO.csv
> daily.KC.csv
> daily.LA.csv
> daily.LN.csv
> daily.LP.csv
> daily.LX.csv
> daily.NG.csv
> daily.S.csv
> daily.SB.csv
> daily.SI.csv
> daily.SM.csv
>
> Which was a list of all the files copied.
>
> On Sat, Nov 3, 2012 at 4:08 PM, Benjamin Caldwell
> <btcaldw...@berkeley.edu> wrote:
>> Jim,
>>
>> Where can I find documentation of the commands you mention?
>> Thanks
>>
>>
>>
>>
>>
>> On Sat, Nov 3, 2012 at 12:15 PM, jim holtman <jholt...@gmail.com> wrote:
>>>
>>> A faster way would be to use something like 'per', 'awk' or 'sed'.
>>> You can strip off the header line of each CSV (if it has one) and then
>>> concatenate the files together.  This is very efficient use of memory
>>> since you are just reading one file at a time and then writing it out.
>>>  Will probably be a lot faster since no conversions have to be done.
>>> Once you have the one large file, then you can play with it (load it
>>> if you have enough memory, or load it into a database).
>>>
>>> On Sat, Nov 3, 2012 at 11:37 AM, Jeff Newmiller
>>> <jdnew...@dcn.davis.ca.us> wrote:
>>> > In the absence of any data examples from you per the posting guidelines,
>>> > I will refer you to the help files for the melt function in the reshape2
>>> > package.  Note that there can be various mixtures of wide versus long...
>>> > such as a wide file with one date column and columns representing all
>>> > stock prices and all trade volumes.  The longest format would be what
>>> > melt gives (date, column name, and value), but an in-between format
>>> > would have one distinct column each for dollar values and volume values,
>>> > with a column indicating ticker label and of course another for date.
>>> >
>>> > If your csv files can be grouped according to those with similar column
>>> > "types", then as you read them in you can use cbind(csvlabel="somelabel",
>>> > csvdf) to distinguish each one, and then rbind those data frames together
>>> > to create an intermediate-width data frame.  When dealing with large
>>> > amounts of data you will want to minimize the amount of reshaping you
>>> > do, but it would require knowledge of your data and algorithms to say
>>> > any more.
>>> >
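
The wide-to-long reshaping described above can be sketched like this (a toy example with made-up ticker columns; assumes the reshape2 package is installed):

```r
library(reshape2)  # provides melt()

# A made-up wide data frame: one date column plus one price column per ticker
wide <- data.frame(date = as.Date("2012-11-01") + 0:2,
                   IBM  = c(190.1, 191.4, 193.0),
                   GE   = c(21.0, 21.2, 21.1))

# Long format: one row per (date, ticker) combination
long <- melt(wide, id.vars = "date",
             variable.name = "ticker", value.name = "price")
head(long)
```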
>>> > ---------------------------------------------------------------------------
>>> > Jeff Newmiller                        The     .....       .....  Go
>>> > Live...
>>> > DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
>>> > Go...
>>> >                                       Live:   OO#.. Dead: OO#..  Playing
>>> > Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> > /Software/Embedded Controllers)               .OO#.       .OO#.
>>> > rocks...1k
>>> >
>>> > ---------------------------------------------------------------------------
>>> > Sent from my phone. Please excuse my brevity.
>>> >
>>> > Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
>>> >
>>> >>Jeff,
>>> >>If you're willing to educate, I'd be happy to learn what wide vs long
>>> >>format means. I'll give rbind a shot in the meantime.
>>> >>Ben
>>> >>On Nov 2, 2012 4:31 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us>
>>> >>wrote:
>>> >>
>>> >>> I would first confirm that you need the data in wide format... many
>>> >>> algorithms are more efficient in long format anyway, and rbind is
>>> >>> way more efficient than merge.
>>> >>>
>>> >>> If you feel this is not negotiable, you may want to consider sqldf.
>>> >>> Yes, you need to learn a bit of SQL, but it is very well integrated
>>> >>> into R.
>>> >>>
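
The rbind-versus-merge point can be illustrated with a toy example (the data frames here are made up): when the frames share the same columns, a single do.call(rbind, ...) over the whole list stacks them with no key-matching work at all.

```r
# Two toy data frames with identical columns
a <- data.frame(date = 1:3, price = c(10, 11, 12))
b <- data.frame(date = 4:5, price = c(13, 14))

# rbind only appends rows; merge would try to match key columns
stacked <- do.call(rbind, list(a, b))
nrow(stacked)  # 5
```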
>>> >>>
>>> >>> Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
>>> >>>
>>> >>> >Dear R help;
>>> >>> >
>>> >>> >I'm currently trying to combine a large number (about 30 x 30) of
>>> >>> >large .csvs together (each at least 10000 records).  They are
>>> >>> >organized by plots, hence 30 x 30, with each group of csvs in a
>>> >>> >folder which corresponds to the plot.  The unmerged csvs all have
>>> >>> >the same number of columns (5).  The fifth column has a different
>>> >>> >name for each csv.  The number of rows is different.
>>> >>> >
>>> >>> >The combined csvs are of course quite large, and the code I'm
>>> >>> >running is quite slow - I'm currently running it on a computer with
>>> >>> >10 GB ram, ssd, and quad core 2.3 ghz processor; it's taken 8 hours
>>> >>> >and it's only 75% of the way through (it's hung up on one of the
>>> >>> >largest data groupings now for an hour, and using 3.5 gigs of RAM).
>>> >>> >
>>> >>> >I know that R isn't the most efficient way of doing this, but I'm
>>> >>> >not familiar with sql or C.  I wonder if anyone has suggestions for
>>> >>> >a different way to do this in the R environment.  For instance, the
>>> >>> >key function now is merge, but I haven't tried join from the plyr
>>> >>> >package or rbind from base.  I'm willing to provide a dropbox link
>>> >>> >to a couple of these files if you'd like to see the data.  My code
>>> >>> >is as follows:
>>> >>> >
>>> >>> >#multmerge is based on code by Tony Cookson,
>>> >>> >http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
>>> >>> >The function takes a path.  This path should be the name of a
>>> >>> >folder that contains all of the files you would like to read and
>>> >>> >merge together, and only those files you would like to merge.
>>> >>> >
>>> >>> >multmerge <- function(mypath){
>>> >>> >    filenames <- list.files(path=mypath, full.names=TRUE)
>>> >>> >    datalist <- try(lapply(filenames,
>>> >>> >                           function(x){read.csv(file=x, header=TRUE)}))
>>> >>> >    try(Reduce(function(x,y) {merge(x, y, all=TRUE)}, datalist))
>>> >>> >}
>>> >>> >
>>> >>> >#this function renames the columns using a fixed list and outputs a .csv
>>> >>> >
>>> >>> >merepk <- function(path, nf.name) {
>>> >>> >    output <- multmerge(mypath=path)
>>> >>> >    name <- c("x", "y", "z", "depth", "amplitude")
>>> >>> >    try(names(output) <- name)
>>> >>> >    write.csv(output, nf.name)
>>> >>> >}
>>> >>> >
>>> >>> >#assumes all folders are in the same directory, with nothing else there
>>> >>> >
>>> >>> >merge.by.folder <- function(folderpath){
>>> >>> >    foldernames <- list.files(path=folderpath)
>>> >>> >    n <- length(foldernames)
>>> >>> >    setwd(folderpath)
>>> >>> >    for (i in 1:n){
>>> >>> >        path <- paste(folderpath, foldernames[i], sep="\\")
>>> >>> >        nf.name <- as.character(paste(foldernames[i], ".csv", sep=""))
>>> >>> >        merepk(path, nf.name)
>>> >>> >    }
>>> >>> >}
>>> >>> >
>>> >>> >folderpath <- "yourpath"
>>> >>> >
>>> >>> >merge.by.folder(folderpath)
>>> >>> >
>>> >>> >
>>> >>> >Thanks for looking, and happy friday!
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >*Ben Caldwell*
>>> >>> >
>>> >>> >PhD Candidate
>>> >>> >University of California, Berkeley
>>> >>> >
>>> >>> >
>>> >>> >______________________________________________
>>> >>> >R-help@r-project.org mailing list
>>> >>> >https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>> >PLEASE do read the posting guide
>>> >>> >http://www.R-project.org/posting-guide.html
>>> >>> >and provide commented, minimal, self-contained, reproducible code.
>>> >>>
>>> >>>
>>> >
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>> Tell me what you want to do, not how you want to do it.
>>
>>
>
>
>




