Jim,

Where can I find documentation of the commands you mention?
Thanks

On Sat, Nov 3, 2012 at 12:15 PM, jim holtman <jholt...@gmail.com> wrote:

> A faster way would be to use something like 'perl', 'awk' or 'sed'.
> You can strip off the header line of each CSV (if it has one) and then
> concatenate the files together.  This is a very efficient use of memory
> since you are just reading one file at a time and then writing it out.
> It will probably be a lot faster since no conversions have to be done.
> Once you have the one large file, you can play with it (load it
> if you have enough memory, or load it into a database).
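[Editor's note] Jim's stream-and-concatenate approach can also be sketched in base R (the helper name `concat_csvs` is made up for illustration). Each file is read as plain text, so no type conversion happens, and only one file is in memory at a time:

```r
# Hypothetical sketch: append many CSVs into one file, keeping only the
# first file's header line, streaming one file at a time.
concat_csvs <- function(infiles, outfile) {
  con <- file(outfile, "w")
  first <- TRUE
  for (f in infiles) {
    lines <- readLines(f)           # plain text, no parsing or conversion
    if (!first) lines <- lines[-1]  # drop this file's repeated header
    writeLines(lines, con)
    first <- FALSE
  }
  close(con)
}
```

The same effect could be had with awk or sed outside R, as Jim suggests; this is just the equivalent idea without leaving the R session.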
>
> On Sat, Nov 3, 2012 at 11:37 AM, Jeff Newmiller
> <jdnew...@dcn.davis.ca.us> wrote:
> > In the absence of any data examples from you per the posting guidelines,
> > I will refer you to the help files for the melt function in the reshape2
> > package. Note that there can be various mixtures of wide versus long,
> > such as a wide file with one date column and columns representing all stock
> > prices and all trade volumes. The longest format would be what melt gives
> > (date, column name, and value), but an in-between format would have one
> > distinct column each for dollar values and volume values, with a column
> > indicating ticker label and, of course, another for date.
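[Editor's note] As a minimal sketch of what melt produces, assuming the reshape2 package is installed (the tickers, dates, and prices here are invented):

```r
library(reshape2)

# Hypothetical wide frame: one date column plus one price column per ticker
wide <- data.frame(date = as.Date(c("2012-11-01", "2012-11-02")),
                   AAPL = c(600, 605),
                   GOOG = c(680, 682))

# melt to the longest format: one row per (date, column name, value)
long <- melt(wide, id.vars = "date",
             variable.name = "ticker", value.name = "price")
```

`long` ends up with four rows and the columns date, ticker, and price, i.e. one row per date/ticker pair.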
> >
> > If your csv files can be grouped according to those with similar column
> > "types", then as you read them in you can use cbind(csvlabel="somelabel",
> > csvdf) to distinguish it and then rbind those data frames together to
> > create an intermediate-width data frame. When dealing with large amounts of
> > data you will want to minimize the amount of reshaping you do, but it would
> > require knowledge of your data and algorithms to say any more.
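[Editor's note] The cbind/rbind step Jeff describes might look like this (the labels and values are made up):

```r
# Two hypothetical frames read from CSVs that share the same column types
csv1 <- data.frame(date = c("2012-11-01", "2012-11-02"), price = c(10, 11))
csv2 <- data.frame(date = c("2012-11-01", "2012-11-02"), price = c(20, 21))

# Tag each frame with a label identifying its source, then stack the rows
tagged1 <- cbind(csvlabel = "fileA", csv1)
tagged2 <- cbind(csvlabel = "fileB", csv2)
combined <- rbind(tagged1, tagged2)
```

The result is an intermediate-width frame: one set of value columns, with the csvlabel column recording which file each row came from.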
> >
> > ---------------------------------------------------------------------------
> > Jeff Newmiller                        The     .....       .....  Go Live...
> > DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >                                       Live:   OO#.. Dead: OO#..  Playing
> > Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> > /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> > ---------------------------------------------------------------------------
> > Sent from my phone. Please excuse my brevity.
> >
> > Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >
> >>Jeff,
> >>If you're willing to educate, I'd be happy to learn what wide vs long
> >>format means. I'll give rbind a shot in the meantime.
> >>Ben
> >>On Nov 2, 2012 4:31 PM, "Jeff Newmiller" <jdnew...@dcn.davis.ca.us>
> >>wrote:
> >>
> >>> I would first confirm that you need the data in wide format... many
> >>> algorithms are more efficient in long format anyway, and rbind is way
> >>> more efficient than merge.
> >>>
> >>> If you feel this is not negotiable, you may want to consider sqldf.
> >>> Yes, you need to learn a bit of SQL, but it is very well integrated
> >>> into R.
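[Editor's note] A minimal sqldf sketch, assuming the sqldf package is installed (the table contents and column names are invented); sqldf evaluates standard SQL directly against data frames visible in the session:

```r
library(sqldf)

prices <- data.frame(ticker = c("AAPL", "GOOG", "MSFT"),
                     price  = c(600, 680, 29))

# Plain SQL, run against the 'prices' data frame in the R session
expensive <- sqldf("select ticker, price from prices where price > 100")
```

For the merging problem in this thread, the same mechanism would let one UNION ALL or JOIN the per-file tables without holding every intermediate result in R's memory at once.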
> >>>
> >>>
> >>> Benjamin Caldwell <btcaldw...@berkeley.edu> wrote:
> >>>
> >>> >Dear R help,
> >>> >
> >>> >I'm currently trying to combine a large number (about 30 x 30) of large
> >>> >.csvs together (each at least 10000 records). They are organized by plots,
> >>> >hence 30 x 30, with each group of csvs in a folder which corresponds to
> >>> >the plot. The unmerged csvs all have the same number of columns (5). The
> >>> >fifth column has a different name for each csv. The number of rows is
> >>> >different.
> >>> >
> >>> >The combined csvs are of course quite large, and the code I'm running
> >>> >is quite slow. I'm currently running it on a computer with 10 GB of RAM,
> >>> >an SSD, and a quad-core 2.3 GHz processor; it's taken 8 hours and it's
> >>> >only 75% of the way through (it's been hung up on one of the largest
> >>> >data groupings for an hour now, and is using 3.5 GB of RAM).
> >>> >
> >>> >I know that R isn't the most efficient way of doing this, but I'm not
> >>> >familiar with SQL or C. I wonder if anyone has suggestions for a
> >>> >different way to do this in the R environment. For instance, the key
> >>> >function now is merge, but I haven't tried join from the plyr package
> >>> >or rbind from base. I'm willing to provide a dropbox link to a couple
> >>> >of these files if you'd like to see the data. My code is as follows:
> >>> >
> >>> >
> >>> ># multmerge is based on code by Tony Cookson:
> >>> ># http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
> >>> ># The function takes a path. This path should be the name of a folder that
> >>> ># contains all of the files you would like to read and merge together, and
> >>> ># only those files.
> >>> >
> >>> >multmerge <- function(mypath){
> >>> >  filenames <- list.files(path = mypath, full.names = TRUE)
> >>> >  datalist <- try(lapply(filenames,
> >>> >                         function(x) read.csv(file = x, header = TRUE)))
> >>> >  try(Reduce(function(x, y) merge(x, y, all = TRUE), datalist))
> >>> >}
> >>> >
> >>> ># this function renames the columns using a fixed set of names
> >>> ># and outputs a .csv
> >>> >
> >>> >merepk <- function(path, nf.name){
> >>> >  output <- multmerge(mypath = path)
> >>> >  name <- c("x", "y", "z", "depth", "amplitude")
> >>> >  try(names(output) <- name)
> >>> >  write.csv(output, nf.name)
> >>> >}
> >>> >
> >>> ># assumes all folders are in the same directory, with nothing else there
> >>> >
> >>> >merge.by.folder <- function(folderpath){
> >>> >  foldernames <- list.files(path = folderpath)
> >>> >  n <- length(foldernames)
> >>> >  setwd(folderpath)
> >>> >  for (i in 1:n){
> >>> >    path <- paste(folderpath, foldernames[i], sep = "\\")
> >>> >    nf.name <- paste(foldernames[i], ".csv", sep = "")
> >>> >    merepk(path, nf.name)
> >>> >  }
> >>> >}
> >>> >
> >>> >folderpath <- "yourpath"
> >>> >
> >>> >merge.by.folder(folderpath)
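[Editor's note] Since the replies in this thread recommend rbind over merge, one hedged alternative to multmerge (reusing the fixed column names from merepk above; the source column and the helper name stack_csvs are additions for illustration) would stack each folder's files into long format instead of merging them:

```r
# Hypothetical rbind-based alternative: rename each file's columns to a
# common set and stack the rows, avoiding repeated merge() calls.
stack_csvs <- function(mypath) {
  filenames <- list.files(path = mypath, full.names = TRUE)
  frames <- lapply(filenames, function(f) {
    d <- read.csv(f, header = TRUE)
    names(d) <- c("x", "y", "z", "depth", "amplitude")
    d$source <- basename(f)  # remember which file each row came from
    d
  })
  do.call(rbind, frames)
}
```

This yields long format (one amplitude column plus a source label) rather than the wide result of Reduce/merge, so whether it fits depends on the downstream analysis; but rbind's cost grows linearly with the rows rather than with the pairwise matching merge performs.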
> >>> >
> >>> >
> >>> >Thanks for looking, and happy Friday!
> >>> >
> >>> >
> >>> >
> >>> >*Ben Caldwell*
> >>> >
> >>> >PhD Candidate
> >>> >University of California, Berkeley
> >>> >
> >>> >
> >>> >______________________________________________
> >>> >R-help@r-project.org mailing list
> >>> >https://stat.ethz.ch/mailman/listinfo/r-help
> >>> >PLEASE do read the posting guide
> >>> >http://www.R-project.org/posting-guide.html
> >>> >and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
