Re: [R] merging a large number of large .csvs
On 06.11.2012 05:24, Benjamin Caldwell wrote:
> Thanks to you all. Modifying the code to use rbind instead of merge
> worked like a charm - I can only speculate why.

Because merge is not intended for appending. See its help page.

Uwe Ligges
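A minimal illustration of Uwe's point, with invented toy data: merge performs a join on whatever columns the two frames share, while rbind simply appends rows, which is what combining these csvs actually requires.

a <- data.frame(x = 1:2, y = c("A", "B"))
b <- data.frame(x = 2:3, y = c("B", "C"))

merge(a, b, all = TRUE)  # full outer join on the shared columns x and y: 3 rows
rbind(a, b)              # plain append: 4 rows; the overlapping row appears twice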
Re: [R] merging a large number of large .csvs
Thanks to you all. Modifying the code to use rbind instead of merge worked like a charm - I can only speculate why.

Best

*Ben Caldwell*
PhD Candidate
University of California, Berkeley
130 Mulford Hall #3114
Berkeley, CA 94720
Office 223 Mulford Hall
(510) 859-3358
Re: [R] merging a large number of large .csvs
It's easier than that. I forgot I can do it entirely within R:

setwd("/temp/csv")
files <- Sys.glob("daily*csv")
output <- file('Rcombined.csv', 'w')
for (i in files){
    cat(i, '\n')              # show which file is being processed
    input <- readLines(i)
    input <- input[-1L]       # drop the header line
    writeLines(input, output)
}
close(output)
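One caveat, not raised in the thread: the loop above strips the first line from every file, so Rcombined.csv ends up with no header row at all. A minimal variant (a sketch, assuming all the files share one header) keeps the header from the first file only:

setwd("/temp/csv")
files  <- Sys.glob("daily*csv")
output <- file('Rcombined.csv', 'w')
for (k in seq_along(files)) {
    cat(files[k], '\n')              # show which file is being processed
    input <- readLines(files[k])
    if (k > 1) input <- input[-1L]   # keep the header only from the first file
    writeLines(input, output)
}
close(output)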
Re: [R] merging a large number of large .csvs
These are not commands, but programs you can use. Here is a file copy program in "perl" (I spelt it wrong in the earlier email). This will copy all the files that have "daily" in their names. It also skips the first line of each file, assuming that it is the header.

Perl can be found on most systems. www.activestate.com has a version that runs under Windows, which is what I am using.

chdir "/temp/csv";            # my directory with files
@files = glob "daily*csv";    # get files to copy (daily*.csv)
open OUTPUT, ">combined.csv"; # output file
# loop over each file
foreach $file (@files) {
    print $file, "\n";        # print file being processed
    open INPUT, "<" . $file;
    # assume that the first line is a header, so skip it
    $header = <INPUT>;
    @all = <INPUT>;           # read the rest of the file
    close INPUT;
    print OUTPUT @all;        # append to the output
}
close OUTPUT;

Here is what was printed on the console:

C:\Users\Owner>perl copyFiles.pl
daily.BO.csv
daily.C.csv
daily.CL.csv
daily.CT.csv
daily.GC.csv
daily.HO.csv
daily.KC.csv
daily.LA.csv
daily.LN.csv
daily.LP.csv
daily.LX.csv
daily.NG.csv
daily.S.csv
daily.SB.csv
daily.SI.csv
daily.SM.csv

which is the list of all the files copied.
Re: [R] merging a large number of large .csvs
Jim,

Where can I find documentation of the commands you mention?
Thanks
Re: [R] merging a large number of large .csvs
A faster way would be to use something like 'per', 'awk' or 'sed'. You can strip off the header line of each CSV (if it has one) and then concatenate the files together. This is a very efficient use of memory, since you are just reading one file at a time and then writing it out. It will probably be a lot faster, since no conversions have to be done. Once you have the one large file, you can play with it (load it if you have enough memory, or load it into a database).
Re: [R] merging a large number of large .csvs
In the absence of any data examples from you per the posting guidelines, I will refer you to the help files for the melt function in the reshape2 package. Note that there can be various mixtures of wide versus long... such as a wide file with one date column and columns representing all stock prices and all trade volumes. The longest format would be what melt gives (date, column name, and value), but an in-between format would have one distinct column each for dollar values and volume values, with a column indicating ticker label and of course another for date.

If your csv files can be grouped according to those with similar column "types", then as you read them in you can use cbind(csvlabel="somelabel", csvdf) to distinguish each one, and then rbind those data frames together to create an intermediate-width data frame. When dealing with large amounts of data you will want to minimize the amount of reshaping you do, but it would require knowledge of your data and algorithms to say any more.

---
Jeff Newmiller
Sent from my phone. Please excuse my brevity.
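A hedged sketch of the reshaping Jeff describes, using reshape2::melt with invented column names: the wide frame has one date column plus one price column per ticker, and melt collapses it to the longest format (date, column name, value).

library(reshape2)

wide <- data.frame(date = as.Date("2012-11-01") + 0:2,
                   AAPL = c(600.1, 605.3, 610.0),
                   MSFT = c(28.2, 29.0, 28.5))

long <- melt(wide, id.vars = "date",
             variable.name = "ticker", value.name = "price")
# long now has columns date, ticker, price - one row per date/ticker pair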
Re: [R] merging a large number of large .csvs
Jeff,

If you're willing to educate, I'd be happy to learn what wide vs long format means. I'll give rbind a shot in the meantime.

Ben
Re: [R] merging a large number of large .csvs
I would first confirm that you need the data in wide format... many algorithms are more efficient in long format anyway, and rbind is way more efficient than merge.

If you feel this is not negotiable, you may want to consider sqldf. Yes, you need to learn a bit of SQL, but it is very well integrated into R.

---
Jeff Newmiller
Sent from my phone. Please excuse my brevity.
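A hedged sketch of the sqldf route (frame names invented): sqldf exposes in-memory data frames to SQLite by name, so appending two frames is a UNION ALL rather than a join.

library(sqldf)

df1 <- data.frame(x = 1:2, y = c("A", "B"))
df2 <- data.frame(x = 3:4, y = c("C", "D"))

# sqldf looks up df1 and df2 in the calling environment
combined <- sqldf("SELECT * FROM df1 UNION ALL SELECT * FROM df2")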
[R] merging a large number of large .csvs
Dear R help;

I'm currently trying to combine a large number (about 30 x 30) of large .csvs together (each at least 1 records). They are organized by plots, hence 30 x 30, with each group of csvs in a folder which corresponds to the plot. The unmerged csvs all have the same number of columns (5). The fifth column has a different name for each csv. The number of rows is different.

The combined csvs are of course quite large, and the code I'm running is quite slow - I'm currently running it on a computer with 10 GB of RAM, an SSD, and a quad-core 2.3 GHz processor; it's taken 8 hours and it's only 75% of the way through (it's been hung up on one of the largest data groupings for an hour now, using 3.5 GB of RAM).

I know that R isn't the most efficient way of doing this, but I'm not familiar with SQL or C. I wonder if anyone has suggestions for a different way to do this in the R environment. For instance, the key function now is merge, but I haven't tried join from the plyr package or rbind from base. I'm willing to provide a dropbox link to a couple of these files if you'd like to see the data. My code is as follows:

# multmerge is based on code by Tony Cookson,
# http://www.r-bloggers.com/merging-multiple-data-files-into-one-data-frame/
# The function takes a path. This path should be the name of a folder that
# contains all of the files you would like to read and merge together, and
# only those files.

multmerge <- function(mypath){
    filenames <- list.files(path = mypath, full.names = TRUE)
    datalist <- try(lapply(filenames,
                           function(x){read.csv(file = x, header = TRUE)}))
    try(Reduce(function(x, y) {merge(x, y, all = TRUE)}, datalist))
}

# this function renames the columns using a fixed list and writes the
# merged result to a .csv

merepk <- function (path, nf.name) {
    output <- multmerge(mypath = path)
    name <- list("x", "y", "z", "depth", "amplitude")
    try(names(output) <- name)
    write.csv(output, nf.name)
}

# assumes all folders are in the same directory, with nothing else there

merge.by.folder <- function (folderpath){
    foldernames <- list.files(path = folderpath)
    n <- length(foldernames)
    setwd(folderpath)
    for (i in 1:n){
        path <- paste(folderpath, foldernames[i], sep = "\\")
        nf.name <- as.character(paste(foldernames[i], ".csv", sep = ""))
        merepk(path, nf.name)
    }
}

folderpath <- "yourpath"

merge.by.folder(folderpath)

Thanks for looking, and happy Friday!

*Ben Caldwell*
PhD Candidate
University of California, Berkeley
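A hedged sketch, not Ben's actual revision, of the rbind-based fix the thread converges on: because every csv has the same five columns and only the fifth differs in name, the columns can be renamed to a common set as each file is read, and the frames then appended in one step instead of merged pairwise.

multmerge_rbind <- function(mypath) {
    filenames <- list.files(path = mypath, full.names = TRUE)
    datalist <- lapply(filenames, function(f) {
        d <- read.csv(f, header = TRUE)
        names(d) <- c("x", "y", "z", "depth", "amplitude")
        d
    })
    do.call(rbind, datalist)   # one append instead of n-1 pairwise merges
}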