Hi:

There are some reasonably efficient ways to do this if you're a little clever. With 16 histograms per page, 928 of your 929 files fit onto 58 pages, with one file left over for a 59th page; with 4 per page, you would generate 233 pages of output. I could think of something like the following (pseudocode):
1. Create a matrix or list of file names, such that each row of the matrix, or component of the list, contains the names of the files whose histograms you want on a particular page.
2. Write a function that takes a generic row/component of file names as the input. Make sure that the input argument gets converted to a vector, and then write code creating the histograms and corresponding density plots for a single page of output.
3. Use apply() or lapply() to generate the plots, or suitable alternatives from other packages.

I was a little curious how one could pull this off. This is far from elegant, but it should give you an idea of how it can be done.

# Create 16 files of data, each consisting of 1000 normal random deviates
nams <- sprintf('y%02d.dat', 1:16)
for(i in 1:16) write(rnorm(1000), file = nams[i])
# or
# sapply(nams, function(x) write(rnorm(1000), file = x))

# Create a matrix of file names, in this case four per row:
namelist <- matrix(nams, ncol = 4, byrow = TRUE)
namelist

# Create a plot function that will generate k histograms, one per data
# set, and plot them with the variable names in panel strips.
# This uses the lattice graphics package and the plyr package.
pltfun <- function(filelist, nintervals = 100, col = 'cornsilk') {
    require(lattice)
    require(plyr)
    options(warn = -1)                       # suppress warnings
    ll <- lapply(filelist, scan, what = 0)   # read in data from each file -> list
    st <- strsplit(filelist, '\\.')          # split input file names - part before . is varname
    vnames <- sapply(st, function(x) x[1])   # flatten to vector
    names(ll) <- vnames                      # assign vector of names to list
    df <- ldply(ll, cbind)                   # flatten list to data frame
    print(histogram( ~ V1 | .id, data = df, nint = nintervals,
          col = col, xlab = '', type = 'density', as.table = TRUE,
          panel = function(x, ...) {
              panel.histogram(x, ...)
              panel.mathdensity(dmath = dnorm, col = 'black', lwd = 2,
                                args = list(mean = mean(x), sd = sd(x)))
          }))
    Sys.sleep(2)   # period to wait before next function call (secs)
    invisible()
}

# To see the plots run by, call
apply(namelist, 1, pltfun)

###########################################################
### To export the plots to files, I used the following:

# Create a vector of output file names for plots:
plotnames <- sprintf('plot%02d.pdf', 1:4)
# Tag it on as an extra column of the namelist matrix
allfiles <- cbind(namelist, unname(plotnames))

# Create a generic function to read in the data from a given row of
# namelist, convert it into a data frame, pass it into histogram() and
# export the output to the corresponding plotname file.
pltout <- function(files, nintervals = 100, col = 'cornsilk') {
    options(warn = -1)
    require(lattice)
    require(plyr)
    filelist <- files[-length(files)]        # list of files to read
    plotfile <- files[length(files)]         # name of output file
    ll <- lapply(filelist, scan, what = 0)   # read the data in
    st <- strsplit(filelist, '\\.')          # part before the . becomes the variable name
    vnames <- sapply(st, function(x) x[1])   # flatten to vector
    names(ll) <- vnames                      # apply names to list components
    df <- ldply(ll, cbind)                   # flatten to data frame, names in .id
    pdf(file = plotfile)                     # open the graphics device
    print(histogram( ~ V1 | .id, data = df, nint = nintervals,
          col = col, xlab = '', type = 'density', as.table = TRUE,
          panel = function(x, ...) {
              panel.histogram(x, ...)
              panel.mathdensity(dmath = dnorm, col = 'black', lwd = 2,
                                args = list(mean = mean(x), sd = sd(x)))
          }))
    dev.off()                                # close it
    invisible()
}

# Apply the function to each row of the allfiles matrix
apply(allfiles, 1, pltout)

This is a comparatively small example, but it gives you a general idea of how to do this sort of thing. For 929 input files, it would be a good idea to put them all in one directory.
See ?list.files for an easy way to get all the file names in a single directory in one call. You can then create a matrix from that vector and use the above ideas to generate plots.

Part of the reason I did it this way was to economize on memory usage in the global environment. When each input file is a vector of 1.5 million elements and you plot four files at a time, the data frame that goes into lattice is 6M x 2. If you were to plot 16 histograms per page, multiply the number of rows by 4. Not only will the data frame be four times as long, the size of the graph file in KB will increase dramatically, too.

Testing the difference in execution time between four graphs of four files vs. one graph of 16 files:

> system.time(apply(allfiles, 1, pltout))
   user  system elapsed
   0.75    0.03    0.79
> system.time(pltout(c(nams, 'hist16.pdf')))
   user  system elapsed
   0.58    0.00    0.62

so the 16-plot graph is a little bit faster for this toy example, where the files had 1000 elements each. Yours are much larger, so let's see what happens:

> for(i in 1:16) write(rnorm(1500000), file = nams[i])
> system.time(pltout(c(nams, 'hist16.pdf')))
Read 1500000 items
<snip>
   user  system elapsed
  65.72   11.39   77.60
> system.time(apply(allfiles, 1, pltout))
Read 1500000 items
<snip>
   user  system elapsed
  48.61    7.74   57.08

It took about 8 minutes to write out the data, so this is comparatively fast :) It suggests that with 16-histogram pages, it ought to take you somewhere between 1 and 1.5 hours to produce all 59 plots, while with four histograms at a time it should take several minutes less.

Having looked at the plots, the histograms are pretty smooth with sample sizes of 1.5M per plot - you might want to consider whether it's really necessary to plot the entire 1.5M at a time, or whether plotting all of the tail observations plus a proportional sample of the center (50-60%, maybe?) would be nearly as accurate.
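As a sketch of that list.files() route: the directory name, file pattern, and files-per-page value below are made up for illustration, and the example writes a few toy files into a temp directory first so it is self-contained.

```r
# Sketch: build the namelist matrix from a directory of data files.
# The directory, '\\.dat$' pattern, and nper value are assumptions;
# a handful of tiny files are created first so this runs as-is.
dirpath <- file.path(tempdir(), 'histdata')
dir.create(dirpath, showWarnings = FALSE)
for (i in 1:9) write(rnorm(10), file = file.path(dirpath, sprintf('y%02d.dat', i)))

fnames <- list.files(path = dirpath, pattern = '\\.dat$', full.names = TRUE)
nper   <- 4                               # histograms per page
npages <- ceiling(length(fnames) / nper)  # 9 files -> 3 pages
length(fnames) <- nper * npages           # pad the last row with NA
namelist <- matrix(fnames, ncol = nper, byrow = TRUE)
```

With 929 files and nper = 16 this gives a 59-row matrix; the plotting function would then want a one-line guard such as filelist <- filelist[!is.na(filelist)] so the padding in the last row is skipped.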
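If you want to try that tails-plus-sample idea, here is one possible sketch; the 1%/99% cutoffs and 50% center fraction are arbitrary choices of mine, not recommendations.

```r
# Sketch of "all tail observations plus a proportional sample of the
# center". The 1%/99% cutoffs and 50% keep rate are assumptions.
thin_for_hist <- function(x, lo = 0.01, hi = 0.99, keep = 0.5) {
  q <- quantile(x, c(lo, hi))
  tails  <- x[x < q[1] | x > q[2]]    # keep every tail observation
  center <- x[x >= q[1] & x <= q[2]]  # thin the middle by sampling
  c(tails, sample(center, size = round(keep * length(center))))
}

set.seed(1)
y <- rnorm(1500)
ythin <- thin_for_hist(y)   # roughly half the size, extremes intact
```

Note that because the center is thinned while the tails are kept whole, a density-scaled histogram of the result will sit slightly high in the tails, so treat this as a screening device rather than an exact picture.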
You should also run this script on your own to see how your timings match with mine, so that you can estimate how long it would take to run these plots at your site.

HTH,
Dennis

On Wed, Feb 16, 2011 at 2:17 AM, smms <s...@hotmail.ca> wrote:
>
> Hello,
>
> I have multiple data files. Each file contains a single column and 1.5
> million rows. I want to create normalized pdfs (area under curve is 1) and
> histograms to compare with one another. Could anybody suggest an easy way
> or built-in function in R?
>
> At present I am using Origin and Excel together to do this. A single file
> needs 10 minutes and I have a total of 929 files! So you can imagine the
> frustration.
>
> A data file looks like this:
>
> 4.3
> 7.6
> 3.2
> 1.6
> .
> .
> .
> .
> 4.6
>
> Many thanks in advance for saving me so much time.
> --
> View this message in context:
> http://r.789695.n4.nabble.com/how-to-create-normalized-pdf-plot-tp3308522p3308522.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.