Hi:

There are some reasonably efficient ways to do this if you're a little
clever. With 16 histograms per page, your 929 files fit on 59 pages (58
full pages of 16 plus one leftover histogram); with 4 per page, you'd get
233 pages of output. I can think of something like the following
(pseudocode):

1. Create a matrix or list of file names, such that each row of the matrix,
    or component of the list, contains the names of the files whose
    histograms you want on a particular page.
2. Write a function that takes a generic row/component of file names as the
    input. Make sure that the input argument gets converted to a vector, and
    then write code creating the histograms and corresponding density plots
    for a single page of output.
3. Use apply() or lapply() to generate the plots, or suitable alternatives
    from packages.

I was a little curious how one could pull this off. This is far from
elegant, but it should give you an idea of how it can be done.

# Create 16 files of data, each consisting of 1000 normal random deviates
nams <- sprintf('y%02d.dat', 1:16)
for(i in 1:16)   write(rnorm(1000), file = nams[i])
# or
# sapply(nams, function(x) write(rnorm(1000), file = x) )

# Create a matrix of file names, in this case four per row:
namelist <- matrix(nams, ncol = 4, byrow = TRUE)
namelist

# Create a plot function that will generate k histograms,
# one per data set, and plot them with the variable names in panel strips
# This uses the lattice graphics package and the plyr package

pltfun <- function(filelist, nintervals = 100, col = 'cornsilk') {
    require(lattice)
    require(plyr)
    options(warn = -1)                       # suppress warnings
    ll <- lapply(filelist, scan, what = 0)   # read each file into a list component
    st <- strsplit(filelist, '\\.')          # split file names: part before the . is the varname
    vnames <- sapply(st, function(x) x[1])   # flatten to a vector of names
    names(ll) <- vnames                      # assign the vector of names to the list
    df <- ldply(ll, cbind)                   # flatten list to data frame (names go into .id)
    print(histogram( ~ V1 | .id, data = df, nint = nintervals, col = col,
                     xlab = '', type = 'density', as.table = TRUE,
                     panel = function(x, ...) {
                        panel.histogram(x, ...)
                        panel.mathdensity(dmath = dnorm, col = "black", lwd = 2,
                                          args = list(mean = mean(x), sd = sd(x)))
                     } ) )
    Sys.sleep(2)    # period to wait before the next function call (secs)
    invisible()
}

# To see the plots run by, call
apply(namelist, 1, pltfun)

###########################################################
### To export the plots to files, I used the following:

# Create a vector of output file names for plots:
plotnames <- sprintf('plot%02d.pdf', 1:4)

# Tag it on as an extra column of the namelist matrix
allfiles <- cbind(namelist, unname(plotnames))

# Create a generic function to read in the data from a given row of
# namelist, convert it into a data frame, pass it into histogram() and
# export the output to the corresponding plotname file.

pltout <- function(files, nintervals = 100, col = 'cornsilk') {
    options(warn = -1)
    require(lattice)
    require(plyr)
    filelist <- files[-length(files)]        # files to read
    plotfile <- files[length(files)]         # name of the output file
    ll <- lapply(filelist, scan, what = 0)   # read the data in
    st <- strsplit(filelist, '\\.')          # use the part before the . as variable names
    vnames <- sapply(st, function(x) x[1])   # flatten to a vector of names
    names(ll) <- vnames                      # apply names to list components
    df <- ldply(ll, cbind)                   # flatten to data frame with names as a variable (.id)
    pdf(file = plotfile)                     # open the graphics device
    print(histogram( ~ V1 | .id, data = df, nint = nintervals, col = col,
                     xlab = '', type = 'density', as.table = TRUE,
                     panel = function(x, ...) {
                        panel.histogram(x, ...)
                        panel.mathdensity(dmath = dnorm, col = "black", lwd = 2,
                                          args = list(mean = mean(x), sd = sd(x)))
                     } ) )
    dev.off()                                # close it
    invisible()
}

# Apply the function to each row of the allfiles matrix
apply(allfiles, 1, pltout)

This is a comparatively small example, but it gives you a general idea of
how to do this sort of thing. For 929 input files, it would be a good idea
to put them all in one directory. See ?list.files for an easy way to get all
the filenames in a single directory in one call. You can then create a
matrix from that vector and use the above ideas to generate plots.
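
As a sketch of that last step (this assumes all of the data files share a
'.dat' extension and sit in the working directory - adjust the pattern and
the path to suit your setup):

# Collect every .dat file name in the directory in one call
fnames <- list.files(pattern = '\\.dat$')

# 929 is not a multiple of 16, so pad with NA before reshaping;
# the plotting function then needs to drop the NAs first, e.g.
# filelist <- filelist[!is.na(filelist)]
perpage <- 16
npad <- ceiling(length(fnames) / perpage) * perpage - length(fnames)
namelist <- matrix(c(fnames, rep(NA, npad)), ncol = perpage, byrow = TRUE)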

Part of the reason I did it this way was to economize on memory usage in the
global environment. When each input file is a vector of 1.5 million elements
and you plot four files at a time, the data frame that goes into lattice is
6M x 2. If you were to plot 16 histograms per page, multiply the number of
rows by 4. Not only will the data frame be four times as long, the size of
the graphics file in Kb will increase dramatically, too. Testing the
difference in execution time between four graphs of four files each vs. one
graph of 16 files,

> system.time(apply(allfiles, 1, pltout))
   user  system elapsed
   0.75    0.03    0.79
> system.time(pltout(c(nams, 'hist16.pdf')))
   user  system elapsed
   0.58    0.00    0.62

so the 16 plot graph is a little bit faster for this toy example where the
files had 1000 elements. Yours are much larger, so let's see what happens:

> for(i in 1:16)   write(rnorm(1500000), file = nams[i])
> system.time(pltout(c(nams, 'hist16.pdf')))
Read 1500000 items
<snip>
   user  system elapsed
  65.72   11.39   77.60
> system.time(apply(allfiles, 1, pltout))
Read 1500000 items
<snip>
   user  system elapsed
  48.61    7.74   57.08

It took about 8 minutes to write out the data, so this is comparatively fast
:) It suggests that with 16-histogram plots, it ought to take you somewhere
between 1 and 1.5 hours to do all 59 plots, while with four histograms at a
time, it should take several minutes less. Having looked at the plots, the
histograms are pretty smooth with sample sizes of 1.5M per plot - you might
want to consider if it's really necessary to plot the entire 1.5M at a time,
or whether plotting all of the tail observations plus a proportional sample
of the center (50 - 60%, maybe?) would be nearly as accurate.
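
If you want to try that thinning idea, here's one possible sketch (the
5%/95% quantile cutoffs and the 50% center sampling rate are arbitrary
choices on my part, not recommendations - tune them to your data):

# Keep all tail observations, plus a random sample of the center
thin <- function(x, lo = 0.05, hi = 0.95, frac = 0.5) {
    qs <- quantile(x, probs = c(lo, hi))
    tails  <- x[x < qs[1] | x > qs[2]]
    center <- x[x >= qs[1] & x <= qs[2]]
    c(tails, sample(center, size = floor(frac * length(center))))
}

# e.g., inside the plot function, after reading the data:
# ll <- lapply(ll, thin)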

You should also run this script on your own to see how your timings match
with mine, so that you can estimate how long it would take to run these
plots at your site.

HTH,
Dennis





On Wed, Feb 16, 2011 at 2:17 AM, smms <s...@hotmail.ca> wrote:

>
> Hello,
>
> I have multiple data files. Each file contains a single column and 1.5
> million rows. I want to create normalized pdfs (area under curve is 1) and
> histograms to compare with one another. Could anybody suggest if there
> exists an easy way or built in function in R.
>
> At present I am using Origin and Excel together to do this. A single file
> needs 10 minutes and I have a total of 929 files! So you can imagine the
> frustration.
>
> Data file is like this,
>
> 4.3
> 7.6
> 3.2
> 1.6
> .
> .
> .
> .
> 4.6
>
> Many thanks in advance to save my rigorous time.
> --
> View this message in context:
> http://r.789695.n4.nabble.com/how-to-create-normalized-pdf-plot-tp3308522p3308522.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
