On Jul 27, 2011, at 9:42 PM, Dennis Murphy wrote:

Hi:

Is this more or less what you're after?

## Note: This is the preferred way to send your data by e-mail.
## I used dput(data-frame-name) to produce this,
## where data-frame-name = 'df' on my end.
df <- structure(list(V1 = c("chr1", "chr1", "chr1", "chr1", "chr3",
"chr4", "chr4", "chr7", "chr7", "chr9", "chr11", "chr11", "chr22",
"chr22", "chr22"), V2 = c(100L, 100L, 200L, 500L, 450L, 100L,
100L, 350L, 350L, 100L, 679L, 679L, 100L, 100L, 300L), V3 = c(159L,
159L, 260L, 750L, 700L, 300L, 300L, 600L, 600L, 125L, 687L, 687L,
200L, 200L, 400L), V4 = c(104L, 145L, 205L, 600L, 500L, 150L,
175L, 400L, 550L, 100L, 680L, 681L, 105L, 110L, 350L), V5 = c(104L,
145L, 205L, 600L, 500L, 150L, 175L, 400L, 550L, 100L, 680L, 681L,
105L, 110L, 350L), V6 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L,
1L, 1L, 0L, 1L, 1L, 0L), V7 = c(0.05, 0.04, 0.12, 0.09, 0.03,
0.05, 0, 0.06, 0, 0.1, 0.07, 0, 0.03, 0.08, 0), V8 = c("+", "+",
"+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"
)), .Names = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8"
), class = "data.frame", row.names = c(NA, -15L))

############
# This is the structure you should see:
str(df)
'data.frame':   15 obs. of  8 variables:
$ V1: chr  "chr1" "chr1" "chr1" "chr1" ...
$ V2: int  100 100 200 500 450 100 100 350 350 100 ...
$ V3: int  159 159 260 750 700 300 300 600 600 125 ...
$ V4: int  104 145 205 600 500 150 175 400 550 100 ...
$ V5: int  104 145 205 600 500 150 175 400 550 100 ...
$ V6: int  1 1 1 1 1 1 0 1 0 1 ...
$ V7: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
$ V8: chr  "+" "+" "+" "+" ...
############

# Method 1: Write a function and call ddply()
summfun <- function(d)  {
   dsum <- as.data.frame(as.list(summary(d[['V7']])))
   names(dsum) <- c('Min', 'Q1', 'Median', 'Mean', 'Q3', 'Max')
   data.frame(V3 = d[1, 'V3'], dsum)
 }
library('plyr')
ddply(df, .(V1, V2), summfun)

The idea behind summfun is this: ddply() prefers functions that take a
data frame as input and a data frame (or scalar) as output. dsum
converts summary(V7) to a data frame by first coercing it into a list
and then to a data frame. The names are changed for convenience. dsum
has one line, so we add V3 to the data frame before outputting it.
ddply() will attach the grouping variables to the output
automatically; however, you can put them into the output data frame
and ddply() will not duplicate the grouping variables in the output.

The alternative in ddply(), which is simpler code, outputs the results
from summary() in different rows for each grouping. In this event, it
is useful to carry along the names of the summaries so that one can
recast the data with the cast() function from the reshape package:

# Method 2: Summarize and reshape
# V3 is unnecessary but it is useful to carry it along for the output
u <- ddply(df, .(V1, V2, V3), summarise, summ = summary(V7),
                      summtype = names(summary(V7)))
library('reshape')
cast(u, V1 + V2 + V3 ~ summtype, value = 'summ')

HTH,
Dennis

PS: I may be one of those folks to whom David was referring in
relation to plyr :)

I've been really impressed at Dennis' facility with plyr, reshape, and reshape2. Note that the 'reshape' function has nothing to do with the 'reshape' package. Here's what I came up with using base functions:

> str(inpdat)
'data.frame':   15 obs. of  8 variables:
 $ chromosome : chr  "chr1" "chr1" "chr1" "chr1" ...
 $ startreg   : int  100 100 200 500 450 100 100 350 350 100 ...
 $ endreg     : int  159 159 260 750 700 300 300 600 600 125 ...
 $ base1      : int  104 145 205 600 500 150 175 400 550 100 ...
 $ base2      : int  104 145 205 600 500 150 175 400 550 100 ...
 $ totalreads : int  1 1 1 1 1 1 0 1 0 1 ...
 $ methylation: num  0.05 0.04 0.12 0.09 0.03 0.05 0 0.06 0 0.1 ...
 $ strand     : chr  "+" "+" "+" "+" ...
# The split into distinct 'chromosome' and 'startreg' categories:
splinp <- split(inpdat, paste(inpdat$chromosome, inpdat$startreg) )

# Process within separate categories: the tapply, aggragate and by functions are all related

> df <- as.data.frame( t(sapply(splinp, function(x) list(chr=x $chromosome[1], strt=x$startreg[1], end=x$endreg[1], frac=sum(x[['totalreads']]>=1)/nrow(x) )) ) )
# You often need the t() function when working with apply functions
> df
            chr strt end frac
chr1 100   chr1  100 159    1
chr1 200   chr1  200 260    1
chr1 500   chr1  500 750    1
chr11 679 chr11  679 687  0.5
chr22 100 chr22  100 200    1
chr22 300 chr22  300 400    0
chr3 450   chr3  450 700    1
chr4 100   chr4  100 300  0.5
chr7 350   chr7  350 600  0.5
chr9 100   chr9  100 125    1

> as.data.frame(t(sapply(splinp, function(x) summary(x $methylation )) ) )
          Min. 1st Qu. Median  Mean 3rd Qu. Max.
chr1 100  0.04  0.0425  0.045 0.045  0.0475 0.05
chr1 200  0.12  0.1200  0.120 0.120  0.1200 0.12
chr1 500  0.09  0.0900  0.090 0.090  0.0900 0.09
chr11 679 0.00  0.0175  0.035 0.035  0.0525 0.07
chr22 100 0.03  0.0425  0.055 0.055  0.0675 0.08
chr22 300 0.00  0.0000  0.000 0.000  0.0000 0.00
chr3 450  0.03  0.0300  0.030 0.030  0.0300 0.03
chr4 100  0.00  0.0125  0.025 0.025  0.0375 0.05
chr7 350  0.00  0.0150  0.030 0.030  0.0450 0.06
chr9 100  0.10  0.1000  0.100 0.100  0.1000 0.10

# The coup de grace: bind the columns together

> df.summ <- as.data.frame(t(sapply(splinp, function(x) summary(x $methylation )) ) )
> cbind(df, df.summ)
            chr strt end frac Min. 1st Qu. Median  Mean 3rd Qu. Max.
chr1 100   chr1  100 159    1 0.04  0.0425  0.045 0.045  0.0475 0.05
chr1 200   chr1  200 260    1 0.12  0.1200  0.120 0.120  0.1200 0.12
chr1 500   chr1  500 750    1 0.09  0.0900  0.090 0.090  0.0900 0.09
chr11 679 chr11  679 687  0.5 0.00  0.0175  0.035 0.035  0.0525 0.07
chr22 100 chr22  100 200    1 0.03  0.0425  0.055 0.055  0.0675 0.08
chr22 300 chr22  300 400    0 0.00  0.0000  0.000 0.000  0.0000 0.00
chr3 450   chr3  450 700    1 0.03  0.0300  0.030 0.030  0.0300 0.03
chr4 100   chr4  100 300  0.5 0.00  0.0125  0.025 0.025  0.0375 0.05
chr7 350   chr7  350 600  0.5 0.00  0.0150  0.030 0.030  0.0450 0.06
chr9 100   chr9  100 125    1 0.10  0.1000  0.100 0.100  0.1000 0.10

--
Best;

David.


On Wed, Jul 27, 2011 at 4:02 PM, a217 <aj...@case.edu> wrote:
Hello,

I have an input file:
http://r.789695.n4.nabble.com/file/n3700031/testOut.txt testOut.txt

where col 1 is chromosome, column2 is start of region, column 3 is end of region, column 4 and 5 is base position, column 6 is total reads, column 7
is methylation data, and column 8 is the strand.


I would like a summary output file such as:
http://r.789695.n4.nabble.com/file/n3700031/out.summary.txt out.summary.txt

where column 1 is chromosome, column 2 is start of region, column 3 is end of region, column 4 is total reads in general, column 5 is total reads >=1, column 6 is (col4/col5) or the percentage, and at the end I'd like to list 6
more columns based on summary results from summary() function in R.

The summary() function will be used to analyze all of the methylation data
(col7 from input) for each region (bounded by col2 and col3).

For example for chr1 100 159 summary() gives:
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0400  0.0425  0.0450  0.0450  0.0475  0.0500

which is simply the methylation data input into summary() only in the region
of chr1 100 159.

I know how to perform all of the required functions line-by-line, but the
hard part for me is essentially taking the input data with multiple
positions in each region and assigning all of the summary results to one
line identified by the region.

If any of you have any suggestions I would appreciate it.

--
View this message in context: 
http://r.789695.n4.nabble.com/Writing-a-summary-file-in-R-tp3700031p3700031.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to