On 09/02/2011 10:19 AM, wang peter wrote:
hi, each one who have RNA-seq pipeline code can you share them with me? thx shan gao
* Scenario This is a simple work flow for single-end RNA-seq looking at differential representation of known genes. The data is in a directory dataDir, as implied by the commands below (<...> indicates info to be provide by the user). dataDir <- <...> fastqDir <- file.path(dataDir, "fastq") bamDir <- file.path(dataDir, "bam") outputDir <- file.path(dataDir, "output") The initial data (up to differential representation analysis) is 'big' and we have access to large-memory linux machines where the multicore package is available. library(multicore) lapply <- mclapply sapply <- function(...) simplify2array(mclapply(...)) The following uses 'devel' versions of R and Bioconductor packages; these will become the next 'release' at the end of October, 2011. * Sequencing Illumina sequencing is done outside R (!). We have access to fastq files, in the fastqDir directory. * Quality assessment Use ShortRead to produce a basic QA report. Down-sample fastq files if they are big. library(ShortRead) fls <- list.files(fastqDir, "fastq$", full=TRUE) names(fls) <- sub(".fastq", "", basename(fls)) ## use FastqSampler if fastq files are large qas <- lapply(seq_along(fls), function(i, fls) qa(readFastq(fls[i]), names(fls)[i]), fls) qa <- do.call(rbind, qas) save(qa, file=file.path(outputDir, "qa.rda") browseURL(report(qa)) * Alignment Alignment is done outside R. Output is a BAM file, one per sample. Biostrings::matchPDict (in some circumstances) and Rsubread might be used. * Known genes Known gene information can come from a variety of sources, conveniently from UCSC or biomaRt using the GenomicFeatures package. These can be saved to disk as sqlite data bases for future use or to be shared with lab mates. There are also semi-annual snapshots available, as used here. library(TxDb.Scerevisiae.UCSC.sacCer2.ensGene) txdb <- Scerevisiae_UCSC_sacCer2_ensGene_TxDb gnModel <- exonsBy(txdb, "gene") * Counting There are different ways to count, some of which are implemented in GenomicRanges::countGappedAlignments. I like a simple approach that counts any kind of overlap between known genes as a 'hit', and that discards a read hitting more than one gene. This will be unsatisfactory in many situations. bamFls <- list.files(bamDir, "bam$", full=TRUE) names(bamFls) <- sub("\\..*", "", basename(bamFls)) counter <- function(fl, gnModel) { aln <- readGappedAlignments(fl) strand(aln) <- "*" # for strand-blind sample prep protocol hits <- countOverlaps(aln, gnModel) counts <- countOverlaps(gnModel, aln[hits==1]) names(counts) <- names(gnModel) counts } counts <- sapply(bamFls, counter, gnModel) save(counts, file=file.path(outputDir, "counts.rda")) * Differential representation edgeR and DESeq are mature packages for the analysis of differential representation, taking a sophisticated approach to the statistical analysis; DEXSeq is a recent package to identify differential exon use. An edgeR work flow is library(edgeR) ## identify treatment groups grp <- factor(<...>) ## create data structures dge <- DGEList(counts, group=grp) dge <- calcNormFactors(dge) ## filter uniformative genes m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size) ridx <- rowSums(m > 1) >= 2 dge <- dge[ridx,] ## comparison between groups design <- model.matrix( ~ grp ) dge <- estimateCommonDisp(dge, design) fit <- glmFit(dge, design, dispersion=dge$common.dispersion) lrTest <- glmLRT(dge, fit, coef=2) tt <- topTags(lrTest, Inf) save(tt, file=file.path(dataDir, "tt.rda")) Two sanity checks along the way are that the treatment groups are relatively distinct in an MDS plot, and that the differentially representation tags really are. plotMDS.DGEList(dge) sapply(rownames(tt$table)[1:4], function(x, grp) tapply(counts[x,], grp, mean), grp) * Annotation Some annotation information is available from the org* model organism packages; the biomaRt package is an alternative. Here we get the sgd ids from the top table, and use these to add gene name annotations library(org.Sc.sgd.db) ttids <- rownames(tt$table) sgd2gnname <- mget(ttids, org.Sc.sgdGENENAME, ifnotfound=NA) ttAnno <- cbind(tt$table, genename=unlist(sgd2gnname)) A next step might involve gene set analyses, as in the goseq package. Martin
[[alternative HTML version deleted]] _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
-- Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793 _______________________________________________ Bioc-sig-sequencing mailing list Bioc-sig-sequencing@r-project.org https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing