Re: [Bioc-sig-seq] who have RNA-seq pipeline code

Martin Morgan Fri, 02 Sep 2011 15:31:30 -0700

On 09/02/2011 10:19 AM, wang peter wrote:

hi, each one
who have RNA-seq pipeline code
can you share them with me?
thx
shan gao


* Scenario

This is a simple work flow for single-end RNA-seq looking at
differential representation of known genes. The data is in a directory
dataDir, as implied by the commands below (<...> indicates info to be
provide by the user).

  dataDir <- <...>
  fastqDir <- file.path(dataDir, "fastq")
  bamDir <- file.path(dataDir, "bam")
  outputDir <- file.path(dataDir, "output")

The initial data (up to differential representation analysis) is
'big' and we have access to large-memory linux machines where the
multicore package is available.

  library(multicore)
  lapply <- mclapply
  sapply <- function(...) simplify2array(mclapply(...))

The following uses 'devel' versions of R and Bioconductor packages;
these will become the next 'release' at the end of October, 2011.

* Sequencing

Illumina sequencing is done outside R (!). We have access to fastq
files, in the fastqDir directory.

* Quality assessment

Use ShortRead to produce a basic QA report. Down-sample fastq files if
they are big.

  library(ShortRead)
  fls <- list.files(fastqDir, "fastq$", full=TRUE)
  names(fls) <- sub(".fastq", "", basename(fls))
  ## use FastqSampler if fastq files are large
  qas <- lapply(seq_along(fls),
                function(i, fls) qa(readFastq(fls[i]), names(fls)[i]),
                fls)
  qa <- do.call(rbind, qas)
  save(qa, file=file.path(outputDir, "qa.rda")
  browseURL(report(qa))

* Alignment

Alignment is done outside R. Output is a BAM file, one per
sample. Biostrings::matchPDict (in some circumstances) and Rsubread
might be used.

* Known genes

Known gene information can come from a variety of sources,
conveniently from UCSC or biomaRt using the GenomicFeatures
package. These can be saved to disk as sqlite data bases for future
use or to be shared with lab mates. There are also semi-annual
snapshots available, as used here.

  library(TxDb.Scerevisiae.UCSC.sacCer2.ensGene)
  txdb <- Scerevisiae_UCSC_sacCer2_ensGene_TxDb
  gnModel <- exonsBy(txdb, "gene")

* Counting

There are different ways to count, some of which are implemented in
GenomicRanges::countGappedAlignments. I like a simple approach that
counts any kind of overlap between known genes as a 'hit', and that
discards a read hitting more than one gene. This will be
unsatisfactory in many situations.

  bamFls <- list.files(bamDir, "bam$", full=TRUE)
  names(bamFls) <- sub("\\..*", "", basename(bamFls))
  counter <- function(fl, gnModel)
  {
      aln <- readGappedAlignments(fl)
      strand(aln) <- "*" # for strand-blind sample prep protocol
      hits <- countOverlaps(aln, gnModel)
      counts <- countOverlaps(gnModel, aln[hits==1])
      names(counts) <- names(gnModel)
      counts
  }
  counts <- sapply(bamFls, counter, gnModel)
  save(counts, file=file.path(outputDir, "counts.rda"))

* Differential representation

edgeR and DESeq are mature packages for the analysis of differential
representation, taking a sophisticated approach to the statistical
analysis; DEXSeq is a recent package to identify differential exon
use. An edgeR work flow is

  library(edgeR)

  ## identify treatment groups
  grp <- factor(<...>)

  ## create data structures
  dge <- DGEList(counts, group=grp)
  dge <- calcNormFactors(dge)

  ## filter uniformative genes
  m <- 1e6 * t(t(dge$counts) / dge$samples$lib.size)
  ridx <- rowSums(m > 1) >= 2
  dge <- dge[ridx,]

  ## comparison between groups
  design <- model.matrix( ~ grp )
  dge <- estimateCommonDisp(dge, design)
  fit <- glmFit(dge, design, dispersion=dge$common.dispersion)
  lrTest <- glmLRT(dge, fit, coef=2)
  tt <- topTags(lrTest, Inf)
  save(tt, file=file.path(dataDir, "tt.rda"))

Two sanity checks along the way are that the treatment groups are
relatively distinct in an MDS plot, and that the differentially
representation tags really are.

  plotMDS.DGEList(dge)
  sapply(rownames(tt$table)[1:4],
         function(x, grp) tapply(counts[x,], grp, mean),
         grp)

* Annotation

Some annotation information is available from the org* model organism
packages; the biomaRt package is an alternative. Here we get the sgd
ids from the top table, and use these to add gene name annotations

  library(org.Sc.sgd.db)

  ttids <- rownames(tt$table)
  sgd2gnname <- mget(ttids, org.Sc.sgdGENENAME, ifnotfound=NA)
  ttAnno <- cbind(tt$table, genename=unlist(sgd2gnname))

A next step might involve gene set analyses, as in the goseq package.

Martin


        [[alternative HTML version deleted]]

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing



--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

_______________________________________________
Bioc-sig-sequencing mailing list
Bioc-sig-sequencing@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] who have RNA-seq pipeline code

Reply via email to