Re: Reproducing RMA with Gene ST data (Was: Re: [aroma.affymetrix] Re: How do you analyze Gene ST Data?)

Henrik Bengtsson Wed, 03 Dec 2008 12:43:13 -0800

Hi,

thanks for sharing all this.

On Tue, Dec 2, 2008 at 11:54 AM, pwhiteusa <[EMAIL PROTECTED]> wrote:
>
> Hi All,
>
> Here is the exact code I used to analyze Gene ST data for an
> experiment performed with the MoGene-1_0-st-v1 array.
>
> AROMA.AFFYMETRIX
>
> library(aroma.affymetrix)
> cdf <- AffymetrixCdfFile$fromChipType("MoGene-1_0-st-v1",tags="r3")
> prefixName <- "Pierson"
> cs <- AffymetrixCelSet$byName(prefixName, cdf=cdf)
> bc <- RmaBackgroundCorrection(cs)
> csBC <- process(bc)
> qn <- QuantileNormalization(csBC, typesToUpdate="pm")
> csN <- process(qn)
> plm <- RmaPlm(csN, flavor="affyPLM")  #flavor="oligo", must library
> (oligo)
> fit(plm)
> ces <- getChipEffectSet(plm)
> getExprs <- function(ces, ...) {
>  cdf <- getCdf(ces)
>  theta <- extractMatrix(ces, ..., returnUgcMap=TRUE)
>  ugcMap <- attr(theta, "unitGroupCellMap")
>  un<-getUnitNames(cdf, ugcMap[,"unit"])
>  rownames(theta) <- un
>  log2(theta)
> }
> data.aroma <- getExprs(ces)

I think that getExprs() call can be replaced by:

data.aroma <- extractDataFrame(ces, addNames=TRUE)

The difference is that the unit names will be in a separate column and
not as row names. You will also get group names and more, but those
you can drop if you want to.

Mark, is it correct that we can "deprecate" the suggestion to use
getExprs()?  Btw, is this documented somewhere online, or is this
knowledge only from the mailing list?

>
> Easy! Now to get the same data using the Affy packages:
>
>
> BIOCONDUCTOR AFFY
>
> You first need to create or download your mogene10stv1cdf library from
> the Affy unsupported CDF file (https://stat.ethz.ch/pipermail/bioc-
> devel/2007-October/001403.html has some detail on how to do this).
> However, as Mark Robinson pointed out there are potential issues with
> using the Affy unsupported CDF files. See the following for some
> details:
>
> https://stat.ethz.ch/pipermail/bioconductor/2007-November/020188.html
>
> library(affy)
> AffyRaw <- ReadAffy()
> AffyEset <- rma(AffyRaw)
> data.affy <- exprs(AffyEset)
>
>
> BIOCONDUCTOR OLIGO
>
> Download all the required Affy annotation files to your Mouse Gene v1
> ST array directory:
>
> http://www.affymetrix.com/support/technical/byproduct.affx?product=mogene-1_0-st-v1
>
> setwd("P:\\ANNOTATION\\AffyAnnotation\\Mouse\\MoGene-1_0-st-v1")
> library(pdInfoBuilder)
> pgfFile <- "MoGene-1_0-st-v1.r3.pgf"
> clfFile <- "MoGene-1_0-st-v1.r3.clf"
> transFile <- "MoGene-1_0-st-v1.na26.mm9.transcript.csv"
> probeFile <- "MoGene-1_0-st-v1.probe.tab"
> pkg <- new("AffyGenePDInfoPkgSeed", author="Peter White",
> email="[EMAIL PROTECTED]", version="0.1.3",
> genomebuild="UCSC mm9,  July 2007", chipName="MoGene10stv1",
> manufacturer="affymetrix", biocViews="AnnotationData",
> pgfFile=pgfFile, clfFile=clfFile, transFile=transFile,
> probeFile=probeFile)
> makePdInfoPackage(pkg, destDir=".")
>
> #This takes a little while to make the Package. Once created you will
> need to install the package from the Windows DOS prompt (navigate to
> the annotation directory with the newly created pd package to be
> installed):
>
> R CMD INSTALL pd.mogene.1.0.st.v1\
>
> Note for this to work you need RTools and you Path variable set up
> correctly as described at:
>
> http://cran.r-project.org/doc/manuals/R-admin.html#The-Windows-toolset)
>
> Now return to R, set the working directory to your CEL file directory:
>
> library(pd.mogene.1.0.st.v1)
> library(oligo)
> OligoRaw<-read.celfiles(filenames=list.celfiles())
> OligoEset<-rma(OligoRaw)
> data.oligo<-exprs(OligoEset)
>
>
> COMPARING THE TWO DATASETS
>
> Here is what I did to compare the data generate by affy, oligo and
> aroma.affymetrix:
>
> dim(data.aroma)
> [1] 35512    16
> dim(data.affy)
> [1] 35512    16
> length(grep(TRUE, rownames(data.affy)==rownames(data.aroma)))
> [1] 35512

FYI, sum(rownames(data.affy)==rownames(data.aroma)) gives you the
same.  Replacing sum() with summary() will also work.

>
> The output from both the affy rma and aroma.affymetrix methods retains
> the same order of probes and cel files so the two files can be
> compared directly.

That is probably because they work of the same CDF, but you should
never rely on this/assume that this is always the case.  If you do,
you should at least verify that the unit names (and group names)
match.

> However,
>
> dim(data.oligo)
> [1] 35557    16
>
> The normalized data file from the Oligo package includes an additional
> 45 Transcript IDs (there's no annotation on what these are but they
> contain anywhere from 9 to 489 probes per probeset).

For the record, would you mind posting the names of these "additional"
45 units here?  (I'm sure someone else will search the web later and
find this thread very helpful).

> Fixed this problem as follows:
>
> o <- match(rownames(data.aroma), rownames(data.oligo))
> data.oligo <- data.oligo[o,]
>
>> length(grep(TRUE, rownames(data.affy)==rownames(data.oligo)))
> [1] 35512
>> length(grep(TRUE, rownames(data.aroma)==rownames(data.oligo)))
> [1] 35512
>
> Finally, there was one more issue with the aroma data. All elements in
> the 18th row of the dataset were flagged Na. This transcript ID for
> this probeset was 10338063. Looking at the Affy annotation this
> appears to be a control probeset with 6,515 probes. Could it have been
> flagged Na by aroma.affymetrix becuase of this (it was OK with the
> oligo and affy rma analyses)??

Nicely spotted.  Voila'.  From aroma.affymetrix's NEWS file:

Version: 0.9.0 [2008-02-29]
o TIME OPTIMIZATION: Now RmaPlm and ExonRmaPlm turn to median polish
  if there are more than 500 cells *and* 6 arrays in the unit group.
  Option: aroma.affymetrix.settings$models$RmaPlm$medianPolishThreshold.
  Moreover, if the unit group is ridiculously large (5000 cells), the
  unit group is skipped and all returned estimates are NAs.
  Option: aroma.affymetrix.settings$models$RmaPlm$skipThreshold.

>
> e<- (data.aroma - data.affy)
>> mean(as.vector(e^2), na.rm=T)
> [1] 0.1253547
>> sd(as.vector(e^2), na.rm=T)
> [1] 0.2717275
>
> e <- (data.aroma - data.oligo)
>> mean(as.vector(e^2), na.rm=T)
> [1] 0.1239203
>> sd(as.vector(e^2), na.rm=T)
> [1] 0.2653593
>
> As you can see the data does not pass your mean and sd cutoffs of
> <0.0001.
>
> e<- (data.affy - data.oligo)
>> mean(as.vector(e^2), na.rm=T)
> [1] 0.001484371
>> sd(as.vector(e^2), na.rm=T)
> [1] 0.002523521
>
> The difference between the affy and oligo analysis is much less
> striking. To visualize these differences I did the following plot, as
> an example I am just showing the data from the first array but it is
> reflective of all 16 arrays:
>
> plot(data.aroma[,1],data.affy[,1],main="Comparison of Aroma and Affy
> Data",col="red",cex=0.5)
> abline(0,1, lwd=2)
>
> plot(data.aroma[,1],data.oligo[,1],main="Comparison of Aroma and Oligo
> Data",col="red",cex=0.5)
> abline(0,1, lwd=2)
>
> plot(data.affy[,1],data.oligo[,1],main="Comparison of Affy and Oligo
> Data",col="red",cex=0.5)
> abline(0,1, lwd=2)
>
> The group permissions do not permit users to upload files to share
> with the group so I have emailed these plots to Henrik to upload.

FYI, it is possible to attach files to message to the aroma.affymetrix
mailing list.  We had a lot of spam so we had to restrict the
uploading/editing of pages to moderators only.  Until there is a page
online for this, I attach these PNG plots to this message.  Btw, would
you mind putting these notes/figures in a Google Document?  Then we
can link to that page on the webpage, and you can update whenever you
find out about any other differences.

> You will see that although the oligo and affy data is very similar, the
> aroma data differs significantly from either of the other two methods
> and the resultant plots are very different from the QC Plots you
> posted at:
>
> http://groups.google.com/group/aroma-affymetrix/web/redundancy-tests

Yes, it is clear that affy and oligo generate very similar results -
and I believe this is because they are using more or less the same
internal code - I think a lot of the code in oligo was cut'n'pasted
from affy and the slightly modified, but I might be wrong.

As Mark already pointed out, one difference is that aroma.affymetrix
uses a robust linear regression algorithm from affyPLM to fit the
probe-level model, whereas affy/oligo uses a median polish algorithm.
However, this should give such large differences.

RmaBackgroundCorrection should be very similar to affy.  It was Ken
Simpson who implemented this, but if I recall it correctly, he ported
it from the affy code base.

It could be a difference in how the quantile normalization is done.
QuantileNormalization in aroma.affymetrix is also a rank-based density
normalization method and they originate from very similar code bases.
I don't remember all the details of hand, but it uses aroma.light,
which is quite similar to what limma does, which in turn comes from
affy originally.  I don't think the affy method has changed that much
over the years.  Sorry, if I repeat myself (I cannot keep track of
where I said what), but you should also have a look at Vignette
'Empirical probe-signal densities and rank-based quantile
normalization', which illustrates how normalization based on different
sets of probes affect the outcome.

Does the MoGene-1_0-st-v1,r3 CDF have MMs?  If not, how is affy/oligo
dealing with this and are they for sure doing PM-only quantile
normalization?  Doing QuantileNormalization(csBC, typesToUpdate="pm")
will have no problem, so that will be the same regardless.

>
>
> SUMMING UP
>
> As I said in my original post I am not certain how concerned I should
> be about this difference, but on the surface the aroma.affymetrix
> approach appears to call more control probes differentially expressed
> in my data set than are called using the other approaches.
>
> One suggestion was that it might be a difference in the way that affy
> and aroma process the CDF file, but hopefully I have shown by using
> the oligo apporach that the CDF may not be the source for the
> difference.
>
> In the example you give on the redundancy-tests you are using affyPLM.
> However, if I attempt to use affyPLM with Gene ST data it throws a
> memory allocation error (see my thread on the bioconductor mailing
> list for details):
>
> https://stat.ethz.ch/pipermail/bioconductor/2008-December/025369.html

OK.

It would be useful to do a redundancy comparison for HG-U133_Plus_2
based also on affy (and oligo).  I think this is the quickest path to
understanding any differences.   Is that something you are willing to
do?  (Again, a lot of people will find this useful, because that will
show how affy/oligo and affyPLM differ, regardless of
aroma.affymetrix).

>
> Mark suggested that fitPLM may not be able to handle some of the
> incredibly large control probesets on the Gene ST arrays. So without
> being able to run affyPLM it is hard to say if that is the source of
> the difference, but what I can find out online is that the rma and
> affyPLM functions should not return normalized data that is so
> different. So my concern is that it is something more specific with
> the Gene ST arrays and the way in which arom.affymetrix processes them
> vs. the other packages I described. It might well be that aroma is the
> optimal way, but I think it would be good to have a better
> understanding of these differences.

Hopefully we'll figure it out.

As a final note for your information: The aroma.affymetrix
algorithms/implementation has undergone a lot of testing and has a lot
of mileage.  That is, although I will never claim anything to be bug
free, you can at least trust that it is not just a quick "write up".

Thank you!

Henrik

>
> Thanks,
>
> Peter
>
> >
>

--~--~---------~--~----~------------~-------~--~----~
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.

You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/aroma-affymetrix?hl=en
-~----------~----~----~----~------~----~------~--~---

<<inline: Affy_vs_Oligo_MoGeneST_Data.png>>

<<inline: Aroma_vs_Affy_MoGeneST_Data.png>>

<<inline: Aroma_vs_Oligo_MoGeneST_Data.png>>

Re: Reproducing RMA with Gene ST data (Was: Re: [aroma.affymetrix] Re: How do you analyze Gene ST Data?)

Reply via email to