Re: Reproducing RMA with Gene ST data (Was: Re: [aroma.affymetrix] Re: How do you analyze Gene ST Data?)

Mark Robinson Wed, 03 Dec 2008 13:36:40 -0800

Hi all.

First of all, thanks Peter for 1) doing this testing and 2) for  
spelling everything out.  I expect to refer people to this thread in  
the future, so thanks for that.


Just wanted to add 3 more tidbits of hopefully useful information.

1. I dug a bit into why flavor="oligo" doesn't work within  
aroma.affymetrix.  It turns out it was a simple fix.  Since I don't  
use it regularly (it doesn't give probe affinities!) and the  
underlying 'oligo' functions had changed, it stopped working.  Its  
corrected now.  I've checked in the fix, so flavor='oligo' will be  
available in the next release.  In my tests, it appears VERY close to  
'affy' ... and since its based on 'oligo' code, it should be VERY VERY  
similar.

...
plm1 <- RmaPlm(csN,flavor="oligo")
fit(plm1,verbose=verbose)
ces <- getChipEffectSet(plm1)
data.aroma.oligo <- getExprs(ces)
...

 > mean( (data.affy[,1]-data.aroma.oligo[,1])^2 )
[1] 0.0003193267

2. I dug a bit into the unsupported CDF and the 'platformDesign'  
objects from oligo and from what I can tell, the probes used in the  
33252 units (I'm looking at Human) within aroma.affymetrix are  
identical to the probes used within oligo (as built with  
pdInfoBuilder) ... not a single probe no accounted for.  In case you  
haven't dug into pdInfoBuilder before and the SQLite db behind, here  
are some commands you may find useful ...

-------
library(pd.hugene.1.0.st.v1)
library(pdInfoBuilder)
fn <- dir("rawData/tissues/HuGene-1_0-st-v1","CEL",full=TRUE)[1:3]
x <- read.celfiles(filenames=fn,pkgname="pd.hugene.1.0.st.v1")
pd <-getPlatformDesign(x)
ff <- dbGetQuery(db(pd), "select * from pmfeature")

# three 3 lines speed up the splitting ...
ffs <- split(ff, substr(ff$fsetid,1,4))
ffs <- unlist( lapply(ffs, FUN=function(u) split(u,u$fsetid)),  
recursive=FALSE)
names(ffs) <- substr(names(ffs), 6,nchar(names(ffs)))

cs <- AffymetrixCelSet$fromName(name, chipType=chip)
cdf <- getCdf(cs)
cdfCells <- readCdfUnits(getPathname(cdf), units=NULL, readXY=FALSE,  
readBases=FALSE, readExpos=FALSE,
                          readType=FALSE,  
readDirection=FALSE,stratifyBy=c("pm"), readIndices=TRUE, verbose=0)

un <- unique(ff$fsetid)
ids <- intersect(un,names(cdfCells))

mf <- match(ids,names(ffs))
mc <- match(ids,names(cdfCells))

matches <- matrix(NA,nr=length(ids),nc=3)
rownames(matches) <- ids
colnames(matches) <- c("pdInfoBuilder","unsupportedCDF","union")

for(i in 1:nrow(matches)) {
   a <- cdfCells[[ ids[i] ]]$groups[[1]]$indices
   b <- ffs[[ ids[i] ]]$fid
   matches[i,1] <- length(b)
   matches[i,2] <- length(a)
   matches[i,3] <- length(union(a,b))
   cat(ids[i],"\n")
}
-------

... this gives ..
 > matches[1:5,]
         pdInfoBuilder unsupportedCDF union
7981326            27             27    27
8095005            42             42    42
8100310            10             10    10
7948117            15             15    15
8155877            25             25    25
 > table(matches[,3]-matches[,1])
     0
33252
 > table(matches[,3]-matches[,2])
     0
33252

3.  This doesn't address the problem of the missing probesets.  I'm  
happy to go and collect these if people want them, but based on the  
reply from Affymetrix (thanks to Mark Cowley for the leg work here),  
they are probably 'low-coverage transcript clusters' that can be  
'safely ignored'.  See:

http://article.gmane.org/gmane.science.biology.informatics.conductor/19738


SUMMARY:

[aroma.affymetrix flavor='oligo'] = rma in affy, rma in oligo

[aroma.affymetrix flavor='affyPLM'] = fitPLM in affyPLM

... where '=' means 'for all intents and purposes equivalent', not  
strictly equal.

Cheers,
Mark





On 04/12/2008, at 7:43 AM, Henrik Bengtsson wrote:

> Hi,
>
> thanks for sharing all this.
>
> On Tue, Dec 2, 2008 at 11:54 AM, pwhiteusa <[EMAIL PROTECTED]>  
> wrote:
>>
>> Hi All,
>>
>> Here is the exact code I used to analyze Gene ST data for an
>> experiment performed with the MoGene-1_0-st-v1 array.
>>
>> AROMA.AFFYMETRIX
>>
>> library(aroma.affymetrix)
>> cdf <- AffymetrixCdfFile$fromChipType("MoGene-1_0-st-v1",tags="r3")
>> prefixName <- "Pierson"
>> cs <- AffymetrixCelSet$byName(prefixName, cdf=cdf)
>> bc <- RmaBackgroundCorrection(cs)
>> csBC <- process(bc)
>> qn <- QuantileNormalization(csBC, typesToUpdate="pm")
>> csN <- process(qn)
>> plm <- RmaPlm(csN, flavor="affyPLM")  #flavor="oligo", must library
>> (oligo)
>> fit(plm)
>> ces <- getChipEffectSet(plm)
>> getExprs <- function(ces, ...) {
>> cdf <- getCdf(ces)
>> theta <- extractMatrix(ces, ..., returnUgcMap=TRUE)
>> ugcMap <- attr(theta, "unitGroupCellMap")
>> un<-getUnitNames(cdf, ugcMap[,"unit"])
>> rownames(theta) <- un
>> log2(theta)
>> }
>> data.aroma <- getExprs(ces)
>
> I think that getExprs() call can be replaced by:
>
> data.aroma <- extractDataFrame(ces, addNames=TRUE)
>
> The difference is that the unit names will be in a separate column and
> not as row names. You will also get group names and more, but those
> you can drop if you want to.
>
> Mark, is it correct that we can "deprecate" the suggestion to use
> getExprs()?  Btw, is this documented somewhere online, or is this
> knowledge only from the mailing list?
>
>>
>> Easy! Now to get the same data using the Affy packages:
>>
>>
>> BIOCONDUCTOR AFFY
>>
>> You first need to create or download your mogene10stv1cdf library  
>> from
>> the Affy unsupported CDF file (https://stat.ethz.ch/pipermail/bioc-
>> devel/2007-October/001403.html has some detail on how to do this).
>> However, as Mark Robinson pointed out there are potential issues with
>> using the Affy unsupported CDF files. See the following for some
>> details:
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2007-November/020188.html
>>
>> library(affy)
>> AffyRaw <- ReadAffy()
>> AffyEset <- rma(AffyRaw)
>> data.affy <- exprs(AffyEset)
>>
>>
>> BIOCONDUCTOR OLIGO
>>
>> Download all the required Affy annotation files to your Mouse Gene v1
>> ST array directory:
>>
>> http://www.affymetrix.com/support/technical/byproduct.affx?product=mogene-1_0-st-v1
>>
>> setwd("P:\\ANNOTATION\\AffyAnnotation\\Mouse\\MoGene-1_0-st-v1")
>> library(pdInfoBuilder)
>> pgfFile <- "MoGene-1_0-st-v1.r3.pgf"
>> clfFile <- "MoGene-1_0-st-v1.r3.clf"
>> transFile <- "MoGene-1_0-st-v1.na26.mm9.transcript.csv"
>> probeFile <- "MoGene-1_0-st-v1.probe.tab"
>> pkg <- new("AffyGenePDInfoPkgSeed", author="Peter White",
>> email="[EMAIL PROTECTED]", version="0.1.3",
>> genomebuild="UCSC mm9,  July 2007", chipName="MoGene10stv1",
>> manufacturer="affymetrix", biocViews="AnnotationData",
>> pgfFile=pgfFile, clfFile=clfFile, transFile=transFile,
>> probeFile=probeFile)
>> makePdInfoPackage(pkg, destDir=".")
>>
>> #This takes a little while to make the Package. Once created you will
>> need to install the package from the Windows DOS prompt (navigate to
>> the annotation directory with the newly created pd package to be
>> installed):
>>
>> R CMD INSTALL pd.mogene.1.0.st.v1\
>>
>> Note for this to work you need RTools and you Path variable set up
>> correctly as described at:
>>
>> http://cran.r-project.org/doc/manuals/R-admin.html#The-Windows-toolset)
>>
>> Now return to R, set the working directory to your CEL file  
>> directory:
>>
>> library(pd.mogene.1.0.st.v1)
>> library(oligo)
>> OligoRaw<-read.celfiles(filenames=list.celfiles())
>> OligoEset<-rma(OligoRaw)
>> data.oligo<-exprs(OligoEset)
>>
>>
>> COMPARING THE TWO DATASETS
>>
>> Here is what I did to compare the data generate by affy, oligo and
>> aroma.affymetrix:
>>
>> dim(data.aroma)
>> [1] 35512    16
>> dim(data.affy)
>> [1] 35512    16
>> length(grep(TRUE, rownames(data.affy)==rownames(data.aroma)))
>> [1] 35512
>
> FYI, sum(rownames(data.affy)==rownames(data.aroma)) gives you the
> same.  Replacing sum() with summary() will also work.
>
>>
>> The output from both the affy rma and aroma.affymetrix methods  
>> retains
>> the same order of probes and cel files so the two files can be
>> compared directly.
>
> That is probably because they work of the same CDF, but you should
> never rely on this/assume that this is always the case.  If you do,
> you should at least verify that the unit names (and group names)
> match.
>
>> However,
>>
>> dim(data.oligo)
>> [1] 35557    16
>>
>> The normalized data file from the Oligo package includes an  
>> additional
>> 45 Transcript IDs (there's no annotation on what these are but they
>> contain anywhere from 9 to 489 probes per probeset).
>
> For the record, would you mind posting the names of these "additional"
> 45 units here?  (I'm sure someone else will search the web later and
> find this thread very helpful).
>
>> Fixed this problem as follows:
>>
>> o <- match(rownames(data.aroma), rownames(data.oligo))
>> data.oligo <- data.oligo[o,]
>>
>>> length(grep(TRUE, rownames(data.affy)==rownames(data.oligo)))
>> [1] 35512
>>> length(grep(TRUE, rownames(data.aroma)==rownames(data.oligo)))
>> [1] 35512
>>
>> Finally, there was one more issue with the aroma data. All elements  
>> in
>> the 18th row of the dataset were flagged Na. This transcript ID for
>> this probeset was 10338063. Looking at the Affy annotation this
>> appears to be a control probeset with 6,515 probes. Could it have  
>> been
>> flagged Na by aroma.affymetrix becuase of this (it was OK with the
>> oligo and affy rma analyses)??
>
> Nicely spotted.  Voila'.  From aroma.affymetrix's NEWS file:
>
> Version: 0.9.0 [2008-02-29]
> o TIME OPTIMIZATION: Now RmaPlm and ExonRmaPlm turn to median polish
>  if there are more than 500 cells *and* 6 arrays in the unit group.
>  Option: aroma.affymetrix.settings$models$RmaPlm 
> $medianPolishThreshold.
>  Moreover, if the unit group is ridiculously large (5000 cells), the
>  unit group is skipped and all returned estimates are NAs.
>  Option: aroma.affymetrix.settings$models$RmaPlm$skipThreshold.
>
>>
>> e<- (data.aroma - data.affy)
>>> mean(as.vector(e^2), na.rm=T)
>> [1] 0.1253547
>>> sd(as.vector(e^2), na.rm=T)
>> [1] 0.2717275
>>
>> e <- (data.aroma - data.oligo)
>>> mean(as.vector(e^2), na.rm=T)
>> [1] 0.1239203
>>> sd(as.vector(e^2), na.rm=T)
>> [1] 0.2653593
>>
>> As you can see the data does not pass your mean and sd cutoffs of
>> <0.0001.
>>
>> e<- (data.affy - data.oligo)
>>> mean(as.vector(e^2), na.rm=T)
>> [1] 0.001484371
>>> sd(as.vector(e^2), na.rm=T)
>> [1] 0.002523521
>>
>> The difference between the affy and oligo analysis is much less
>> striking. To visualize these differences I did the following plot, as
>> an example I am just showing the data from the first array but it is
>> reflective of all 16 arrays:
>>
>> plot(data.aroma[,1],data.affy[,1],main="Comparison of Aroma and Affy
>> Data",col="red",cex=0.5)
>> abline(0,1, lwd=2)
>>
>> plot(data.aroma[,1],data.oligo[,1],main="Comparison of Aroma and  
>> Oligo
>> Data",col="red",cex=0.5)
>> abline(0,1, lwd=2)
>>
>> plot(data.affy[,1],data.oligo[,1],main="Comparison of Affy and Oligo
>> Data",col="red",cex=0.5)
>> abline(0,1, lwd=2)
>>
>> The group permissions do not permit users to upload files to share
>> with the group so I have emailed these plots to Henrik to upload.
>
> FYI, it is possible to attach files to message to the aroma.affymetrix
> mailing list.  We had a lot of spam so we had to restrict the
> uploading/editing of pages to moderators only.  Until there is a page
> online for this, I attach these PNG plots to this message.  Btw, would
> you mind putting these notes/figures in a Google Document?  Then we
> can link to that page on the webpage, and you can update whenever you
> find out about any other differences.
>
>> You will see that although the oligo and affy data is very similar,  
>> the
>> aroma data differs significantly from either of the other two methods
>> and the resultant plots are very different from the QC Plots you
>> posted at:
>>
>> http://groups.google.com/group/aroma-affymetrix/web/redundancy-tests
>
> Yes, it is clear that affy and oligo generate very similar results -
> and I believe this is because they are using more or less the same
> internal code - I think a lot of the code in oligo was cut'n'pasted
> from affy and the slightly modified, but I might be wrong.
>
> As Mark already pointed out, one difference is that aroma.affymetrix
> uses a robust linear regression algorithm from affyPLM to fit the
> probe-level model, whereas affy/oligo uses a median polish algorithm.
> However, this should give such large differences.
>
> RmaBackgroundCorrection should be very similar to affy.  It was Ken
> Simpson who implemented this, but if I recall it correctly, he ported
> it from the affy code base.
>
> It could be a difference in how the quantile normalization is done.
> QuantileNormalization in aroma.affymetrix is also a rank-based density
> normalization method and they originate from very similar code bases.
> I don't remember all the details of hand, but it uses aroma.light,
> which is quite similar to what limma does, which in turn comes from
> affy originally.  I don't think the affy method has changed that much
> over the years.  Sorry, if I repeat myself (I cannot keep track of
> where I said what), but you should also have a look at Vignette
> 'Empirical probe-signal densities and rank-based quantile
> normalization', which illustrates how normalization based on different
> sets of probes affect the outcome.
>
> Does the MoGene-1_0-st-v1,r3 CDF have MMs?  If not, how is affy/oligo
> dealing with this and are they for sure doing PM-only quantile
> normalization?  Doing QuantileNormalization(csBC, typesToUpdate="pm")
> will have no problem, so that will be the same regardless.
>
>>
>>
>> SUMMING UP
>>
>> As I said in my original post I am not certain how concerned I should
>> be about this difference, but on the surface the aroma.affymetrix
>> approach appears to call more control probes differentially expressed
>> in my data set than are called using the other approaches.
>>
>> One suggestion was that it might be a difference in the way that affy
>> and aroma process the CDF file, but hopefully I have shown by using
>> the oligo apporach that the CDF may not be the source for the
>> difference.
>>
>> In the example you give on the redundancy-tests you are using  
>> affyPLM.
>> However, if I attempt to use affyPLM with Gene ST data it throws a
>> memory allocation error (see my thread on the bioconductor mailing
>> list for details):
>>
>> https://stat.ethz.ch/pipermail/bioconductor/2008-December/025369.html
>
> OK.
>
> It would be useful to do a redundancy comparison for HG-U133_Plus_2
> based also on affy (and oligo).  I think this is the quickest path to
> understanding any differences.   Is that something you are willing to
> do?  (Again, a lot of people will find this useful, because that will
> show how affy/oligo and affyPLM differ, regardless of
> aroma.affymetrix).
>
>>
>> Mark suggested that fitPLM may not be able to handle some of the
>> incredibly large control probesets on the Gene ST arrays. So without
>> being able to run affyPLM it is hard to say if that is the source of
>> the difference, but what I can find out online is that the rma and
>> affyPLM functions should not return normalized data that is so
>> different. So my concern is that it is something more specific with
>> the Gene ST arrays and the way in which arom.affymetrix processes  
>> them
>> vs. the other packages I described. It might well be that aroma is  
>> the
>> optimal way, but I think it would be good to have a better
>> understanding of these differences.
>
> Hopefully we'll figure it out.
>
> As a final note for your information: The aroma.affymetrix
> algorithms/implementation has undergone a lot of testing and has a lot
> of mileage.  That is, although I will never claim anything to be bug
> free, you can at least trust that it is not just a quick "write up".
>
> Thank you!
>
> Henrik
>
>>
>> Thanks,
>>
>> Peter
>>
>>>
>>
>
> >
> < 
> Affy_vs_Oligo_MoGeneST_Data 
> .png 
> ><Aroma_vs_Affy_MoGeneST_Data.png><Aroma_vs_Oligo_MoGeneST_Data.png>

------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: [EMAIL PROTECTED]
e: [EMAIL PROTECTED]
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------





--~--~---------~--~----~------------~-------~--~----~
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/aroma-affymetrix?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Reproducing RMA with Gene ST data (Was: Re: [aroma.affymetrix] Re: How do you analyze Gene ST Data?)

Reply via email to