Re: Reproducing RMA with Gene ST data (Was: Re: [aroma.affymetrix] Re: How do you analyze Gene ST Data?)

pwhiteusa Thu, 04 Dec 2008 13:49:53 -0800

Hi Mark,

I ran your code below on the Mouse Gene ST CDF and agree that like the
human array, the probes used by aroma.affymetrix and oligo are
identical, appart from the missing 45 control probesets I mentioned
previously (I listed them below as requested). For the Human array
were the number of probesets the same using either method?


 length(un)
[1] 35557
 length(ids)
[1] 35512
 setdiff(un,ids)
 [1] 10338032 10338034 10338021 10338023 10338061 10338028 10338012
10338052
 [9] 10338027 10338018 10338046 10338002 10338051 10338019 10338006
10338015
[17] 10338007 10338005 10338010 10338033 10338013 10338048 10338030
10338050
[25] 10338055 10338014 10338024 10338054 10338040 10338043 10338008
10338049
[33] 10338062 10338031 10338022 10338009 10338053 10338020 10338011
10338016
[41] 10338057 10338039 10338058 10338038 10338045

 matches[1:5,]
         pdInfoBuilder unsupportedCDF union
10529921            25             25    25
10423731            25             25    25
10603809            29             29    29
10486041            29             29    29
10341702             4              4     4

 table(matches[,3]-matches[,1])
    0
35512

 table(matches[,3]-matches[,2])
    0
35512

Peter

On Dec 3, 4:36 pm, Mark Robinson <[EMAIL PROTECTED]> wrote:
> Hi all.
>
> First of all, thanks Peter for 1) doing this testing and 2) for  
> spelling everything out.  I expect to refer people to this thread in  
> the future, so thanks for that.
>
> Just wanted to add 3 more tidbits of hopefully useful information.
>
> 1. I dug a bit into why flavor="oligo" doesn't work within  
> aroma.affymetrix.  It turns out it was a simple fix.  Since I don't  
> use it regularly (it doesn't give probe affinities!) and the  
> underlying 'oligo' functions had changed, it stopped working.  Its  
> corrected now.  I've checked in the fix, so flavor='oligo' will be  
> available in the next release.  In my tests, it appears VERY close to  
> 'affy' ... and since its based on 'oligo' code, it should be VERY VERY  
> similar.
>
> ...
> plm1 <- RmaPlm(csN,flavor="oligo")
> fit(plm1,verbose=verbose)
> ces <- getChipEffectSet(plm1)
> data.aroma.oligo <- getExprs(ces)
> ...
>
>  > mean( (data.affy[,1]-data.aroma.oligo[,1])^2 )
> [1] 0.0003193267
>
> 2. I dug a bit into the unsupported CDF and the 'platformDesign'  
> objects from oligo and from what I can tell, the probes used in the  
> 33252 units (I'm looking at Human) within aroma.affymetrix are  
> identical to the probes used within oligo (as built with  
> pdInfoBuilder) ... not a single probe no accounted for.  In case you  
> haven't dug into pdInfoBuilder before and the SQLite db behind, here  
> are some commands you may find useful ...
>
> -------
> library(pd.hugene.1.0.st.v1)
> library(pdInfoBuilder)
> fn <- dir("rawData/tissues/HuGene-1_0-st-v1","CEL",full=TRUE)[1:3]
> x <- read.celfiles(filenames=fn,pkgname="pd.hugene.1.0.st.v1")
> pd <-getPlatformDesign(x)
> ff <- dbGetQuery(db(pd), "select * from pmfeature")
>
> # three 3 lines speed up the splitting ...
> ffs <- split(ff, substr(ff$fsetid,1,4))
> ffs <- unlist( lapply(ffs, FUN=function(u) split(u,u$fsetid)),  
> recursive=FALSE)
> names(ffs) <- substr(names(ffs), 6,nchar(names(ffs)))
>
> cs <- AffymetrixCelSet$fromName(name, chipType=chip)
> cdf <- getCdf(cs)
> cdfCells <- readCdfUnits(getPathname(cdf), units=NULL, readXY=FALSE,  
> readBases=FALSE, readExpos=FALSE,
>                           readType=FALSE,  
> readDirection=FALSE,stratifyBy=c("pm"), readIndices=TRUE, verbose=0)
>
> un <- unique(ff$fsetid)
> ids <- intersect(un,names(cdfCells))
>
> mf <- match(ids,names(ffs))
> mc <- match(ids,names(cdfCells))
>
> matches <- matrix(NA,nr=length(ids),nc=3)
> rownames(matches) <- ids
> colnames(matches) <- c("pdInfoBuilder","unsupportedCDF","union")
>
> for(i in 1:nrow(matches)) {
>    a <- cdfCells[[ ids[i] ]]$groups[[1]]$indices
>    b <- ffs[[ ids[i] ]]$fid
>    matches[i,1] <- length(b)
>    matches[i,2] <- length(a)
>    matches[i,3] <- length(union(a,b))
>    cat(ids[i],"\n")}
>
> -------
>
> ... this gives ..
>  > matches[1:5,]
>          pdInfoBuilder unsupportedCDF union
> 7981326            27             27    27
> 8095005            42             42    42
> 8100310            10             10    10
> 7948117            15             15    15
> 8155877            25             25    25
>  > table(matches[,3]-matches[,1])
>      0
> 33252
>  > table(matches[,3]-matches[,2])
>      0
> 33252
>
> 3.  This doesn't address the problem of the missing probesets.  I'm  
> happy to go and collect these if people want them, but based on the  
> reply from Affymetrix (thanks to Mark Cowley for the leg work here),  
> they are probably 'low-coverage transcript clusters' that can be  
> 'safely ignored'.  See:
>
> http://article.gmane.org/gmane.science.biology.informatics.conductor/...
>
> SUMMARY:
>
> [aroma.affymetrix flavor='oligo'] = rma in affy, rma in oligo
>
> [aroma.affymetrix flavor='affyPLM'] = fitPLM in affyPLM
>
> ... where '=' means 'for all intents and purposes equivalent', not  
> strictly equal.
>
> Cheers,
> Mark
>
> On 04/12/2008, at 7:43 AM, Henrik Bengtsson wrote:
>
> > Hi,
>
> > thanks for sharing all this.
>
> > On Tue, Dec 2, 2008 at 11:54 AM, pwhiteusa <[EMAIL PROTECTED]>  
> > wrote:
>
> >> Hi All,
>
> >> Here is the exact code I used to analyze Gene ST data for an
> >> experiment performed with the MoGene-1_0-st-v1 array.
>
> >> AROMA.AFFYMETRIX
>
> >> library(aroma.affymetrix)
> >> cdf <- AffymetrixCdfFile$fromChipType("MoGene-1_0-st-v1",tags="r3")
> >> prefixName <- "Pierson"
> >> cs <- AffymetrixCelSet$byName(prefixName, cdf=cdf)
> >> bc <- RmaBackgroundCorrection(cs)
> >> csBC <- process(bc)
> >> qn <- QuantileNormalization(csBC, typesToUpdate="pm")
> >> csN <- process(qn)
> >> plm <- RmaPlm(csN, flavor="affyPLM")  #flavor="oligo", must library
> >> (oligo)
> >> fit(plm)
> >> ces <- getChipEffectSet(plm)
> >> getExprs <- function(ces, ...) {
> >> cdf <- getCdf(ces)
> >> theta <- extractMatrix(ces, ..., returnUgcMap=TRUE)
> >> ugcMap <- attr(theta, "unitGroupCellMap")
> >> un<-getUnitNames(cdf, ugcMap[,"unit"])
> >> rownames(theta) <- un
> >> log2(theta)
> >> }
> >> data.aroma <- getExprs(ces)
>
> > I think that getExprs() call can be replaced by:
>
> > data.aroma <- extractDataFrame(ces, addNames=TRUE)
>
> > The difference is that the unit names will be in a separate column and
> > not as row names. You will also get group names and more, but those
> > you can drop if you want to.
>
> > Mark, is it correct that we can "deprecate" the suggestion to use
> > getExprs()?  Btw, is this documented somewhere online, or is this
> > knowledge only from the mailing list?
>
> >> Easy! Now to get the same data using the Affy packages:
>
> >> BIOCONDUCTOR AFFY
>
> >> You first need to create or download your mogene10stv1cdf library  
> >> from
> >> the Affy unsupported CDF file (https://stat.ethz.ch/pipermail/bioc-
> >> devel/2007-October/001403.html has some detail on how to do this).
> >> However, as Mark Robinson pointed out there are potential issues with
> >> using the Affy unsupported CDF files. See the following for some
> >> details:
>
> >>https://stat.ethz.ch/pipermail/bioconductor/2007-November/020188.html
>
> >> library(affy)
> >> AffyRaw <- ReadAffy()
> >> AffyEset <- rma(AffyRaw)
> >> data.affy <- exprs(AffyEset)
>
> >> BIOCONDUCTOR OLIGO
>
> >> Download all the required Affy annotation files to your Mouse Gene v1
> >> ST array directory:
>
> >>http://www.affymetrix.com/support/technical/byproduct.affx?product=mo...
>
> >> setwd("P:\\ANNOTATION\\AffyAnnotation\\Mouse\\MoGene-1_0-st-v1")
> >> library(pdInfoBuilder)
> >> pgfFile <- "MoGene-1_0-st-v1.r3.pgf"
> >> clfFile <- "MoGene-1_0-st-v1.r3.clf"
> >> transFile <- "MoGene-1_0-st-v1.na26.mm9.transcript.csv"
> >> probeFile <- "MoGene-1_0-st-v1.probe.tab"
> >> pkg <- new("AffyGenePDInfoPkgSeed", author="Peter White",
> >> email="[EMAIL PROTECTED]", version="0.1.3",
> >> genomebuild="UCSC mm9,  July 2007", chipName="MoGene10stv1",
> >> manufacturer="affymetrix", biocViews="AnnotationData",
> >> pgfFile=pgfFile, clfFile=clfFile, transFile=transFile,
> >> probeFile=probeFile)
> >> makePdInfoPackage(pkg, destDir=".")
>
> >> #This takes a little while to make the Package. Once created you will
> >> need to install the package from the Windows DOS prompt (navigate to
> >> the annotation directory with the newly created pd package to be
> >> installed):
>
> >> R CMD INSTALL pd.mogene.1.0.st.v1\
>
> >> Note for this to work you need RTools and you Path variable set up
> >> correctly as described at:
>
> >>http://cran.r-project.org/doc/manuals/R-admin.html#The-Windows-toolset)
>
> >> Now return to R, set the working directory to your CEL file  
> >> directory:
>
> >> library(pd.mogene.1.0.st.v1)
> >> library(oligo)
> >> OligoRaw<-read.celfiles(filenames=list.celfiles())
> >> OligoEset<-rma(OligoRaw)
> >> data.oligo<-exprs(OligoEset)
>
> >> COMPARING THE TWO DATASETS
>
> >> Here is what I did to compare the data generate by affy, oligo and
> >> aroma.affymetrix:
>
> >> dim(data.aroma)
> >> [1] 35512    16
> >> dim(data.affy)
> >> [1] 35512    16
> >> length(grep(TRUE, rownames(data.affy)==rownames(data.aroma)))
> >> [1] 35512
>
> > FYI, sum(rownames(data.affy)==rownames(data.aroma)) gives you the
> > same.  Replacing sum() with summary() will also work.
>
> >> The output from both the affy rma and aroma.affymetrix methods  
> >> retains
> >> the same order of probes and cel files so the two files can be
> >> compared directly.
>
> > That is probably because they work of the same CDF, but you should
> > never rely on this/assume that this is always the case.  If you do,
> > you should at least verify that the unit names (and group names)
> > match.
>
> >> However,
>
> >> dim(data.oligo)
> >> [1] 35557    16
>
> >> The normalized data file from the Oligo package includes an  
> >> additional
> >> 45 Transcript IDs (there's no annotation on what these are but they
> >> contain anywhere from 9 to 489 probes per probeset).
>
> > For the record, would you mind posting the names of these "additional"
> > 45 units here?  (I'm sure someone else will search the web later and
> > find this thread very helpful).
>
> >> Fixed this problem as follows:
>
> >> o <- match(rownames(data.aroma), rownames(data.oligo))
> >> data.oligo <- data.oligo[o,]
>
> >>> length(grep(TRUE, rownames(data.affy)==rownames(data.oligo)))
> >> [1] 35512
> >>> length(grep(TRUE, rownames(data.aroma)==rownames(data.oligo)))
> >> [1] 35512
>
> >> Finally, there was one more issue with the aroma data. All elements  
> >> in
> >> the 18th row of the dataset were flagged Na. This transcript ID for
> >> this probeset was 10338063. Looking at the Affy annotation this
> >> appears to be a control probeset with 6,515 probes. Could it have  
> >> been
> >> flagged Na by aroma.affymetrix becuase of this (it was OK with the
> >> oligo and affy rma analyses)??
>
> > Nicely spotted.  Voila'.  From aroma.affymetrix's NEWS file:
>
> > Version: 0.9.0 [2008-02-29]
> > o TIME OPTIMIZATION: Now RmaPlm and ExonRmaPlm turn to median polish
> >  if there are more than 500 cells *and* 6 arrays in the unit group.
> >  Option: aroma.affymetrix.settings$models$RmaPlm
> > $medianPolishThreshold.
> >  Moreover, if the unit group is ridiculously large (5000
>
> ...
>
> read more »
--~--~---------~--~----~------------~-------~--~----~
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/aroma-affymetrix?hl=en
-~----------~----~----~----~------~----~------~--~---

Re: Reproducing RMA with Gene ST data (Was: Re: [aroma.affymetrix] Re: How do you analyze Gene ST Data?)

Reply via email to