[aroma.affymetrix] Re: Gene-Level Summarization of Expression Data

Randy Gobbel Mon, 18 Jan 2010 10:15:18 -0800

Aha. I have just discovered that the actual name of the CDF file makes
a difference.  Changing the CDF name back to Hs133P_Hs_REFSEQ.cdf
makes it work without errors. I'm guessing that using a well-known
name like "HG-U133_Plus_2" causes it to use Bioconductor's default
version, in some cases.  Am I right?


-Randy
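
For reference, the numbers in the error message can be checked against the getCdf() printouts quoted below. This is only a sketch in plain R (no aroma.affymetrix calls). The variable names are made up for illustration, and the figures are copied from this thread. That the allowed range [1,30625] comes from the monocell CDF's cell count is an assumption read off the printouts, not something confirmed in the package source.

```r
# Figures copied from the printouts and error message in this thread.
full_dim      <- c(1164, 1164)  # "Dimension" of HG-U133_Plus_2.CDF
mono_dim      <- c(175, 175)    # "Dimension" of HG-U133_Plus_2,monocell.CDF
requested_max <- 54675          # upper end of the requested index range

# Cells = rows x columns of the chip layout.
full_cells <- prod(full_dim)
mono_cells <- prod(mono_dim)

# Both products agree with the "Number of cells" lines in the printouts.
stopifnot(full_cells == 1354896)
stopifnot(mono_cells == 30625)

# The failing check, restated: requested indices must lie in [1, mono_cells].
indices_ok <- requested_max <= mono_cells
print(indices_ok)
```

If the assumption holds, the requested indices were built from a different CDF than the one behind the chip-effect files, which would fit Mark's guess about a mix of CDF files.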

On Jan 18, 9:47 am, Randy Gobbel <randy.gob...@gmail.com> wrote:
> Hi Mark.
>
> Here's what I was running.  I've cut out the verbose tracing, added in
> some info about the files. The CDF file is actually
> Hs133P_Hs_REFSEQ.cdf from the BrainArray site, renamed--I'm wondering
> if I need to do some more complex conversion on the CEL files.
>
> library(aroma.affymetrix)
> verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
>
> cs <- AffymetrixCelSet$byName('all', chipType='HG-U133_Plus_2')
> > getCdf(cs)
>
> AffymetrixCdfFile:
> Path: annotationData/chipTypes/HG-U133_Plus_2
> Filename: HG-U133_Plus_2.CDF
> Filesize: 15.21MB
> Chip type: HG-U133_Plus_2
> RAM: 0.00MB
> File format: v4 (binary; XDA)
> Dimension: 1164x1164
> Number of cells: 1354896
> Number of units: 25102
> Cells per unit: 53.98
> Number of QC units: 9
>
> > cs1 <- getFile(cs, 1)
> > cs1
>
> AffymetrixCelFile:
> Name: EA08034_98020_H133+_MCCW199
> Tags:
> Full name: EA08034_98020_H133+_MCCW199
> Pathname: rawData/all/HG-U133_Plus_2/EA08034_98020_H133+_MCCW199.CEL
> File size: 12.93 MB (13555928 bytes)
> RAM: 0.00 MB
> File format: v4 (binary; XDA)
> Platform: Affymetrix
> Chip type: HG-U133_Plus_2
> Timestamp: 2009-12-18 16:21:36
>
> > getCdf(cs1)
>
> AffymetrixCdfFile:
> Path: annotationData/chipTypes/HG-U133_Plus_2
> Filename: HG-U133_Plus_2.CDF
> Filesize: 15.21MB
> Chip type: HG-U133_Plus_2
> RAM: 0.00MB
> File format: v4 (binary; XDA)
> Dimension: 1164x1164
> Number of cells: 1354896
> Number of units: 25102
> Cells per unit: 53.98
> Number of QC units: 9
>
> > bc <- RmaBackgroundCorrection(cs)
>
> csBC <- process(bc,verbose=verbose)
> qn <- QuantileNormalization(csBC)
> csN <- process(qn, verbose=verbose)
> plm <- RmaPlm(csN)
> fit(plm, verbose=verbose)
> ces <- getChipEffectSet(plm)
> gExprs <- extractDataFrame(ces, units=NULL, addNames=TRUE)
> Error in list(`extractDataFrame(ces, units = NULL, addNames = TRUE)` =
> <environment>,  :
>
> [2010-01-18 09:13:25] Exception: Range of argument 'indices' is out of
> range [1,30625]: [1,54675]
>   at throw(Exception(...))
>   at throw.default(sprintf("Range of argument '%s' is out of range [%s,
> %s]: [%s,%s]", .name, range[1], range[2], xrange[1], xrange[2]))
>   at throw(sprintf("Range of argument '%s' is out of range [%s,%s]:
> [%s,%s]", .name, range[1], range[2], xrange[1], xrange[2]))
>   at getNumerics.Arguments(static, ..., asMode = "integer", disallow =
> disallow)
>   at getNumerics(static, ..., asMode = "integer", disallow = disallow)
>   at getIntegers.Arguments(static, ..., range = range)
>   at getIntegers(static, ..., range = range)
>   at method(static, ...)
>   at Arguments$getIndices(indices, range = c(1, nbrOfCells), disallow
> = "NaN")
>   at readRawData.AffymetrixCelFile(this, ...)
>   at readRawData(this, ...)
>   at getData.AffymetrixCelFile(this, indices = map[, "cell"], fields =
> celFields[fields])
>   at getData(this, indices = map[, "cell"], fields = celFields
> [fields])
>   at wit
>
> > ces
>
> ChipEffectSet:
> Name: all
> Tags: RBC,QN,RMA
> Path: plmData/all,RBC,QN,RMA/HG-U133_Plus_2
> Platform: Affymetrix
> Chip type: HG-U133_Plus_2,monocell
> Number of arrays: 9
> Names: EA08034_98020_H133+_MCCW199, EA08034_98021_H133+_SKINW199, ...,
> EA08034_98031_H133+_PN-1NN2
> Time period: 2010-01-18 09:13:24 -- 2010-01-18 09:13:25
> Total file size: 2.63MB
> RAM: 0.02MB
> Parameters: (probeModel: chr "pm")
>
> > getCdf(ces)
>
> AffymetrixCdfFile:
> Path: annotationData/chipTypes/HG-U133_Plus_2
> Filename: HG-U133_Plus_2,monocell.CDF
> Filesize: 4.44MB
> Chip type: HG-U133_Plus_2,monocell
> RAM: 0.00MB
> File format: v4 (binary; XDA)
> Dimension: 175x175
> Number of cells: 30625
> Number of units: 25102
> Cells per unit: 1.22
> Number of QC units: 9
>
>
>
> On Jan 17, 1:59 am, Mark Robinson <mrobin...@wehi.edu.au> wrote:
>
>
>
> > Hi Randy.
>
> >  From that error message, it looks like there was a mix of CDF files  
> > being used (my guess is 54675 corresponds to the number of Affymetrix  
> > probesets, whereas 30625 corresponds to the Refseq reorganization of  
> > probesets).  Can you post the code you ran?
>
> > Cheers,
> > Mark
>
> > On 16-Jan-10, at 11:41 AM, Randy Gobbel wrote:
>
> > > I'm also trying to get gene-level expression values, using
> > > HG-U133_Plus_2 data.  I downloaded the custom CDF that combines probes
> > > into probesets that correspond to RefSeq genes, linked from the
> > > aroma.affymetrix group page for this chip type (Hs133P_Hs_REFSEQ.cdf),
> > > and ran the same set of commands. It works up to the point of trying
> > > to extract expression values, then dies with:
>
> > > Exception: Range of argument 'indices' is out of range [1,30625]:
> > > [1,54675]
>
> > > At this point, I'm not sure what to do next. Suggestions?  It looks
> > > like you were the creator of the CDF--is it the right one for this?
>
> > > -Randy
>
> > > On Jun 19 2009, 10:08 pm, Mark Robinson <mrobin...@wehi.edu.au> wrote:
> > >> Hi Steve.
>
> > >> I don't know how common this is.  Basically, a colleague found a gene
> > >> that was very differentially expressed when analyzing using the
> > >> Affymetrix probeset definitions and found virtually nothing when
> > >> using the custom CDF that bundles all the probes for a gene together.
> > >> The reason was simple.  There were several probesets designed for this
> > >> gene and presumably they measure different isoforms.  The probes for
> > >> the DE probeset showed the difference, but all the other probesets
> > >> didn't.  When you use a robust linear model like RMA, outliers get
> > >> downweighted.  Because the DE probes accounted for a small proportion
> > >> of the probes (I think there were 3 or 4 other probesets at this
> > >> locus), their effect got washed out.
>
> > >> So, it's a tradeoff.  Sometimes (perhaps most of the time) you gain by
> > >> lumping them all together ... more information, more power to detect
> > >> changes.  But, sometimes (perhaps rarely) it can mislead.  I'm sure
> > >> I'm not the only one to observe such things.  The probe-level data
> > >> (usually?) doesn't lie.  But, since you are comparing across
> > >> platforms, you will undoubtedly find this as you go along.  Different
> > >> microarray designs often measure slightly different things.
>
> > >> One other thing.  If your CDF is not already in binary format, be
> > >> sure to convert it using affxparser's convertCdf().  Having this info
> > >> stored in binary format will make the processing much faster.  I think
> > >> the MBNI custom CDFs are text.
>
> > >> Cheers,
> > >> Mark
>
> > >> On 20/06/2009, at 6:55 AM, Steve P wrote:
>
> > >>> Mark,
>
> > >>> Thanks for the information. That is very helpful.
>
> > >>> I want to do the latter, which is to "combine probesets such that
> > >>> all probes for a given gene (by some definition -- RefSeq, Ensembl,
> > >>> etc) are used to arrive at the summarized value."
>
> > >>> I was able to obtain a custom CDF for the U133-A array. So I will
> > >>> try that approach. But part of the reason I want to do this is to be
> > >>> able to compare values across platforms, so I may need to find/build
> > >>> a custom CDF for the other platform.
>
> > >>> I would appreciate any cautionary advice you have about summarizing
> > >>> at the gene level.
>
> > >>> Regards,
> > >>> -Steve
>
> > >>> On Jun 17, 9:56 am, Steve Piccolo <steve.picc...@gmail.com> wrote:
> > >>>> Yesterday I posted this question to the list, but the spam blocker
> > >>>> didn't let it through. Below my question is a response from Mark
> > >>>> Robinson.
>
> > >>>> ----------------------------------------------------------------------
>
> > >>>> Following the example provided at
> > >>>> http://groups.google.com/group/aroma-affymetrix/web/gene-1-0-st-array...,
> > >>>> I am running the following code:
>
> > >>>> chipType <- "HT_HG-U133A"
> > >>>> dataSet = "myData"
>
> > >>>> library(aroma.affymetrix)
> > >>>> verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
>
> > >>>> cdf <- AffymetrixCdfFile$byChipType(chipType)
> > >>>> cs <- AffymetrixCelSet$byName(dataSet, cdf=cdf)
>
> > >>>> bc <- RmaBackgroundCorrection(cs)
> > >>>> csBC <- process(bc,verbose=verbose)
> > >>>> qn <- QuantileNormalization(csBC)
> > >>>> csN <- process(qn, verbose=verbose)
>
> > >>>> plm <- RmaPlm(csN)
> > >>>> fit(plm, verbose=verbose)
>
> > >>>> ces <- getChipEffectSet(plm)
> > >>>> gExprs <- extractDataFrame(ces, units=NULL, addNames=TRUE)
>
> > >>>> This seems to be working beautifully.
>
> > >>>> However, I'm doing an analysis that requires my expression values
> > >>>> to be summarized at the gene level rather than the probeset level.
>
> > >>>> In the gExprs object that results from the above analysis, I get a
> > >>>> data.frame object in which each row contains expression values for
> > >>>> a given probeset across all samples. What I would love to see in
> > >>>> each row is an expression value for a given gene. I believe RMA has
> > >>>> the ability to do this, but I'm not sure how to do it via
> > >>>> aroma.affymetrix.
>
> > >>>> Any suggestions? I'm happy to provide any more details that would
> > >>>> be helpful.
>
> > >>>> Regards,
> > >>>> -Steve
>
> > >>>> ----------------------------------------------------------------------
>
> > >>>> Hi Steve.
>
> > >>>> As to your question, it depends on what you need.  When you say you
> > >>>> want every row to be a gene, do you just want to know the gene name
> > >>>> that goes with the probeset identifier, or do you want to combine
> > >>>> probesets such that all probes for a given gene (by some definition
> > >>>> -- RefSeq, Ensembl, etc) are used to arrive at the summarized value
> > >>>> (a la the MBNI CustomCDF)?
>
> > >>>> If the former, then there are annotation packages within R.
>
> > >>>> If the latter, I have a few cautionary tales of doing this, since
> > >>>> the different probesets for a given locus can be measuring different
> > >>>> variants.  But if you still want to do this, we need to make a CDF
> > >>>> file specific to the annotation you want.  For the standard HG-U133
> > >>>> arrays, I know the MBNI guys made the CDFs and we could use those
> > >>>> within aroma.affymetrix.  I don't know if they build custom CDFs for
> > >>>> the HT- arrays.
>
> > >>>> Hope that gets you started....
>

