[aroma.affymetrix] Re: Gene-Level Summarization of Expression Data

Randy Gobbel Mon, 18 Jan 2010 11:46:03 -0800

As for why I renamed the CDF: I find the naming/tagging scheme a
little confusing, in that I'm not sure when renaming/retagging has
consequences, when it doesn't, and what is likely to happen if I do it.
Since I ran the analysis with the renamed file in a completely
separate directory tree, I'm sort of curious as to how the information
from the HG-U133_Plus_2 CDF managed to be imported at all.


I now have expression numbers that were derived using
Hs133P_Hs_REFSEQ.cdf, and have moved on to running LIMMA over the
results.  It appears that the probe names are taken from the RefSeq
sources, with "_at" appended.  It also appears that probes that share
a RefSeq source all have the same expression number.  What I'd like
for my analysis is to have only one probe per gene, i.e., for the
expression values to be unique, but I'm not sure how best to get that.
Suggestions?

-Randy
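
A minimal sketch of one way to get a single row per unit name, in base R.
It assumes the extracted data frame has a unitName column and that rows
sharing a unit name carry identical values, as described above.  The toy
data frame, its column names, and its values are made up for illustration
and are not output from aroma.affymetrix.

```r
# Toy stand-in for the extracted expression data frame.
# Column names and values are hypothetical.
gExprs <- data.frame(
  unitName = c("NM_0001_at", "NM_0001_at", "NM_0002_at"),
  arrayA   = c(7.1, 7.1, 5.3),
  arrayB   = c(6.9, 6.9, 5.8),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each unit name.
uniqueExprs <- gExprs[!duplicated(gExprs$unitName), , drop = FALSE]

# Use the unit names as row names so downstream tools can key on them.
rownames(uniqueExprs) <- uniqueExprs$unitName
```

Keeping the first row is only safe if the duplicated rows really are
identical; if they differ, some explicit rule (mean, maximum, and so on)
would be needed instead.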

On Jan 18, 10:31 am, Henrik Bengtsson <henrik.bengts...@gmail.com>
wrote:
> On Mon, Jan 18, 2010 at 10:15 AM, Randy Gobbel <randy.gob...@gmail.com> wrote:
> > Aha. I have just discovered that the actual name of the CDF file makes
> > a difference.  Changing the CDF name back to Hs133P_Hs_REFSEQ.cdf
> > makes it work without errors. I'm guessing that using a well-known
> > name like "HG-U133_Plus_2" causes it to use Bioconductor's default
> > version, in some cases.  Am I right?
>
> This explains your problems.  Why did you rename the CDF in the
> first place?  I'm not being sarcastic, but it is useful for me to
> understand users' behaviors when developing the software.
>
> The name of the CDF file is very important.  If you find yourself
> doing "tricks" like renaming annotation data files to get things
> working, you are probably not doing the correct thing.  The aroma.*
> framework tries its best to catch this and prevent the mistake from
> propagating, and I guess we succeeded.
>
> Bioconductor annotation data packages are *not* involved here.  Please
> note that a CDF is formally a *.CDF file following a file format
> defined by Affymetrix.  On Bioconductor there are cdf environment/cdf
> packages, which are very unfortunate names because people started to
> call them just "cdfs", which is wrong.
>
> /Henrik
>
>
>
>
>
> > -Randy
>
> > On Jan 18, 9:47 am, Randy Gobbel <randy.gob...@gmail.com> wrote:
> >> Hi Mark.
>
> >> Here's what I was running.  I've cut out the verbose tracing, added in
> >> some info about the files. The CDF file is actually
> >> Hs133P_Hs_REFSEQ.cdf from the BrainArray site, renamed--I'm wondering
> >> if I need to do some more complex conversion on the CEL files.
>
> >> library(aroma.affymetrix)
> >> verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
>
> >> cs <- AffymetrixCelSet$byName('all', chipType='HG-U133_Plus_2')
> >> > getCdf(cs)
>
> >> AffymetrixCdfFile:
> >> Path: annotationData/chipTypes/HG-U133_Plus_2
> >> Filename: HG-U133_Plus_2.CDF
> >> Filesize: 15.21MB
> >> Chip type: HG-U133_Plus_2
> >> RAM: 0.00MB
> >> File format: v4 (binary; XDA)
> >> Dimension: 1164x1164
> >> Number of cells: 1354896
> >> Number of units: 25102
> >> Cells per unit: 53.98
> >> Number of QC units: 9
>
> >> > cs1 <- getFile(cs, 1)
> >> > cs1
>
> >> AffymetrixCelFile:
> >> Name: EA08034_98020_H133+_MCCW199
> >> Tags:
> >> Full name: EA08034_98020_H133+_MCCW199
> >> Pathname: rawData/all/HG-U133_Plus_2/EA08034_98020_H133+_MCCW199.CEL
> >> File size: 12.93 MB (13555928 bytes)
> >> RAM: 0.00 MB
> >> File format: v4 (binary; XDA)
> >> Platform: Affymetrix
> >> Chip type: HG-U133_Plus_2
> >> Timestamp: 2009-12-18 16:21:36
> >> > getCdf(cs1)
>
> >> AffymetrixCdfFile:
> >> Path: annotationData/chipTypes/HG-U133_Plus_2
> >> Filename: HG-U133_Plus_2.CDF
> >> Filesize: 15.21MB
> >> Chip type: HG-U133_Plus_2
> >> RAM: 0.00MB
> >> File format: v4 (binary; XDA)
> >> Dimension: 1164x1164
> >> Number of cells: 1354896
> >> Number of units: 25102
> >> Cells per unit: 53.98
> >> Number of QC units: 9
>
> >> > bc <- RmaBackgroundCorrection(cs)
>
> >> csBC <- process(bc,verbose=verbose)
> >> qn <- QuantileNormalization(csBC)
> >> csN <- process(qn, verbose=verbose)
> >> plm <- RmaPlm(csN)
> >> fit(plm, verbose=verbose)
> >> ces <- getChipEffectSet(plm)
> >> gExprs <- extractDataFrame(ces, units=NULL, addNames=TRUE)
> >> Error in list(`extractDataFrame(ces, units = NULL, addNames = TRUE)` =
> >> <environment>,  :
>
> >> [2010-01-18 09:13:25] Exception: Range of argument 'indices' is out of
> >> range [1,30625]: [1,54675]
> >>   at throw(Exception(...))
> >>   at throw.default(sprintf("Range of argument '%s' is out of range [%s,
> >> %s]: [%s,%s]", .name, range[1], range[2], xrange[1], xrange[2]))
> >>   at throw(sprintf("Range of argument '%s' is out of range [%s,%s]:
> >> [%s,%s]", .name, range[1], range[2], xrange[1], xrange[2]))
> >>   at getNumerics.Arguments(static, ..., asMode = "integer", disallow =
> >> disallow)
> >>   at getNumerics(static, ..., asMode = "integer", disallow = disallow)
> >>   at getIntegers.Arguments(static, ..., range = range)
> >>   at getIntegers(static, ..., range = range)
> >>   at method(static, ...)
> >>   at Arguments$getIndices(indices, range = c(1, nbrOfCells), disallow
> >> = "NaN")
> >>   at readRawData.AffymetrixCelFile(this, ...)
> >>   at readRawData(this, ...)
> >>   at getData.AffymetrixCelFile(this, indices = map[, "cell"], fields =
> >> celFields[fields])
> >>   at getData(this, indices = map[, "cell"], fields = celFields
> >> [fields])
> >>   at wit
> >> > ces
>
> >> ChipEffectSet:
> >> Name: all
> >> Tags: RBC,QN,RMA
> >> Path: plmData/all,RBC,QN,RMA/HG-U133_Plus_2
> >> Platform: Affymetrix
> >> Chip type: HG-U133_Plus_2,monocell
> >> Number of arrays: 9
> >> Names: EA08034_98020_H133+_MCCW199, EA08034_98021_H133+_SKINW199, ...,
> >> EA08034_98031_H133+_PN-1NN2
> >> Time period: 2010-01-18 09:13:24 -- 2010-01-18 09:13:25
> >> Total file size: 2.63MB
> >> RAM: 0.02MB
> >> Parameters: (probeModel: chr "pm")
> >> > getCdf(ces)
>
> >> AffymetrixCdfFile:
> >> Path: annotationData/chipTypes/HG-U133_Plus_2
> >> Filename: HG-U133_Plus_2,monocell.CDF
> >> Filesize: 4.44MB
> >> Chip type: HG-U133_Plus_2,monocell
> >> RAM: 0.00MB
> >> File format: v4 (binary; XDA)
> >> Dimension: 175x175
> >> Number of cells: 30625
> >> Number of units: 25102
> >> Cells per unit: 1.22
> >> Number of QC units: 9
>
> >> On Jan 17, 1:59 am, Mark Robinson <mrobin...@wehi.edu.au> wrote:
>
> >> > Hi Randy.
>
> >> >  From that error message, it looks like there was a mix of CDF files
> >> > being used (my guess is 54675 corresponds to the number of Affymetrix
> >> > probesets, whereas 30625 corresponds to the Refseq reorganization of
> >> > probesets).  Can you post the code you ran?
>
> >> > Cheers,
> >> > Mark
>
> >> > On 16-Jan-10, at 11:41 AM, Randy Gobbel wrote:
>
> >> > > I'm also trying to get gene-level expression values, using HG-
> >> > > U133_Plus_2 data.  I downloaded the custom CDF that combines probes
> >> > > into probesets that correspond to RefSeq genes, linked from the
> >> > > aroma.affymetrix group page for this chip type (Hs133P_Hs_REFSEQ.cdf),
> >> > > and ran the same set of commands. It works up to the point of trying
> >> > > to extract expression values, then dies with:
>
> >> > > Exception: Range of argument 'indices' is out of range [1,30625]:
> >> > > [1,54675]
>
> >> > > At this point, I'm not sure what to do next. Suggestions?  It looks
> >> > > like you were the creator of the CDF--is it the right one for this?
>
> >> > > -Randy
>
> >> > > On Jun 19 2009, 10:08 pm, Mark Robinson <mrobin...@wehi.edu.au> wrote:
> >> > >> Hi Steve.
>
> >> > >> I don't know how common this is.  Basically, a colleague found a gene
> >> > >> that was very differentially expressed when analyzing using the
> >> > >> Affymetrix probesets definition and found virtually nothing when using
> >> > >> the custom CDF that bundles all the probes for a gene together.  The
> >> > >> reason was simple.  There were several probesets designed for this
> >> > >> gene and presumably they measure different isoforms.  The probes for
> >> > >> the DE probeset showed the difference, but all the other probesets
> >> > >> didn't.  When you use a robust linear model like RMA, outliers get
> >> > >> downweighted.  Because the DE probes accounted for a small proportion
> >> > >> of the probes (I think there were 3 or 4 other probesets at this
> >> > >> locus), their effect got washed out.
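
The washing-out effect described above can be illustrated with a small
base R sketch.  It uses the median as a simple stand-in for a robust
summary; this is not the actual RMA model fit, and the probe counts and
log2 differences are made-up numbers chosen only to show the effect.

```r
# Hypothetical log2 differences (treatment minus control) for the probes
# at one locus: one differentially expressed (DE) probeset of 11 probes,
# plus four other probesets (44 probes) that show no change.
deProbes    <- rep(2, 11)
otherProbes <- rep(0, 44)

# Summarizing the DE probeset on its own keeps the signal.
perProbeset <- median(deProbes)

# Pooling every probe for the gene lets the unchanged majority dominate,
# so the robust summary treats the DE probes as outliers.
pooled <- median(c(deProbes, otherProbes))
```

Here the DE probes make up a fifth of the pooled set, so the robust
summary of the pooled probes follows the unchanged majority.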
>
> >> > >> So, it's a tradeoff.  Sometimes (perhaps most of the time) you gain by
> >> > >> lumping them all together ... more information, more power to detect
> >> > >> changes.  But, sometimes (perhaps rarely) it can mislead.  I'm sure
> >> > >> I'm not the only one to observe such things.  The probe-level data
> >> > >> (usually?) doesn't lie.  But, since you are comparing across
> >> > >> platforms, you will undoubtedly find this as you go along.  Different
> >> > >> microarray designs often measure slightly different things.
>
> >> > >> One other thing.  Be sure to convert your CDF to binary, if it is not
> >> > >> already, using affxparser's convertCdf().  Having this info stored in
> >> > >> binary format will make the processing much faster.  I think the MBNI
> >> > >> custom CDFs are text.
>
> >> > >> Cheers,
> >> > >> Mark
>
> >> > >> On 20/06/2009, at 6:55 AM, Steve P wrote:
>
> >> > >>> Mark,
>
> >> > >>> Thanks for the information. That is very helpful.
>
> >> > >>> I want to do the latter, which is to "combine probesets such that
> >> > >>> all
> >> > >>> probes for a given gene (by some definition -- RefSeq, Ensembl, etc)
> >> > >>> are used to arise at the summarize value."
>
> >> > >>> I was able to obtain a custom CDF for the U133-A array. So I will
> >> > >>> try
> >> > >>> that approach. But part of the reason I want to do this is to be
> >> > >>> able
> >> > >>> to compare values across platforms, so I may need to find/build a
> >> > >>> custom CDF for the other platform.
>
> >> > >>> I would appreciate any cautionary advice you have about
> >> > >>> summarizing at
> >> > >>> the gene level.
>
> >> > >>> Regards,
> >> > >>> -Steve
>
> >> > >>> On Jun 17, 9:56 am, Steve Piccolo <steve.picc...@gmail.com> wrote:
> >> > >>>> Yesterday I posted this question to the list, but the spam blocker
> >> > >>>> didn't
> >> > >>>> let it through. Below my question is a response from Mark Robinson.
>
> >> > >>>> ---------------------------------------------------------------------------
> >> > >>>>  -----------------------------------
>
> >> > >>>> Following the example provided at
> >> > >>>> http://groups.google.com/group/aroma-affymetrix/web/gene-1-0-st-array...,
> >> > >>>> I am running the following code:
>
> >> > >>>> chipType <- "HT_HG-U133A"
> >> > >>>> dataSet = "myData"
>
> >> > >>>> library(aroma.affymetrix)
> >> > >>>> verbose <- Arguments$getVerbose(-8, timestamp=TRUE)
>
> >> > >>>> cdf <- AffymetrixCdfFile$byChipType(chipType)
> >> > >>>> cs <- AffymetrixCelSet$byName(dataSet, cdf=cdf)
>
> >> > >>>> bc <- RmaBackgroundCorrection(cs)
> >> > >>>> csBC <- process(bc,verbose=verbose)
> >> > >>>> qn <- QuantileNormalization(csBC)
> >> > >>>> csN <- process(qn, verbose=verbose)
>
> >> > >>>> plm <- RmaPlm(csN)
> >> > >>>> fit(plm, verbose=verbose)
>
> >> > >>>> ces <- getChipEffectSet(plm)
> >> > >>>> gExprs <- extractDataFrame(ces, units=NULL,...
>
