[aroma.affymetrix] Re: annotation of ST gene Arrays

Mark Robinson Wed, 16 Dec 2009 16:48:52 -0800

Hi Wade.

I think the problem lies with the 'ragene10st*probeset*.db' library.


How about trying the symbols from the 'ragene10sttranscriptcluster.db'  
package:
http://www.bioconductor.org/packages/release/data/annotation/html/ragene10sttranscriptcluster.db.html

I can't remember when this change was made, but my 'hugene10st.db'  
example is now outdated.  You should use hugene10stprobeset.db for  
probesets or hugene10sttranscriptcluster.db for transcript clusters.
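
As a rough sketch of that lookup, assuming your unit names are transcript
cluster IDs: the helper name annotateIds and the toy IDs/symbols below are
made up for illustration, and the commented lines assume that
ragene10sttranscriptcluster.db is installed and exposes a
ragene10sttranscriptclusterSYMBOL map in the same way hugene10st.db did.

```r
# Hypothetical helper (not part of any package): look up gene symbols for a
# vector of Affymetrix IDs in a named character vector of symbols.
annotateIds <- function(ids, symbols) {
  ids <- as.character(ids)
  data.frame(affyid = ids,
             symbol = unname(symbols[ids]),  # NA where an ID has no symbol
             stringsAsFactors = FALSE)
}

# Toy lookup table with invented symbols, standing in for the real map.
toySymbols <- c("10700001" = "GeneA", "10700003" = NA, "10700004" = "GeneB")
toy <- annotateIds(c("10700001", "10700003", "10700999"), toySymbols)

# Number of IDs that received a symbol; a very low count on real data
# suggests the IDs and the annotation package are at different levels.
sum(!is.na(toy$symbol))

# With the real package (assumed installed), the table would come from:
#   library(ragene10sttranscriptcluster.db)
#   symbols <- unlist(as.list(ragene10sttranscriptclusterSYMBOL))
#   annot <- annotateIds(myids, symbols)
```

If most of your IDs still come back NA with the transcript cluster package,
that would point to a different problem than the one above.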

Hope that helps.

Cheers,
Mark

On 17-Dec-09, at 10:09 AM, Wade D wrote:

> Hi Mark and others,
> I am in a somewhat similar situation to the original person who started
> this discussion, so I am tacking my question onto your response from
> February.
>
> This is my first ST analysis, and I am using the Rat gene 1.0 ST. I
> followed the example given at
>     
> http://groups.google.com/group/aroma-affymetrix/web/gene-1-0-st-array-analysis
> and everything has worked fine so far.
>
> Now, I would like to annotate my gene-level summaries. I tried using
> methods I typically do (from the annotate package) with
> ragene10stprobeset.db, but things didn't seem right. So I figured it
> was me, and I came back to the group help pages and found your post.
> Mimicking it below, it seems that I've either done something wrong, or
> there is a problem with ragene10stprobeset.db.
>
>> library(ragene10stprobeset.db)
>> symbols <- unlist(as.list(ragene10stprobesetSYMBOL))
>> myids<-gExprs[,1]
>> head(myids)
> [1] "10700001" "10700003" "10700004" "10700005" "10700013" "10700014"
>>
>> temp<-data.frame(affyid = myids,symbol = symbols[myids])
>> #temp[!is.na(temp$symbol),]
>>
>> sum(!is.na(temp$symbol))
> [1] 237
>
> This is a disturbingly low number, so I figure something is amiss.
> Following your lead, I compare the CDF with what is on Affy's website
> in the transcript and probeset files...
>
>> tr <- read.csv("RaGene-1_0-st-v1.na30.1.rn4.transcript.csv",header=TRUE,comment.char="#")
>> ps <- read.csv("RaGene-1_0-st-v1.na30.rn4.probeset.csv",header=TRUE,comment.char="#")
>> #chipType <- "RaGene-1_0-st-v1"
>> #cdf <- AffymetrixCdfFile$byChipType(chipType, tags="r3")
>> un <- getUnitNames(cdf)
>> sum( un %in% ps$transcript_cluster_id )
> [1] 27342
>>
>> sum( un %in% tr$transcript_cluster_id )
> [1] 29169
>
> Everything looks reasonable here.
>
>> sum(names(symbols) %in% ps$transcript_cluster_id )
> [1] 0
>> sum(names(symbols) %in% tr$transcript_cluster_id )
> [1] 1872
>
> This is the problem it seems.
>
> I wanted to ask others before I build my own annotation.db for
> ragene10st. I've done it for Illumina arrays before, but it has been
> a while, and it is a little bit of a pain for Windows users to do. Just
> wanted to get a second opinion before I go down that road, especially
> since this is my first time dealing with ST arrays.
>
> Thanks,
> Wade
>
>
>
>
> On Feb 10, 3:13 am, Mark Robinson <mrobin...@wehi.edu.au> wrote:
>> Hi Simon.
>>
>> See comments below.
>>
>>> I am using the mouse gene ST arrays and am having problems with
>>> annotation. When I write a csv file, the annotation is only the
>>> probeset_id, no gene names or accession numbers etc.
>>
>> That's what it should be.  Actually, it's the 'transcript_cluster_id'.
>> Previously, Affy did not provide annotation at the "probeset" level.
>>
>> The CDF file just contains the identifiers.  Linking results (e.g.
>> expression summaries) to the annotation can be done with other R
>> packages.  For example, here is some code I gave Sebastien a few weeks
>> ago that will get you started (just replace hugene10st.db with
>> mogene10st.db):
>>
>> -------------
>> Say you have some Affy identifiers:
>>
>>  > myids
>>  [1] "7950136" "7955845" "7955852" "7955855" "7955858" "7955865" "7955869"
>>  [8] "7955873" "7955887" "8016433"
>>
>> Load package and read off the gene symbols:
>>
>>  > library(hugene10st.db)
>>  > symbols <- unlist(as.list(hugene10stSYMBOL))
>>  > data.frame(affyid = myids,symbol = symbols[myids])
>>          affyid symbol
>> 7950136 7950136 PHOX2A
>> 7955845 7955845 HOXC13
>> 7955852 7955852 HOXC12
>> 7955855 7955855 HOXC11
>> 7955858 7955858 HOXC10
>> 7955865 7955865  HOXC9
>> 7955869 7955869  HOXC8
>> 7955873 7955873  HOXC6
>> 7955887 7955887  HOXC5
>> 8016433 8016433  HOXB1
>>
>> Here are some other fields in hugene10st.db:
>>
>>  > hugene10st
>> hugene10st               hugene10stCHRLENGTHS     hugene10stENTREZID
>> hugene10stGO2ALLPROBES   hugene10stORGANISM       hugene10stPMID2PROBE
>> hugene10stUNIPROT        hugene10st.db::          hugene10stCHRLOC
>> hugene10stENZYME         hugene10stGO2PROBE       hugene10stPATH
>> hugene10stPROSITE        hugene10st_dbInfo        hugene10stACCNUM
>> hugene10stCHRLOCEND      hugene10stENZYME2PROBE   hugene10stMAP
>> hugene10stPATH2PROBE     hugene10stREFSEQ         hugene10st_dbconn
>> hugene10stALIAS2PROBE    hugene10stENSEMBL        hugene10stGENENAME
>> hugene10stMAPCOUNTS      hugene10stPFAM           hugene10stSYMBOL
>> hugene10st_dbfile        hugene10stCHR            hugene10stENSEMBL2PROBE
>> hugene10stGO             hugene10stOMIM           hugene10stPMID
>> hugene10stUNIGENE        hugene10st_dbschema
>>
>> ...
>> -------------
>>
>>> These probesets
>>> also do not match the probeset_ids from MoGene-1_0-st-v1.na27.mm9 off
>>> the affymetrix website.
>>
>> Perhaps you want 'transcript_cluster_id's?
>>
>> (CSV files from http://www.affymetrix.com/products_services/arrays/specific/mousegene...)
>>
>>  > tr <- read.csv("MoGene-1_0-st-v1.na27.mm9.transcript.csv",header=TRUE,comment.char="#")
>>  > ps <- read.csv("MoGene-1_0-st-v1.na27.mm9.probeset.csv",header=TRUE,comment.char="#")
>>
>>  > cdf <- AffymetrixCdfFile$fromChipType("MoGene-1_0-st-v1",verbose=verbose)
>>  > un <- getUnitNames(cdf)
>>  > sum( un %in% ps$transcript_cluster_id )
>> [1] 28815
>>  > sum( un %in% tr$transcript_cluster_id )
>> [1] 35474
>>
>> You may also be interested in the following thread, which explains the
>> difference in number of probesets:
>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/1...
>>
>>
>>
>>> here is my session:
>>
>>>> library('aroma.affymetrix')
>>>> cdf <- AffymetrixCdfFile$byChipType("MoGene-1_0-st-v1",tags='r3')
>>>> cs <- AffymetrixCelSet$byName("Files", cdf=cdf)
>>>> bc <- RmaBackgroundCorrection(cs)
>>>> csBC <- process(bc,verbose=verbose)
>>>> qn <- QuantileNormalization(csBC, typesToUpdate="pm")
>>>> csN <- process(qn, verbose=verbose)
>>>> plm <- RmaPlm(csN)
>>>> fit(plm, verbose=verbose)
>>>> qam <- QualityAssessmentModel(plm)
>>>> ces <- getChipEffectSet(plm)
>>>> mat <- extractMatrix(ces)
>>>> mat <- log2(mat)
>>>> rownames(mat) <- getUnitNames(cdf)
>>>> write.csv(mat, file="data.csv")
>>
>>> I am sure there is a simple solution to this and I apologize as I am
>>> new to "R". Any help would be much appreciated. Also, what are people's
>>> opinions on the "positive" and "negative controls" probesets? Should
>>> these be included as part of a final gene list?
>>> Thank you in advance for any help.
>>
>> Good question.  Some people use the controls for QC and some use them
>> for adjusting for background (for example, the pool of GC content
>> probes).  But, definitely if you were to follow this up with some kind
>> of differential expression analysis (e.g. limma), I would discard the
>> non-"main" probes.  For example:
>>
>>  > table(tr$category)
>>
>>              control->affx control->bgp->antigenomic                      main
>>                         22                        45                     28815
>>             normgene->exon          normgene->intron  rescue->FLmRNA->unmapped
>>                       1324                      5222                        91
>>
>> Hope that helps.
>> Mark
>>
>> ------------------------------
>> Mark Robinson
>> Epigenetics Laboratory, Garvan
>> Bioinformatics Division, WEHI
>> e: m.robin...@garvan.org.au
>> e: mrobin...@wehi.edu.au
>> p: +61 (0)3 9345 2628
>> f: +61 (0)3 9347 0852
>> ------------------------------

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robin...@garvan.org.au
e: mrobin...@wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------





