> Hi Mark: > > After I have obtained the statistically significant differentially > expression(DE) exons between the two groups using limma, I want further to > explore the splicing pattern of the DE exons in each group. As I know, for > each array, FIRMA scores each exon as to whether its probes systematically > deviate from the expected gene expression level. Is there any way to get a > summary of firma score for each exon across all the arrays belonging to > the > same group? Thanks!
How about the average of the FIRMA scores as a summary? Mark > > Xinjun > > On Mon, May 4, 2009 at 7:40 AM, Mark Robinson <mrobin...@wehi.edu.au> > wrote: > >> >> Hi Xinjun. >> >> Here, 'unitName' is the transcript cluster id and 'groupName' is the >> probeset id, as defined by Affymetrix. The 'unit', 'group' and 'cell' >> columns are indices and you may not need these. To find out what the >> unitName and groupName correspond to, I would consult the Affymetrix >> annotation files. >> >> (Assuming Human Exon 1.0 ST), if you go to: >> >> http://www.affymetrix.com/products_services/arrays/specific/exon.affx#1_4 >> >> and find the section "Current NetAffx Annotation Files" you'll find 2 >> CSV files that you can download, one for transcript clusters and one >> for probesets. These should give you genome coordinates, Genbank/ >> RefSeq/Ensembl identifiers, gene symbols, etc. >> >> Hope that helps. >> Mark >> >> >> On 04/05/2009, at 12:02 AM, Xinjun Zhang wrote: >> >> > Hi: >> > >> > Thanks very much for your great help. But I still have difficulty in >> > understanding the first 5 columns of fsDF, as you have taken as an >> > example: >> > >> > > head(fsDF[,1:5]) >> > unitName groupName unit group cell >> > 1 2315251 2315252 1 1 1 >> > 2 2315251 2315253 1 2 2 >> > 3 2315373 2315374 2 1 3 >> > 4 2315373 2315375 2 2 4 >> > 5 2315373 2315376 2 3 5 >> > 6 2315373 2315377 2 4 6 >> > >> > What does each column, especially the unitName and groupName mean ? >> > And how can I correlate unitName and groupName to gene name and exon >> > number? Thanks in advance! >> > >> > Xinjun >> > >> > On Thu, Apr 30, 2009 at 3:31 PM, Mark Robinson >> > <mrobin...@wehi.edu.au> wrote: >> > >> > Hi Xinjun. >> > >> > Comments below. >> > >> > >> > On 30/04/2009, at 12:33 AM, Xinjun Zhang wrote: >> > >> > > Hi: >> > > >> > > Sorry for the second ambiguous question. Now I will give it out in >> > > another way: >> > > >> > > After limma analysis: >> > > >> > > # two groups: CEU and YRI >> > > >design >> > > CEU YRIvsCEU >> > > GSM188687 1 0 >> > > GSM188688 1 0 >> > > GSM188861 1 1 >> > > GSM188862 1 1 >> > > >> > > #Limma >> > > fsDF <- extractDataFrame(fs, addNames=TRUE) >> > > fsDF[,-c(1:5)] <- log2(fsDF[,-c(1:5)]) >> > > fit<-lmFit(fsDF[,-c(1:5)],design) >> > > fit<-eBayes(fit) >> > > fit$genes<-fsDF[,1] >> > > topTable(fit, coef=NULL, number = 10, adjust="BH") >> > > >> > > Then I got output in RGui in this form: >> > > >> > > > topTable(fit,coef="YRIvsCEU",adjust="BH") >> > > ID logFC t P.Value adj.P.Val B >> > > 248067 3851537 10.781228 18.01769 5.069965e-07 0.1441163 >> > > -4.159301 ## In this line, is "248067 " a NCBI geneid d and >> > > "3851537" a probeset id? And if the first coloum is gene ID, what >> > > is strange to me is that some of the ID is not a human gene id. >> > > 219150 3721400 -12.364204 -14.91257 1.798048e-06 0.2146088 -4.164325 >> > > 90041 2903401 -8.915503 -13.79270 3.021449e-06 0.2146088 -4.166979 >> > > 80808 2836738 7.811150 13.45085 3.568320e-06 0.2146088 -4.167917 >> > > 250529 3862018 -7.935552 -13.33698 3.774934e-06 0.2146088 -4.168245 >> > > 176674 3462843 10.559478 12.92835 4.637400e-06 0.2158173 -4.169490 >> > > 224640 3744039 7.930627 12.66410 5.314668e-06 0.2158173 -4.170356 >> > > 134937 3224650 -9.385466 -12.18252 6.860763e-06 0.2437758 -4.172073 >> > > 155392 3352948 -6.802731 -11.71350 8.878365e-06 0.2804133 -4.173938 >> > > 104503 3003193 -8.947865 -11.50151 1.000676e-05 0.2844473 -4.174852 >> > >> > Be careful here. The first number you see here ("248067") is a row >> > number that limma puts in, and is not an gene/probeset identifier of >> > any kind. The "ID" column is what you have put in fit$genes (see >> > above, you have the command "fit$genes <- fsDF[,1]"). >> > >> > I would actually recommend that you put more in fit$genes, because >> > what you have have now is only the first column of the fsDF data >> > frame, which gives the transcript_cluster_id. So, you have the >> > transcript_cluster_id, but you don't know what probeset_id this >> > corresponds to. >> > >> > For example: >> > >> > > head(fsDF[,1:5]) >> > unitName groupName unit group cell >> > 1 2315251 2315252 1 1 1 >> > 2 2315251 2315253 1 2 2 >> > 3 2315373 2315374 2 1 3 >> > 4 2315373 2315375 2 2 4 >> > 5 2315373 2315376 2 3 5 >> > 6 2315373 2315377 2 4 6 >> > >> > Maybe it would be better to do something like: >> > >> > ... >> > fit$genes <- fsDF[,1:2] >> > >> > ... that way your output table will look something like: >> > >> > > topTable(fit,coef=1,n=2) >> > unitName groupName logFC t P.Value >> > adj.P.Val B >> > 4166 3581637 3582005 -2.581813 -12.65587 3.786135e-13 2.342177e-09 >> > 19.22332 >> > 4459 2656818 2656822 -2.951434 -12.54443 4.684822e-13 2.342177e-09 >> > 19.03846 >> > >> > >> > >> > >> > > > >> > > > topTable(fit,coef=NULL,adjust.method="BH") # what is the >> > > difference of the two forms of output? I mean, what does the column >> > > "CEU" and "YRIvsCEU" mean? >> > > ProbeID CEU YRIvsCEU F P.Value adj.P.Val >> > > 248067 3851537 -5.235567 10.781228 162.45279 1.705464e-06 0.4847866 >> > > 219150 3721400 6.152369 -12.364204 111.19496 6.042656e-06 0.7176463 >> > > 90041 2903401 4.511199 -8.915503 95.13301 1.013043e-05 0.7176463 >> > > 80808 2836738 -3.606537 7.811150 90.99296 1.173308e-05 0.7176463 >> > > 250529 3862018 3.970771 -7.935552 88.93762 1.265099e-05 0.7176463 >> > > 176674 3462843 -5.305275 10.559478 83.57312 1.552603e-05 0.7176463 >> > > 224640 3744039 -3.792070 7.930627 80.34278 1.767260e-05 0.7176463 >> > > 134937 3224650 4.691757 -9.385466 74.20693 2.292795e-05 0.8146732 >> > > 155392 3352948 3.399463 -6.802731 68.60307 2.962962e-05 0.9358188 >> > > 104503 3003193 4.306241 -8.947865 66.23524 3.322071e-05 0.9443152 >> > >> > >> > This is where you should read and understand what limma (and >> > specifically the topTable() routine) is doing. Briefly, when you say >> > coef=NULL, you are testing the significance of all parameters >> > simultaneously, which results in a moderated F-statistic. When you >> > specify a single column (e.g. coef="YRIvsCEU" as you have), you are >> > testing the significance of that single parameter, and you will get a >> > moderated t-statistic. >> > >> > >> > >> > > And when I take a quick look at the output produced by the function >> > > write.fit( ) : >> > > write.fit(fit,results=NULL, "limma_result.out",digits=3, >> > > adjust="none", method="separate") >> > > >> > > and the first line of the output file "limma_result.ou" is : >> > > >> > > Coef.CEU Coef.YRIvsCEU t.CEU t.YRIvsCEU p.value.CEU >> > > p.value.YRIvsCEU F F.p.value Genes >> > > -0.5 0.955 -1.08 1.46 0.31653 0.18906 1.07 >> > > 0.39541 2315251 >> > > 0.41 -0.632 0.88 -0.96 0.4071 0.36826 0.5 >> > > 0.62453 2315251 >> > > 0.168 -0.384 0.41 -0.66 0.69603 0.53134 0.22 >> > > 0.80748 2315373 >> > > -0.597 1.1 -1.44 1.88 0.19287 0.10287 1.78 >> > > 0.23803 2315373 >> > > 0.276 -0.396 0.68 -0.69 0.51883 0.51368 0.27 >> > > 0.76801 2315373 >> > > 0.118 -0.214 0.25 -0.32 0.80819 0.75689 0.05 >> > > 0.94922 2315373 >> > > -0.018 0.128 -0.04 0.21 0.96823 0.83876 0.03 >> > > 0.9667 2315554 >> > > -0.027 0.191 -0.06 0.31 0.95283 0.76783 0.07 >> > > 0.9317 2315554 >> > > 0.276 -0.629 0.65 -1.05 0.53693 0.33006 0.56 >> > > 0.59639 2315554 >> > > 0.405 -0.552 0.88 -0.85 0.40726 0.4232 0.44 >> > > 0.66023 2315554 >> > > 0.109 -0.199 0.23 -0.3 0.82444 0.77509 0.04 >> > > 0.95666 2315554 >> > > -0.318 0.771 -0.63 1.08 0.54789 0.31597 0.6 >> > > 0.57392 2315554 >> > > ..... >> > > >> > > It got only a colomn called "Genes" ( it is probe id, I guess). So >> > > how can I use this output to find out the differentially expressed >> > > exons( and the corresponding genes) in the two groups? Is it clearer >> > > to you? Thanks in advance! >> > > >> > > Xinjun >> > >> > >> > As I mentioned above, you should change what you put in fit$genes to >> > allow you to know what probeset_id each row corresponds to. I'm >> > guessing here, but you are probably interested in the "YRIvsCEU" >> > parameter, so you'd probably sort on the p.value.YRIvsCEU column ... >> > >> > Cheers, >> > Mark >> > >> > >> > >> > >> > > >> > > >> > > >> > > >> > > >> > > On Tue, Apr 28, 2009 at 4:29 PM, Mark Robinson >> > > <mrobin...@wehi.edu.au> wrote: >> > > >> > > Hi Xinjun. >> > > >> > > Comments below. >> > > >> > > >> > > On 28/04/2009, at 12:25 PM, Xinjun Zhang wrote: >> > > >> > > > Hi Mark: >> > > > >> > > > Thanks very much for your clarification! Now I have approached to >> > > > limma analysis of FIRMA score to get differentially spliced genes >> > > > ( and also splicing pattern of each ). But I still have some >> > > > difficulty to understand the code ( in red ) below in Limma >> > > analysis: >> > > > >> > > > #fs is the 'standard' FirmaSet-object >> > > > fsDF <- extractDataFrame(fs, addNames=TRUE) >> > > > fsDF[,-c(1:5)] <- log2(fsDF[,-c(1:5)]) # I know why >> > log2 >> > > > is here but confused by fsDF[,c(1:5)] .... what does this >> > expression >> > > > mean? >> > > >> > > >> > > Note that it is -c(1:5), meaning operate on (here, take logs) all of >> > > the columns except 1:5 ... that is, because extractDataFrame gives >> > > some extra columns at the beginning that are NOT data, we only >> > want to >> > > log that columns that have actual data. >> > > >> > > >> > > > design <- cbind(Grp1=1,Grp2=c(rep(0,n_1),rep(1,n_2))) >> > > > fit<-lmFit(fsDF[,-c(1:5)],design) >> > > > fit<-eBayes(fit) >> > > > fit$genes<-fsDF[,1] # Can I also get seperate splicing >> > > > patterns for the two differentially spliced genes from two group >> > > > (control and treatment )? >> > > >> > > >> > > I'm not sure what you are asking here. The probesets where the >> > "Grp2" >> > > coefficient is significantly different from 0 may highlight >> > > differentially spliced exons. Does that help? >> > > >> > > Mark >> > > >> > > >> > > >> > > >> > > >> > > > Thanks in advance! >> > > > >> > > > Xinjun >> > > > >> > > > On Mon, Apr 27, 2009 at 6:19 AM, Mark Robinson >> > > > <mrobin...@wehi.edu.au> wrote: >> > > > >> > > > >> > > > Hi Xinjun. >> > > > >> > > > Quick comments below. >> > > > >> > > > >> > > > > Hi Mark: >> > > > > >> > > > > Thanks very much for your help and I have have got a quick start >> > > > on a >> > > > > small >> > > > > dataset that each group (control and treatment ) contains 4 >> > > > arrays. I have >> > > > > set up a file structure like this: >> > > > > ================================================= >> > > > > rawDate/ >> > > > > controlGroup/ >> > > > > HuEx-1_0-st-v1/ >> > > > > GSMXXXXXX.CEL >> > > > > GSMXXXXXX.CEL >> > > > > ............................ >> > > > > >> > > > > treatmentGroup/ >> > > > > HuEx-1_0-st-v1/ >> > > > > GSMXXXXXX.CEL >> > > > > GSMXXXXXX.CEL >> > > > > ............................ >> > > > > ==================================================* >> > > > >> > > > >> > > > This setup will need to be changed. You will want to put ALL >> > > samples >> > > > together to do the PLM fitting, normalization, FIRMA scoring, etc. >> > > > >> > > > Something like: >> > > > >> > > > rawData/ >> > > > thisExperiment/ >> > > > HuEx-1_0-st-v1/ >> > > > sample1.CEL >> > > > sample2.CEL >> > > > ... >> > > > sampleN.CEL >> > > > >> > > > >> > > > > This is my code ( my questions are in red):* >> > > > > >> > > > > library(aroma.affymetrix) >> > > > > >> > > > > #Getting annotation data files >> > > > > chipType <- "HuEx-1_0-st-v1" >> > > > > cdf <- AffymetrixCdfFile$byChipType(chipType) >> > > > > print(cdf) >> > > > > >> > > > > #Defining CEL set >> > > > > cs <- AffymetrixCelSet$byName("controlGroup", cdf=cdf) >> > > > > print(cs) >> > > > > >> > > > > #Background Adjustment and Normalization >> > > > > bc <- RmaBackgroundCorrection(cs) >> > > > > csBC <- process(bc,verbose=verbose) >> > > > > >> > > > > #quantile normalization >> > > > > qn <- QuantileNormalization(csBC, typesToUpdate="pm") ### I set >> > > the >> > > > > second >> > > > > parameter as "pm" as the chip type is Affymetrix exon array, is >> > > that >> > > > > right? >> > > > > print(qn) >> > > > > csN <- process(qn, verbose=verbose) >> > > > >> > > > This is fine. >> > > > >> > > > > >> > > > > #Summarization >> > > > > getCdf(csN) >> > > > > ## * Fit exon-by-exon*, change the value of mergeGroups to FALSE >> > > > in the >> > > > > ExonRmaPlm() call above. >> > > > > *plmEx *<- ExonRmaPlm(csN, mergeGroups=*FALSE*) >> > > > > print(*plmEx*) >> > > > > #To fit the PLM to all of the data, do: >> > > > > fit(*plmEx*, verbose=verbose) >> > > > > * >> > > > > And here is my problem:* >> > > > > firma <- FirmaModel(plmTr) # I have noticd that FIRMA analysis >> > > > ONLY works >> > > > > from the PLM based on transcripts. So when the parameter is >> > > plmTr, I >> > > > > wonder >> > > > > how can it detect the splicing events of genes ? Should not the >> > > > parameter >> > > > > be >> > > > > plmEx? >> > > > > fit(firma, verbose=verbose) >> > > > > fs <- getFirmaScores(firma) >> > > > >> > > > Like it says on the group web page for Exon arrays: "The FIRMA >> > > > analysis >> > > > ONLY works from the PLM based on transcripts". This is NOT an >> > > error. >> > > > That's the way it works. >> > > > >> > > > The manuscript gives more details for why this is the case: >> > > > >> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/15/1707 >> > > > >> > > > Hope that helps. >> > > > >> > > > Cheers, >> > > > Mark >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > >> > > > > >> > > > > On Fri, Apr 24, 2009 at 5:24 PM, Mark Robinson >> > > > > <mrobin...@wehi.edu.au>wrote: >> > > > > >> > > > >> >> > > > >> Hi Xinjun. >> > > > >> >> > > > >> Here is a quick sketch of what I might do. >> > > > >> >> > > > >> 1. Run everything to get FIRMA scores. See group page for >> > > running >> > > > >> details and the Purdom Bioinformatics 2008 paper for >> > > methodological >> > > > >> details. >> > > > >> >> > > > >> 2a. If Nn or Nc > 1, use 'limma' to look for a difference in >> > > FIRMA >> > > > >> scores between your two groups. See threads: >> > > > >> >> > > > >> >> http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/36d8c59d742fc503/ >> > > > >> >> > > > >> >> http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/7d2645bd76cc2023/ >> > > > >> >> > > > >> 2b. If you have say patient samples (and a good number of >> > them), >> > > > you >> > > > >> might expect only a subset of your C or N patients to have a >> > > > splicing >> > > > >> aberration. In this case, maybe you just want to look for >> > large- >> > > > in- >> > > > >> magnitude FIRMA scores. >> > > > >> >> > > > >> ... maybe you also want to look at interesting probesets via >> > > > >> GenomeGraphs: >> > > > >> >> > > > >> >> http://groups.google.com/group/aroma-affymetrix/web/using-the-genomegraphs-package-with-firma >> > > > >> >> > > > >> Cheers, >> > > > >> Mark >> > > > >> >> > > > >> >> > > > >> On 24/04/2009, at 1:01 AM, liszt wrote: >> > > > >> >> > > > >> > >> > > > >> > Hi: >> > > > >> > >> > > > >> > Now I have N CEL files from both normal and cancer tissue. >> > The >> > > > two >> > > > >> > groups contains Nn and Nc CEL files separately (Nn = Nc). I >> > > > want to >> > > > >> > investigate the difference in gene's splicing pattern >> > between >> > > > normal >> > > > >> > tissue and cancerous tissue. Would you give me some tips? ( I >> > > > have >> > > > >> > read the document and got no answers) Thanks! >> > > > >> > >> > > > >> > Xinjun >> > > > >> > > >> > > > >> >> > > > >> ------------------------------ >> > > > >> Mark Robinson >> > > > >> Epigenetics Laboratory, Garvan >> > > > >> Bioinformatics Division, WEHI >> > > > >> e: m.robin...@garvan.org.au >> > > > >> e: mrobin...@wehi.edu.au >> > > > >> p: +61 (0)3 9345 2628 >> > > > >> f: +61 (0)3 9347 0852 >> > > > >> ------------------------------ >> > > > >> >> > > > >> >> > > > >> >> > > > >> >> > > > >> >> > > > >> > >> > > > >> >> > > > > >> > > > > > >> > > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > > >> > > > >> > > >> > > ------------------------------ >> > > Mark Robinson >> > > Epigenetics Laboratory, Garvan >> > > Bioinformatics Division, WEHI >> > > e: m.robin...@garvan.org.au >> > > e: mrobin...@wehi.edu.au >> > > p: +61 (0)3 9345 2628 >> > > f: +61 (0)3 9347 0852 >> > > ------------------------------ >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > > >> > > >> > >> > ------------------------------ >> > Mark Robinson >> > Epigenetics Laboratory, Garvan >> > Bioinformatics Division, WEHI >> > e: m.robin...@garvan.org.au >> > e: mrobin...@wehi.edu.au >> > p: +61 (0)3 9345 2628 >> > f: +61 (0)3 9347 0852 >> > ------------------------------ >> > >> > >> > >> > >> > >> > >> > >> > >> > > >> > >> >> ------------------------------ >> Mark Robinson >> Epigenetics Laboratory, Garvan >> Bioinformatics Division, WEHI >> e: m.robin...@garvan.org.au >> e: mrobin...@wehi.edu.au >> p: +61 (0)3 9345 2628 >> f: +61 (0)3 9347 0852 >> ------------------------------ >> >> >> >> >> >> > >> > > > > --~--~---------~--~----~------------~-------~--~----~ When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example. You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. To post to this group, send email to aroma-affymetrix@googlegroups.com To unsubscribe from this group, send email to aroma-affymetrix-unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en -~----------~----~----~----~------~----~------~--~---