[aroma.affymetrix] Re: Compare the splicing pattern of two samples

Mark Robinson Wed, 06 May 2009 15:32:50 -0700

> Hi Mark:
>
> After I have obtained the statistically significant differentially
> expression(DE) exons between the two groups using limma, I want further to
> explore the splicing pattern of the DE exons in each group. As I know, for
> each array, FIRMA scores each exon as to whether its probes systematically
> deviate from the expected gene expression level. Is there any way to get a
> summary of firma score for each exon across all the arrays belonging to
> the
> same group? Thanks!



How about the average of the FIRMA scores as a summary?

Mark


>
> Xinjun
>
> On Mon, May 4, 2009 at 7:40 AM, Mark Robinson <mrobin...@wehi.edu.au>
> wrote:
>
>>
>> Hi Xinjun.
>>
>> Here, 'unitName' is the transcript cluster id and 'groupName' is the
>> probeset id, as defined by Affymetrix.  The 'unit', 'group' and 'cell'
>> columns are indices and you may not need these.  To find out what the
>> unitName and groupName correspond to, I would consult the Affymetrix
>> annotation files.
>>
>> (Assuming Human Exon 1.0 ST), if you go to:
>>
>> http://www.affymetrix.com/products_services/arrays/specific/exon.affx#1_4
>>
>> and find the section "Current NetAffx Annotation Files" you'll find 2
>> CSV files that you can download, one for transcript clusters and one
>> for probesets.  These should give you genome coordinates, Genbank/
>> RefSeq/Ensembl identifiers, gene symbols, etc.
>>
>> Hope that helps.
>> Mark
>>
>>
>> On 04/05/2009, at 12:02 AM, Xinjun Zhang wrote:
>>
>> > Hi:
>> >
>> > Thanks very much for your great help. But I still have difficulty in
>> > understanding the first 5 columns of fsDF, as you have taken as an
>> > example:
>> >
>> > > head(fsDF[,1:5])
>> >   unitName groupName unit group cell
>> > 1  2315251   2315252    1     1    1
>> > 2  2315251   2315253    1     2    2
>> > 3  2315373   2315374    2     1    3
>> > 4  2315373   2315375    2     2    4
>> > 5  2315373   2315376    2     3    5
>> > 6  2315373   2315377    2     4    6
>> >
>> > What does each column, especially the unitName and groupName mean ?
>> > And how can I correlate unitName and groupName to gene name and exon
>> > number? Thanks in advance!
>> >
>> > Xinjun
>> >
>> > On Thu, Apr 30, 2009 at 3:31 PM, Mark Robinson
>> > <mrobin...@wehi.edu.au> wrote:
>> >
>> > Hi Xinjun.
>> >
>> > Comments below.
>> >
>> >
>> > On 30/04/2009, at 12:33 AM, Xinjun Zhang wrote:
>> >
>> > > Hi:
>> > >
>> > > Sorry for the second ambiguous question. Now I will give it out in
>> > > another way:
>> > >
>> > > After limma analysis:
>> > >
>> > > # two groups: CEU and YRI
>> > > >design
>> > >                  CEU    YRIvsCEU
>> > > GSM188687    1    0
>> > > GSM188688    1    0
>> > > GSM188861    1    1
>> > > GSM188862    1    1
>> > >
>> > > #Limma
>> > > fsDF <- extractDataFrame(fs, addNames=TRUE)
>> > > fsDF[,-c(1:5)] <- log2(fsDF[,-c(1:5)])
>> > > fit<-lmFit(fsDF[,-c(1:5)],design)
>> > > fit<-eBayes(fit)
>> > > fit$genes<-fsDF[,1]
>> > > topTable(fit, coef=NULL, number = 10, adjust="BH")
>> > >
>> > > Then I got output in RGui in this form:
>> > >
>> > > > topTable(fit,coef="YRIvsCEU",adjust="BH")
>> > >             ID      logFC         t      P.Value adj.P.Val         B
>> > > 248067 3851537  10.781228  18.01769 5.069965e-07 0.1441163
>> > > -4.159301        ## In this line, is "248067 " a NCBI geneid d and
>> > > "3851537" a probeset id?  And if the first coloum is gene ID, what
>> > > is strange to me is that  some of the ID is not a human gene id.
>> > > 219150 3721400 -12.364204 -14.91257 1.798048e-06 0.2146088 -4.164325
>> > > 90041  2903401  -8.915503 -13.79270 3.021449e-06 0.2146088 -4.166979
>> > > 80808  2836738   7.811150  13.45085 3.568320e-06 0.2146088 -4.167917
>> > > 250529 3862018  -7.935552 -13.33698 3.774934e-06 0.2146088 -4.168245
>> > > 176674 3462843  10.559478  12.92835 4.637400e-06 0.2158173 -4.169490
>> > > 224640 3744039   7.930627  12.66410 5.314668e-06 0.2158173 -4.170356
>> > > 134937 3224650  -9.385466 -12.18252 6.860763e-06 0.2437758 -4.172073
>> > > 155392 3352948  -6.802731 -11.71350 8.878365e-06 0.2804133 -4.173938
>> > > 104503 3003193  -8.947865 -11.50151 1.000676e-05 0.2844473 -4.174852
>> >
>> > Be careful here.  The first number you see here ("248067") is a row
>> > number that limma puts in, and is not an gene/probeset identifier of
>> > any kind.  The "ID" column is what you have put in fit$genes (see
>> > above, you have the command "fit$genes <- fsDF[,1]").
>> >
>> > I would actually recommend that you put more in fit$genes, because
>> > what you have have now is only the first column of the fsDF data
>> > frame, which gives the transcript_cluster_id.  So, you have the
>> > transcript_cluster_id, but you don't know what probeset_id this
>> > corresponds to.
>> >
>> > For example:
>> >
>> >  > head(fsDF[,1:5])
>> >   unitName groupName unit group cell
>> > 1  2315251   2315252    1     1    1
>> > 2  2315251   2315253    1     2    2
>> > 3  2315373   2315374    2     1    3
>> > 4  2315373   2315375    2     2    4
>> > 5  2315373   2315376    2     3    5
>> > 6  2315373   2315377    2     4    6
>> >
>> > Maybe it would be better to do something like:
>> >
>> > ...
>> > fit$genes <- fsDF[,1:2]
>> >
>> > ... that way your output table will look something like:
>> >
>> >  > topTable(fit,coef=1,n=2)
>> >      unitName groupName     logFC         t      P.Value
>> > adj.P.Val        B
>> > 4166  3581637   3582005 -2.581813 -12.65587 3.786135e-13 2.342177e-09
>> > 19.22332
>> > 4459  2656818   2656822 -2.951434 -12.54443 4.684822e-13 2.342177e-09
>> > 19.03846
>> >
>> >
>> >
>> >
>> > > >
>> > > > topTable(fit,coef=NULL,adjust.method="BH") # what is the
>> > > difference of the two forms of output? I mean, what does the column
>> > > "CEU" and "YRIvsCEU" mean?
>> > >        ProbeID       CEU   YRIvsCEU         F      P.Value adj.P.Val
>> > > 248067 3851537 -5.235567  10.781228 162.45279 1.705464e-06 0.4847866
>> > > 219150 3721400  6.152369 -12.364204 111.19496 6.042656e-06 0.7176463
>> > > 90041  2903401  4.511199  -8.915503  95.13301 1.013043e-05 0.7176463
>> > > 80808  2836738 -3.606537   7.811150  90.99296 1.173308e-05 0.7176463
>> > > 250529 3862018  3.970771  -7.935552  88.93762 1.265099e-05 0.7176463
>> > > 176674 3462843 -5.305275  10.559478  83.57312 1.552603e-05 0.7176463
>> > > 224640 3744039 -3.792070   7.930627  80.34278 1.767260e-05 0.7176463
>> > > 134937 3224650  4.691757  -9.385466  74.20693 2.292795e-05 0.8146732
>> > > 155392 3352948  3.399463  -6.802731  68.60307 2.962962e-05 0.9358188
>> > > 104503 3003193  4.306241  -8.947865  66.23524 3.322071e-05 0.9443152
>> >
>> >
>> > This is where you should read and understand what limma (and
>> > specifically the topTable() routine) is doing.  Briefly, when you say
>> > coef=NULL, you are testing the significance of all parameters
>> > simultaneously, which results in a moderated F-statistic.  When you
>> > specify a single column (e.g. coef="YRIvsCEU" as you have), you are
>> > testing the significance of that single parameter, and you will get a
>> > moderated t-statistic.
>> >
>> >
>> >
>> > > And when I take a quick look at the output produced by the function
>> > > write.fit( ) :
>> > > write.fit(fit,results=NULL, "limma_result.out",digits=3,
>> > > adjust="none",  method="separate")
>> > >
>> > > and the first line of the output file "limma_result.ou" is :
>> > >
>> > > Coef.CEU    Coef.YRIvsCEU    t.CEU    t.YRIvsCEU    p.value.CEU
>> > > p.value.YRIvsCEU    F    F.p.value    Genes
>> > > -0.5    0.955    -1.08    1.46    0.31653    0.18906    1.07
>> > > 0.39541    2315251
>> > > 0.41    -0.632    0.88    -0.96    0.4071    0.36826    0.5
>> > > 0.62453    2315251
>> > > 0.168    -0.384    0.41    -0.66    0.69603    0.53134    0.22
>> > > 0.80748    2315373
>> > > -0.597    1.1    -1.44    1.88    0.19287    0.10287    1.78
>> > > 0.23803    2315373
>> > > 0.276    -0.396    0.68    -0.69    0.51883    0.51368    0.27
>> > > 0.76801    2315373
>> > > 0.118    -0.214    0.25    -0.32    0.80819    0.75689    0.05
>> > > 0.94922    2315373
>> > > -0.018    0.128    -0.04    0.21    0.96823    0.83876    0.03
>> > > 0.9667    2315554
>> > > -0.027    0.191    -0.06    0.31    0.95283    0.76783    0.07
>> > > 0.9317    2315554
>> > > 0.276    -0.629    0.65    -1.05    0.53693    0.33006    0.56
>> > > 0.59639    2315554
>> > > 0.405    -0.552    0.88    -0.85    0.40726    0.4232    0.44
>> > > 0.66023    2315554
>> > > 0.109    -0.199    0.23    -0.3    0.82444    0.77509    0.04
>> > > 0.95666    2315554
>> > > -0.318    0.771    -0.63    1.08    0.54789    0.31597    0.6
>> > > 0.57392    2315554
>> > > .....
>> > >
>> > > It got only a colomn called "Genes" ( it is probe id, I guess). So
>> > > how can I use this output to find out the differentially expressed
>> > > exons( and the corresponding genes) in the two groups? Is it clearer
>> > > to you? Thanks in advance!
>> > >
>> > > Xinjun
>> >
>> >
>> > As I mentioned above, you should change what you put in fit$genes to
>> > allow you to know what probeset_id each row corresponds to.  I'm
>> > guessing here, but you are probably interested in the "YRIvsCEU"
>> > parameter, so you'd probably sort on the p.value.YRIvsCEU column ...
>> >
>> > Cheers,
>> > Mark
>> >
>> >
>> >
>> >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > On Tue, Apr 28, 2009 at 4:29 PM, Mark Robinson
>> > > <mrobin...@wehi.edu.au> wrote:
>> > >
>> > > Hi Xinjun.
>> > >
>> > > Comments below.
>> > >
>> > >
>> > > On 28/04/2009, at 12:25 PM, Xinjun Zhang wrote:
>> > >
>> > > > Hi Mark:
>> > > >
>> > > > Thanks very much for your clarification! Now I have approached to
>> > > > limma analysis of FIRMA score to get differentially spliced genes
>> > > > ( and also splicing pattern of each ). But I still have some
>> > > > difficulty to understand the code ( in red ) below in Limma
>> > > analysis:
>> > > >
>> > > >         #fs is the 'standard' FirmaSet-object
>> > > >         fsDF <- extractDataFrame(fs, addNames=TRUE)
>> > > >         fsDF[,-c(1:5)] <- log2(fsDF[,-c(1:5)])    # I know why
>> > log2
>> > > > is here but confused by fsDF[,c(1:5)] .... what does this
>> > expression
>> > > > mean?
>> > >
>> > >
>> > > Note that it is -c(1:5), meaning operate on (here, take logs) all of
>> > > the columns except 1:5 ... that is, because extractDataFrame gives
>> > > some extra columns at the beginning that are NOT data, we only
>> > want to
>> > > log that columns that have actual data.
>> > >
>> > >
>> > > >         design <- cbind(Grp1=1,Grp2=c(rep(0,n_1),rep(1,n_2)))
>> > > >         fit<-lmFit(fsDF[,-c(1:5)],design)
>> > > >         fit<-eBayes(fit)
>> > > >         fit$genes<-fsDF[,1] # Can I also get seperate splicing
>> > > > patterns for the  two differentially spliced genes from two group
>> > > > (control and treatment )?
>> > >
>> > >
>> > > I'm not sure what you are asking here.  The probesets where the
>> > "Grp2"
>> > > coefficient is significantly different from 0 may highlight
>> > > differentially spliced exons.  Does that help?
>> > >
>> > > Mark
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > > Thanks in advance!
>> > > >
>> > > > Xinjun
>> > > >
>> > > > On Mon, Apr 27, 2009 at 6:19 AM, Mark Robinson
>> > > > <mrobin...@wehi.edu.au> wrote:
>> > > >
>> > > >
>> > > > Hi Xinjun.
>> > > >
>> > > > Quick comments below.
>> > > >
>> > > >
>> > > > > Hi Mark:
>> > > > >
>> > > > > Thanks very much for your help and I have have got a quick start
>> > > > on a
>> > > > > small
>> > > > > dataset that each group (control and treatment ) contains 4
>> > > > arrays. I have
>> > > > > set up a file structure like this:
>> > > > > =================================================
>> > > > > rawDate/
>> > > > >      controlGroup/
>> > > > >               HuEx-1_0-st-v1/
>> > > > >                       GSMXXXXXX.CEL
>> > > > >                       GSMXXXXXX.CEL
>> > > > >                       ............................
>> > > > >
>> > > > >      treatmentGroup/
>> > > > >               HuEx-1_0-st-v1/
>> > > > >                       GSMXXXXXX.CEL
>> > > > >                       GSMXXXXXX.CEL
>> > > > >                       ............................
>> > > > > ==================================================*
>> > > >
>> > > >
>> > > > This setup will need to be changed.  You will want to put ALL
>> > > samples
>> > > > together to do the PLM fitting, normalization, FIRMA scoring, etc.
>> > > >
>> > > > Something like:
>> > > >
>> > > > rawData/
>> > > >     thisExperiment/
>> > > >              HuEx-1_0-st-v1/
>> > > >                      sample1.CEL
>> > > >                      sample2.CEL
>> > > >                      ...
>> > > >                      sampleN.CEL
>> > > >
>> > > >
>> > > > >  This is my code ( my questions are in red):*
>> > > > >
>> > > > > library(aroma.affymetrix)
>> > > > >
>> > > > > #Getting annotation data files
>> > > > > chipType <- "HuEx-1_0-st-v1"
>> > > > > cdf <- AffymetrixCdfFile$byChipType(chipType)
>> > > > > print(cdf)
>> > > > >
>> > > > > #Defining CEL set
>> > > > > cs <- AffymetrixCelSet$byName("controlGroup", cdf=cdf)
>> > > > > print(cs)
>> > > > >
>> > > > > #Background Adjustment and Normalization
>> > > > > bc <- RmaBackgroundCorrection(cs)
>> > > > > csBC <- process(bc,verbose=verbose)
>> > > > >
>> > > > > #quantile normalization
>> > > > > qn <- QuantileNormalization(csBC, typesToUpdate="pm")  ### I set
>> > > the
>> > > > > second
>> > > > > parameter as "pm" as the chip type is Affymetrix exon array, is
>> > > that
>> > > > > right?
>> > > > > print(qn)
>> > > > > csN <- process(qn, verbose=verbose)
>> > > >
>> > > > This is fine.
>> > > >
>> > > > >
>> > > > > #Summarization
>> > > > > getCdf(csN)
>> > > > > ## * Fit exon-by-exon*, change the value of mergeGroups to FALSE
>> > > > in the
>> > > > > ExonRmaPlm() call above.
>> > > > > *plmEx *<- ExonRmaPlm(csN, mergeGroups=*FALSE*)
>> > > > > print(*plmEx*)
>> > > > > #To fit the PLM to all of the data, do:
>> > > > > fit(*plmEx*, verbose=verbose)
>> > > > > *
>> > > > > And here is my problem:*
>> > > > > firma <- FirmaModel(plmTr)  # I have noticd that FIRMA analysis
>> > > > ONLY works
>> > > > > from the PLM based on transcripts. So when the parameter is
>> > > plmTr, I
>> > > > > wonder
>> > > > > how can it detect the splicing events of genes ? Should not the
>> > > > parameter
>> > > > > be
>> > > > > plmEx?
>> > > > > fit(firma, verbose=verbose)
>> > > > > fs <- getFirmaScores(firma)
>> > > >
>> > > > Like it says on the group web page for Exon arrays: "The FIRMA
>> > > > analysis
>> > > > ONLY works from the PLM based on transcripts".  This is NOT an
>> > > error.
>> > > > That's the way it works.
>> > > >
>> > > > The manuscript gives more details for why this is the case:
>> > > >
>> http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/15/1707
>> > > >
>> > > > Hope that helps.
>> > > >
>> > > > Cheers,
>> > > > Mark
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > >
>> > > > >
>> > > > > On Fri, Apr 24, 2009 at 5:24 PM, Mark Robinson
>> > > > > <mrobin...@wehi.edu.au>wrote:
>> > > > >
>> > > > >>
>> > > > >> Hi Xinjun.
>> > > > >>
>> > > > >> Here is a quick sketch of what I might do.
>> > > > >>
>> > > > >> 1. Run everything to get FIRMA scores.  See group page for
>> > > running
>> > > > >> details and the Purdom Bioinformatics 2008 paper for
>> > > methodological
>> > > > >> details.
>> > > > >>
>> > > > >> 2a. If Nn or Nc > 1, use 'limma' to look for a difference in
>> > > FIRMA
>> > > > >> scores between your two groups.  See threads:
>> > > > >>
>> > > > >>
>> http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/36d8c59d742fc503/
>> > > > >>
>> > > > >>
>> http://groups.google.com/group/aroma-affymetrix/browse_thread/thread/7d2645bd76cc2023/
>> > > > >>
>> > > > >> 2b. If you have say patient samples (and a good number of
>> > them),
>> > > > you
>> > > > >> might expect only a subset of your C or N patients to have a
>> > > > splicing
>> > > > >> aberration.  In this case, maybe you just want to look for
>> > large-
>> > > > in-
>> > > > >> magnitude FIRMA scores.
>> > > > >>
>> > > > >> ... maybe you also want to look at interesting probesets via
>> > > > >> GenomeGraphs:
>> > > > >>
>> > > > >>
>> http://groups.google.com/group/aroma-affymetrix/web/using-the-genomegraphs-package-with-firma
>> > > > >>
>> > > > >> Cheers,
>> > > > >> Mark
>> > > > >>
>> > > > >>
>> > > > >> On 24/04/2009, at 1:01 AM, liszt wrote:
>> > > > >>
>> > > > >> >
>> > > > >> > Hi:
>> > > > >> >
>> > > > >> > Now I have  N CEL files from both normal and cancer tissue.
>> > The
>> > > > two
>> > > > >> > groups contains Nn and Nc CEL files separately (Nn = Nc). I
>> > > > want to
>> > > > >> > investigate the difference in gene's splicing pattern
>> > between
>> > > > normal
>> > > > >> > tissue and cancerous tissue. Would you give me some tips? ( I
>> > > > have
>> > > > >> > read the document and got no answers) Thanks!
>> > > > >> >
>> > > > >> > Xinjun
>> > > > >> > >
>> > > > >>
>> > > > >> ------------------------------
>> > > > >> Mark Robinson
>> > > > >> Epigenetics Laboratory, Garvan
>> > > > >> Bioinformatics Division, WEHI
>> > > > >> e: m.robin...@garvan.org.au
>> > > > >> e: mrobin...@wehi.edu.au
>> > > > >> p: +61 (0)3 9345 2628
>> > > > >> f: +61 (0)3 9347 0852
>> > > > >> ------------------------------
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > >
>> > > >
>> > >
>> > > ------------------------------
>> > > Mark Robinson
>> > > Epigenetics Laboratory, Garvan
>> > > Bioinformatics Division, WEHI
>> > > e: m.robin...@garvan.org.au
>> > > e: mrobin...@wehi.edu.au
>> > > p: +61 (0)3 9345 2628
>> > > f: +61 (0)3 9347 0852
>> > > ------------------------------
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > >
>> > >
>> >
>> > ------------------------------
>> > Mark Robinson
>> > Epigenetics Laboratory, Garvan
>> > Bioinformatics Division, WEHI
>> > e: m.robin...@garvan.org.au
>> > e: mrobin...@wehi.edu.au
>> > p: +61 (0)3 9345 2628
>> > f: +61 (0)3 9347 0852
>> > ------------------------------
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > >
>> >
>>
>> ------------------------------
>> Mark Robinson
>> Epigenetics Laboratory, Garvan
>> Bioinformatics Division, WEHI
>> e: m.robin...@garvan.org.au
>> e: mrobin...@wehi.edu.au
>> p: +61 (0)3 9345 2628
>> f: +61 (0)3 9347 0852
>> ------------------------------
>>
>>
>>
>>
>>
>> >
>>
>
> >
>



--~--~---------~--~----~------------~-------~--~----~
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe from this group, send email to 
aroma-affymetrix-unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/aroma-affymetrix?hl=en
-~----------~----~----~----~------~----~------~--~---

[aroma.affymetrix] Re: Compare the splicing pattern of two samples

Reply via email to