Re: [R] Newbie clustering/classification question

Sean Davis Sun, 26 Mar 2006 03:54:39 -0800

Mark A. Miller wrote:
>       My laboratory is measuring the abundance of various proteins in the
> blood from either healthy individuals or from individuals with various
> diseases.  I would like to determine which proteins, if any, have
> significantly different abundances between the healthy and diseased
> individuals.  Currently, one of my colleagues is performing an ANOVA on
> each protein with MS Excel.  I would like to analyze the data sets with
> a scriptable tool, like R.  I could use another tool, but I am trying
> to stick to open source.  I have basic procedural programming skills (I
> do a lot of PHP/MySQL), but I'm not very good with anything that
> requires thinking in vectors and matrices. 
>       One approach I'm imagining is looping through all of the columns and
> doing an ANOVA, like my colleague is doing manually.  I have heard
> other people in my field talking about other tests for this kind of
> data.  Would a Kruskal-Wallis test, hierarchical data clustering,
> principal component analysis, or random forests be appropriate for the
> question I am asking?  If so, how would I write a reusable script for
> the test?  The data table will always have the same basic structure,
> but the number of proteins could vary, as could the number of
> conditions or the number of repeats within each condition.
>       I especially want to export the results of this test in a format
> roughly like the example below.  (I'd like the mean of each protein's
> abundance for each condition, some measure of variability within each
> condition, and a measure of significance for whether the protein
> abundances are different between conditions.)  I have gotten to the
> point of doing an ANOVA on a single protein in R and viewing the results
> interactively, but I have no idea how to analyze the differences for
> all of the proteins (in a loop, or all at once) or how to save the
> results to a file.
> Any suggestions?
>
> Example input (tab delimited)
> condition     protA   protB   protC   protD   protE   protF   protG   protH
> healthy1      11111   22222   33333   70681   61735   66666   77777   88888
> healthy1      12121   21111   32132   57230   69715   67890   87878   98989
> healthy1      10101   20202   30303   67223   51967   65656   78900   111111
> healthy2      12345   23111   32100   65931   67650   60001   80001   101010
> healthy2      13333   21231   34111   58761   54086   60002   80002   122222
> healthy2      13232   20101   30009   68752   70360   60003   80003   91919
> asthma        32132   19889   30733   59959   71783   60237   65603   20374
> asthma        34344   20483   31182   70531   59630   40445   56370   98404
> asthma        39999   20464   29793   58395   66976   50577   39908   65367
> diabetes      10000   20102   29486   51260   68447   42960   50875   216227
> diabetes      10111   19143   31275   52573   55459   71337   53090   151505
> diabetes      10001   21790   31470   54222   57318   64058   44166   207427
> diabetes      15555   20123   30131   59882   71191   46203   44633   197430
> acne  12222   31221   51381   64431   55016   43463   60388   74243
> acne  12221   30535   49199   61419   65096   71551   41811   104317
> acne  10001   30649   49199   56731   69871   61816   44321   125068
>
>
> Desired output
> condition     protA   protB   protC   protD   protE   protF   protG   protH
> healthy1.mean                                                         
> healthy1.sd                                                           
> healthy1.pval                                                         
> healthy2.mean                                                         
> healthy2.sd                                                           
> healthy2.pval                                                         
> asthma.mean                                                           
> asthma.sd                                                             
> asthma.pval                                                           
> diabetes.mean                                                         
> diabetes.sd                                                           
> diabetes.pval                                                         
> acne.mean                                                             
> acne.sd                                                               
> acne.pval                                                             
>   
Hi, Mark.  With data like these, you will want to look at the 
Bioconductor (http://www.bioconductor.org) project.  If you transpose 
your matrix so that individuals are in columns and proteins are in rows, 
then your data are in exactly the same form as microarray expression 
data, so most of the tools in Bioconductor will apply.  In addition, 
there are tools specifically designed for mass-spec data.  For your 
question directly, look at the limma package; it fits a linear model to 
each protein, which gives you a protein-by-protein ANOVA.  The package 
comes with an extensive user guide.
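
As a rough illustration of the transpose-then-limma route described above, here is a minimal sketch using the example table from the original post.  Several things in it are assumptions rather than anything stated in this thread: the log2 transform, the choice of healthy1 as the reference level, and the variable names.  limma is installed from Bioconductor rather than CRAN, so the sketch skips the model fit if the package is not available.

```r
# Example table from the original post, read from inline text.
dat <- read.table(text = "
condition protA protB protC protD protE protF protG protH
healthy1 11111 22222 33333 70681 61735 66666 77777 88888
healthy1 12121 21111 32132 57230 69715 67890 87878 98989
healthy1 10101 20202 30303 67223 51967 65656 78900 111111
healthy2 12345 23111 32100 65931 67650 60001 80001 101010
healthy2 13333 21231 34111 58761 54086 60002 80002 122222
healthy2 13232 20101 30009 68752 70360 60003 80003 91919
asthma 32132 19889 30733 59959 71783 60237 65603 20374
asthma 34344 20483 31182 70531 59630 40445 56370 98404
asthma 39999 20464 29793 58395 66976 50577 39908 65367
diabetes 10000 20102 29486 51260 68447 42960 50875 216227
diabetes 10111 19143 31275 52573 55459 71337 53090 151505
diabetes 10001 21790 31470 54222 57318 64058 44166 207427
diabetes 15555 20123 30131 59882 71191 46203 44633 197430
acne 12222 31221 51381 64431 55016 43463 60388 74243
acne 12221 30535 49199 61419 65096 71551 41811 104317
acne 10001 30649 49199 56731 69871 61816 44321 125068
", header = TRUE)

# Keep the conditions in file order; the first level (healthy1) becomes
# the reference group in the design matrix (an assumption of this sketch).
condition <- factor(dat$condition, levels = unique(dat$condition))

# Transpose so proteins are rows and individuals are columns.
# The log2 transform is an assumption, not something the thread specifies.
expr <- log2(t(as.matrix(dat[, -1])))

tab <- NULL
if (requireNamespace("limma", quietly = TRUE)) {
  design <- model.matrix(~ condition)   # intercept + one column per other group
  fit <- limma::lmFit(expr, design)     # one linear model per protein
  fit <- limma::eBayes(fit)             # moderated statistics
  # Testing all non-intercept coefficients together gives an overall
  # (moderated) F-test per protein, analogous to a one-way ANOVA.
  tab <- limma::topTable(fit, coef = 2:ncol(design), number = Inf)
  print(tab)
}
```

The resulting table has one row per protein, with an F statistic, a raw p-value (P.Value) and a multiple-testing-adjusted p-value (adj.P.Val).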

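For comparison, the same protein-by-protein idea can be sketched in base R without any add-on packages.  This is not limma's method, just an ordinary one-way ANOVA (plus a Kruskal-Wallis test, which the original post asks about) applied to each protein in turn.  The output file name and the variable names are made up for illustration, and only three of the example proteins are included to keep it short.

```r
# Three proteins from the example table in the original post.
dat <- read.table(text = "
condition protA protB protH
healthy1 11111 22222 88888
healthy1 12121 21111 98989
healthy1 10101 20202 111111
healthy2 12345 23111 101010
healthy2 13333 21231 122222
healthy2 13232 20101 91919
asthma 32132 19889 20374
asthma 34344 20483 98404
asthma 39999 20464 65367
diabetes 10000 20102 216227
diabetes 10111 19143 151505
diabetes 10001 21790 207427
diabetes 15555 20123 197430
acne 12222 31221 74243
acne 12221 30535 104317
acne 10001 30649 125068
", header = TRUE)
# For a real tab-delimited file, read.delim() would replace the call above.

condition <- factor(dat$condition, levels = unique(dat$condition))

# Proteins in rows, individuals in columns, as suggested in the reply.
prot <- t(as.matrix(dat[, -1]))

# Mean and standard deviation of each protein within each condition.
means <- t(apply(prot, 1, function(x) tapply(x, condition, mean)))
sds   <- t(apply(prot, 1, function(x) tapply(x, condition, sd)))
colnames(means) <- paste0(colnames(means), ".mean")
colnames(sds)   <- paste0(colnames(sds), ".sd")

# One overall p-value per protein from a one-way ANOVA and from a
# Kruskal-Wallis test; the number of proteins and conditions is not fixed.
p_anova <- apply(prot, 1, function(x) anova(lm(x ~ condition))[["Pr(>F)"]][1])
p_kw    <- apply(prot, 1, function(x) kruskal.test(x, condition)$p.value)

result <- data.frame(protein = rownames(prot), means, sds,
                     anova.p = p_anova, kruskal.p = p_kw,
                     check.names = FALSE)

# Adjust the ANOVA p-values for testing many proteins (Benjamini-Hochberg).
result$anova.p.adj <- p.adjust(result$anova.p, method = "BH")

# Save as tab-delimited text; the file name is hypothetical.
out_file <- file.path(tempdir(), "protein_summary.txt")
write.table(result, file = out_file, sep = "\t", quote = FALSE,
            row.names = FALSE)
```

Note that both tests give one p-value per protein (are the conditions different at all?), not one per condition as in the desired-output layout; per-condition comparisons would need contrasts or post-hoc tests on top of this.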

Sean

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
