> Is this the way in which one decides for OLS vs. PGLS?

If you have the same set of independent variables, then you just prefer the one (OLS or PGLS) with the higher likelihood. So far as I am told by Joe Felsenstein, you cannot do a ln maximum likelihood ratio test because the number of parameters is the same, although this paper seems to suggest otherwise:

Mooers, A. O., S. M. Vamosi, and D. Schluter. 1999. Using phylogenies to test macroevolutionary hypotheses of trait evolution in cranes (Gruinae). American Naturalist 154:249-259.
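A minimal sketch of the equal-parameter case, not part of the original message: when two models estimate the same number of parameters, the AIC penalty term cancels, so ranking by AIC is the same as ranking by log-likelihood. The log-likelihood values and the parameter count below are made up for illustration, and `aic` is a hypothetical helper, not a function from any phylogenetics package.

```python
def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik


# Hypothetical log-likelihoods, invented for illustration only.
lnl_ols = -152.3
lnl_pgls = -147.8
k = 3  # assumed: intercept, slope, residual variance; identical for both fits

delta_lnl = lnl_pgls - lnl_ols
delta_aic = aic(lnl_pgls, k) - aic(lnl_ols, k)

# With equal k the 2k penalty cancels, so delta AIC equals -2 * delta lnL.
assert abs(delta_aic - (-2 * delta_lnl)) < 1e-9

# The model with the higher log-likelihood is therefore also the lower-AIC model.
preferred = "PGLS" if lnl_pgls > lnl_ols else "OLS"
print(preferred, round(delta_lnl, 2), round(delta_aic, 2))
```

The two fits here differ by zero estimated parameters, so a likelihood ratio statistic would have zero degrees of freedom, which is why no chi-square test is available in this case.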
For two models with the same set of independent variables, AIC does not add anything for you. If you go to something like Regression with an OU process modeled for the residuals, then you do have an additional parameter being estimated, and so you can do an ln maximum likelihood ratio test of that model versus OLS and versus PGLS. For example, see:

Lavin, S. R., W. H. Karasov, A. R. Ives, K. M. Middleton, and T. Garland, Jr. 2008. Morphometrics of the avian small intestine, compared with non-flying mammals: A phylogenetic perspective. Physiological and Biochemical Zoology 81:526-550. [provides Matlab Regressionv2.m, released as part of the PHYSIG package]

Gartner, G. E. A., J. W. Hicks, P. R. Manzani, D. V. Andrade, A. S. Abe, T. Wang, S. M. Secor, and T. Garland, Jr. 2010. Phylogeny, ecology, and heart position in snakes. Physiological and Biochemical Zoology 83:43-54.

Cheers,
Ted

Theodore Garland, Jr.
Professor
Department of Biology
University of California, Riverside
Riverside, CA 92521
Office Phone: (951) 827-3524
Wet Lab Phone: (951) 827-5724
Dry Lab Phone: (951) 827-4026
Home Phone: (951) 328-0820
Facsimile: (951) 827-4286 = Dept. office (not confidential)
Email: tgarl...@ucr.edu
http://www.biology.ucr.edu/people/faculty/Garland.html

Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments. 2009. Edited by Theodore Garland, Jr. and Michael R. Rose
http://www.ucpress.edu/book.php?isbn=9780520261808
(PDFs of chapters are available from me or from the individual authors)

________________________________________
From: r-sig-phylo-boun...@r-project.org [r-sig-phylo-boun...@r-project.org] on behalf of Mattia Prosperi [ahn...@gmail.com]
Sent: Wednesday, May 23, 2012 8:05 AM
To: r-sig-phylo@r-project.org
Subject: [R-sig-phylo] PIC or PGLS for genome-wide SNP screening

Dear all,

I am working on a data set composed of bacterial genomic sequences (a few genes) associated with phenotypic values (in-vitro resistance to antibiotics, a numerical value discretised into a binary class). Of note, the bacterial isolates were sampled non-uniformly at different times and locations, thus with a possible sampling bias. The data set is ~1,000 variables and ~1,000 observations.

I have been applying several methods for developing a model to predict antibiotic resistance from the single nucleotide polymorphisms (SNPs) extracted from a multiple alignment, applying classical statistical learning and feature selection methods. Eventually, I found that a logistic regression with main effects, where the variables were selected first by a univariable chi-square screening and then by AIC stepwise selection, was as good as other more complex and non-linear methods (such as random forests) when comparing the distributions of different loss functions (AUROC, specificity, sensitivity) over multiple cross-validation runs. Also, the SNP sets selected by different approaches were highly similar and consistent across several bootstrap evaluations.

I found that a few relevant (even after Bonferroni correction) SNPs were located in gene regions that are not supposed to be related to antibiotic resistance. I thought that this might be a consequence of neutral mutations that became fixed in the population by chance after a genetic bottleneck (e.g. antibiotic pressure).
I'd like to understand whether such a SNP that is associated with antibiotic resistance (and actually not expected to be) is indeed just a random mutation of an early isolate that was carrying the true resistance SNP (in another gene region) and that was selected by the antibiotic pressure, thus transferring both the true resistance SNP and the "hitchhiking" ones to the offspring. Unfortunately it is not easy to cross-tabulate SNPs in different genes because not all isolates have been sequenced for the same set of genes.

In order to check for fake/true SNPs associated with resistance, I thought I might use a PIC or PGLS approach (after estimating a phylogenetic tree from the multiple alignment), in the same settings as the original analysis, i.e. a model selection approach with both feature and performance evaluation (well, since the coefficients of PGLS/OLS are the same, it's just a matter of standard errors and feature set selection), regressing the resistance class as a dependent variable and using the SNPs as covariates.

Is this a reasonable approach? Does it make sense to set up, for instance, an AIC stepwise selection within a PGLS model? I know that there is a way to check for phylogenetic signal and therefore decide if the PGLS approach should be employed. Is this the way in which one decides for OLS vs. PGLS?

Last but not least, which is the most appropriate covariance matrix calculation and PGLS implementation for this input-output set (i.e. categorical variables, binary class)? The "brunch" function within caper, or compar.gee within ape?

Thanks and apologies if some of the questions are silly.

M. Prosperi.

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo