[R-sig-phylo] PIC or PGLS for genome-wide SNP screening
Dear all, I am working on a data set composed of bacterial genomic sequences (a few genes) associated to phenotypic values (in-vitro resistance to antibiotics, a numerical value discretised into a binary class). Of note, the bacterial isolates were sampled non-uniformly at different times and locations, thus with a possible sampling bias. The data set is ~1,000 variables and ~1,000 observations. I have been applying several methods for developing a model to predict antibiotic resistance from the single nucleotide polymorphisms (SNP) extracted from a multiple alignment, applying classical statistical learning and feature selection methods. Eventually, I found that a logistic regression with main effects, where the variables were selected first by a univariable chi-square screening and then by AIC stepwise, was as good as other more complex and non-linear methods (such as random forests) by comparing different loss function (AUROC, specificity, sensitivity) distributions upon multiple cross-validation runs. Also, the SNP sets selected by different approaches were highly similar and consistent across several bootstrap evaluations. I found that a few relevant (even after Bonferroni correction) SNP were located in gene regions that are not supposed to be related with antibiotic resistance. I thought that this might be a consequence of neutral mutations that became fixed in the population by chance after a genetic bottleneck (e.g. antibiotic pressure). I'd like to understand if such SNP that is associated to antibiotic resistance (and actually not expected to be) is indeed just a random mutation of an early isolate that was carrying the true resistance SNP (in another gene region) and that was selected by the antibiotic pressure, thus transfering both the true resistance SNP and the hitchhicking ones to the offspring. Unfortunately it is not easy to cross-tabulate SNP in different genes because not all isolates have been sequenced the same set of genes. In order to check for fake/true SNP associated to resistance, I thought I might use a PIC or PGLS approach (after estimating a phylogenetic tree from the multiple alignment), in the same settings as the original analysis, i.e. a model selection approach with both feature and performance evaluation (well, since the coefficients of PGLS/OLS are the same, it's just a matter of standard errors and feature set selection), regressing the resistance class as a dependent variable and using the SNP as covariates. Is this a reasonable approach? Does it make sense to set up -for instance- an AIC stepwise selection within a PGLS modeling? I know that there is a way to check for phylogenetic signal and therefore decide if the PGLS approach shall be employed. Is this the way in which one decides for OLS vs. PGLS? Last but not least, which is the most appropriate covariance matrix calculation and PGLS implementation for this input-output set (i.e. categorical variables, binary class)? The brunch function within caper, or compar.gee within ape? Thanks and apologies if some of the questions are silly. M. Prosperi. ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] PIC or PGLS for genome-wide SNP screening
Is this the way in which one decides for OLS vs. PGLS? If you have the same set of independent variables, then you just prefer the one (OLS or PGLS) with the higher likelihood. So far as I am told by Joe Felsenstein, you cannot do a ln maximum likelihood ratio test because the number of parameters is the same, although this paper seems to suggest otherwise: Mooers, A. O., S. M. Vamosi, and D. Schluter. 1999. Using phylogenies to test macroevolutionary hypotheses of trait evolution in cranes (Gruinae). American Naturalist 154:249-259. For two models with the same set of independent variables, AIC does not add anything for you. If you go to something like Regression with an OU process modeled for the residuals, then you do have an additional parameter being estimated and so you can do an ln maximum likelihood ratio test of that model versus OLS and versus PGLS. For example, see: Lavin, S. R., W. H. Karasov, A. R. Ives, K. M. Middleton, and T. Garland, Jr. 2008. Morphometrics of the avian small intestine, compared with non-flying mammals: A phylogenetic perspective. Physiological and Biochemical Zoology 81:526-550. [provides Matlab Regressionv2.m, released as part of the PHYSIG package] Gartner, G. E. A., J. W. Hicks, P. R. Manzani, D. V. Andrade, A. S. Abe, T. Wang, S. M. Secor, and T. Garland, Jr. 2010. Phylogeny, ecology, and heart position in snakes. Physiological and Biochemical Zoology 83:43-54. Cheers, Ted Theodore Garland, Jr. Professor Department of Biology University of California, Riverside Riverside, CA 92521 Office Phone: (951) 827-3524 Wet Lab Phone: (951) 827-5724 Dry Lab Phone: (951) 827-4026 Home Phone: (951) 328-0820 Facsimile: (951) 827-4286 = Dept. office (not confidential) Email: tgarl...@ucr.edu http://www.biology.ucr.edu/people/faculty/Garland.html Experimental Evolution: Concepts, Methods, and Applications of Selection Experiments. 2009. Edited by Theodore Garland, Jr. and Michael R. Rose http://www.ucpress.edu/book.php?isbn=9780520261808 (PDFs of chapters are available from me or from the individual authors) From: r-sig-phylo-boun...@r-project.org [r-sig-phylo-boun...@r-project.org] on behalf of Mattia Prosperi [ahn...@gmail.com] Sent: Wednesday, May 23, 2012 8:05 AM To: r-sig-phylo@r-project.org Subject: [R-sig-phylo] PIC or PGLS for genome-wide SNP screening Dear all, I am working on a data set composed of bacterial genomic sequences (a few genes) associated to phenotypic values (in-vitro resistance to antibiotics, a numerical value discretised into a binary class). Of note, the bacterial isolates were sampled non-uniformly at different times and locations, thus with a possible sampling bias. The data set is ~1,000 variables and ~1,000 observations. I have been applying several methods for developing a model to predict antibiotic resistance from the single nucleotide polymorphisms (SNP) extracted from a multiple alignment, applying classical statistical learning and feature selection methods. Eventually, I found that a logistic regression with main effects, where the variables were selected first by a univariable chi-square screening and then by AIC stepwise, was as good as other more complex and non-linear methods (such as random forests) by comparing different loss function (AUROC, specificity, sensitivity) distributions upon multiple cross-validation runs. Also, the SNP sets selected by different approaches were highly similar and consistent across several bootstrap evaluations. I found that a few relevant (even after Bonferroni correction) SNP were located in gene regions that are not supposed to be related with antibiotic resistance. I thought that this might be a consequence of neutral mutations that became fixed in the population by chance after a genetic bottleneck (e.g. antibiotic pressure). I'd like to understand if such SNP that is associated to antibiotic resistance (and actually not expected to be) is indeed just a random mutation of an early isolate that was carrying the true resistance SNP (in another gene region) and that was selected by the antibiotic pressure, thus transfering both the true resistance SNP and the hitchhicking ones to the offspring. Unfortunately it is not easy to cross-tabulate SNP in different genes because not all isolates have been sequenced the same set of genes. In order to check for fake/true SNP associated to resistance, I thought I might use a PIC or PGLS approach (after estimating a phylogenetic tree from the multiple alignment), in the same settings as the original analysis, i.e. a model selection approach with both feature and performance evaluation (well, since the coefficients of PGLS/OLS are the same, it's just a matter of standard errors and feature set selection), regressing the resistance class as a dependent variable and using the SNP as covariates. Is this a reasonable approach? Does it make sense to set up -for instance- an AIC stepwise selection
[R-sig-phylo] General question about the impact of sampling bias in comparative and diversification analyses
Dear all, I have a 5-gene phylogenetic tree, dated, quite well resolved, including 50 species and covering 80~90% of the known diversity... I have also LASER diversification analyses and bayesian reconstruction of ancestral states for 5 characters , both taking into account the impact of sampling biases... I evaluate now the adaptive significance of 5 characters with ape/caper (using Pagel's test, GEE and pGLS) AND their impact of diversification (using the Yule model with covariates). My results makes perfectly sense in terms of biology but I am worrying about their their relevance. Indeed, for 2 of my characters, I have only 50% of data... In another hand, the distribution of data is well balanced among lineages... and I would say that corresponding patterns of evolution inferred from Bayesian analyses are pretty clear... I now the Beast team is working on an implementation of a Bayesian model for correlation analysis, that takes into account uncertainties (and thus missing data). But it is not ready yet. So I have a couple of -likely naive- questions: 1. Is there a way to evaluate the impact of this kind of sampling bias on comparative analyses and diversification analyses ? 2. My guess is that the biases will affect much more directly the Yule model with covariates than GEE, pGLS and Pagel's models... as removing taxa for which I don't have data from the tree will induce an additional bias on diversification rates. Am I right ? So far, I have removed all missing data in analyses including those 2 characters... My next question may sound crazy and/or desperate and I have a guess on the answer already :-), but... 3. How about assuming that the patterns inferred in my reconstruction of ancestral states are correct (free from unsampled reversal or acquisition that may drastically challenge conclusions) and completing real data with most probable states sampled during Bayesian approach along tip branches ? and then compare with results obtained without missing data ? 4 Should I definitely forget about the two characters for which I have only 50% data? Thanks for any answer/comment/suggestion ! Regards Julien Julien Lorion PhD, Post-doctoral fellow of the Japan Society for the Promotion of Science Japan Agency for Marine-Earth Science and Technology (JAMSTEC) Marine Ecosystems Research Department 2-15 Natsushima, Yokosuka 237-0061 Japan Phone: +81-46-867-9570, Fax: +81-46-867-9525 ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
[R-sig-phylo] problem with fitcontinuos function
Dear all, fitcontinuous is giving me a warning generally related to non correspondence between tree tips and data. However name.check and the tree tips seems ok. Anybody has a hint on this? Thanks in advance. Agus name.check(tree, tp)[1] OK BMfit - fitContinuous(tree, tp$temperature, model=BM)Warning: no tip labels, order assumed to be the same as in the tree Fitting BM model: tree Phylogenetic tree with 11 tips and 10 internal nodes. Tip labels: C.leiolepis, C.nicterus, C.sinebrachiatus, S.catimbau, N.ablephara, P.erythrocercus, ... Node labels: 1, 2, 3, 4, 5, 6, ... Rooted; includes branch lengths. -- Agustín Camacho Guerrero. Doutorando em Zoologia. Laboratório de Herpetologia, Departamento de Zoologia, Instituto de Biociências, USP. Rua do Matão, trav. 14, nº 321, Cidade Universitária, São Paulo - SP, CEP: 05508-090, Brasil. [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] problem with fitcontinuos function
Hello Agus, I believe your problem will be solved if you set row names to the object tp$temperature. An example follows: names(temperature)-row.names(tp) Please let me know if it worked. Cheers, Renata -- Renata Brandt Departamento de Biologia - FFCLRP Universidade de São Paulo Ribeirão Preto - SP - Brasil (11) 92756353 (16) 82039533 /..) ( ( ) ) ( ( ) ) _( (_ ) ) \ ) \ On Wed, May 23, 2012 at 5:16 PM, Agus Camacho agus.cama...@gmail.comwrote: Dear all, fitcontinuous is giving me a warning generally related to non correspondence between tree tips and data. However name.check and the tree tips seems ok. Anybody has a hint on this? Thanks in advance. Agus name.check(tree, tp)[1] OK BMfit - fitContinuous(tree, tp$temperature, model=BM)Warning: no tip labels, order assumed to be the same as in the tree Fitting BM model: tree Phylogenetic tree with 11 tips and 10 internal nodes. Tip labels: C.leiolepis, C.nicterus, C.sinebrachiatus, S.catimbau, N.ablephara, P.erythrocercus, ... Node labels: 1, 2, 3, 4, 5, 6, ... Rooted; includes branch lengths. -- Agustín Camacho Guerrero. Doutorando em Zoologia. Laboratório de Herpetologia, Departamento de Zoologia, Instituto de Biociências, USP. Rua do Matão, trav. 14, nº 321, Cidade Universitária, São Paulo - SP, CEP: 05508-090, Brasil. [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] problem with fitcontinuos function
Thanks to all for your fast answer, this list is amazing! Graham's answer was perfectly satisfactory, once onnly that eliminated the warning. Thanks to all again. Agus 2012/5/23 Graham Slater gsla...@ucla.edu This doesn't indicate that they don't match - rather it's telling you that there are no names associated with the tip data (as tp$temperature is just a vector of numerical values) and so it is assuming that those values are in the same order as the tip labels in the tree. Assuming the taxon names are the rownames of your dataframe tp, try fitContinuous(phy = tree, data = tp$temperature, model=BM, data.names = rownames(tp) that should work without a warning. If your tip data are in the same order as the tip labels, the model output won't change. If they're not. Graham On May 23, 2012, at 1:16 PM, Agus Camacho wrote: Dear all, fitcontinuous is giving me a warning generally related to non correspondence between tree tips and data. However name.check and the tree tips seems ok. Anybody has a hint on this? Thanks in advance. Agus name.check(tree, tp)[1] OK BMfit - fitContinuous(tree, tp$temperature, model=BM)Warning: no tip labels, order assumed to be the same as in the tree Fitting BM model: tree Phylogenetic tree with 11 tips and 10 internal nodes. Tip labels: C.leiolepis, C.nicterus, C.sinebrachiatus, S.catimbau, N.ablephara, P.erythrocercus, ... Node labels: 1, 2, 3, 4, 5, 6, ... Rooted; includes branch lengths. -- Agustín Camacho Guerrero. Doutorando em Zoologia. Laboratório de Herpetologia, Departamento de Zoologia, Instituto de Biociências, USP. Rua do Matão, trav. 14, nº 321, Cidade Universitária, São Paulo - SP, CEP: 05508-090, Brasil. [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Graham Slater Ph. D. Department of Ecology and Evolutionary Biology University of California, Los Angeles 610 Charles E. Young Drive South Los Angeles, CA 90095-1606 (424) 442-4348 gsla...@ucla.edu www.eeb.ucla.edu/gslater -- Agustín Camacho Guerrero. Doutorando em Zoologia. Laboratório de Herpetologia, Departamento de Zoologia, Instituto de Biociências, USP. Rua do Matão, trav. 14, nº 321, Cidade Universitária, São Paulo - SP, CEP: 05508-090, Brasil. [[alternative HTML version deleted]] ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo