[R-sig-phylo] PIC or PGLS for genome-wide SNP screening

2012-05-23 Thread Mattia Prosperi
Dear all,

I am working on a data set composed of bacterial genomic sequences (a
few genes) associated to phenotypic values (in-vitro resistance to
antibiotics, a numerical value discretised into a binary class). Of
note, the bacterial isolates were sampled non-uniformly at different
times and locations, thus with a possible sampling bias. The data set
is ~1,000 variables and ~1,000 observations.

I have been applying several methods for developing a model to predict
antibiotic resistance from the single nucleotide polymorphisms (SNP)
extracted from a multiple alignment, applying classical statistical
learning and feature selection methods.
Eventually, I found that a logistic regression with main effects,
where the variables were selected first by a univariable chi-square
screening and then by AIC stepwise, was as good as other more complex
and non-linear methods (such as random forests) by comparing different
loss function (AUROC, specificity, sensitivity) distributions  upon
multiple cross-validation runs. Also, the SNP sets selected by
different approaches were highly similar and consistent across several
bootstrap evaluations.

I found that a few relevant (even after Bonferroni correction) SNP
were located in gene regions that are not supposed to be related with
antibiotic resistance. I thought that this might be a consequence of
neutral mutations that became fixed in the population by chance after
a genetic bottleneck (e.g. antibiotic pressure).
I'd like to understand if such SNP that is associated to antibiotic
resistance (and actually not expected to be) is indeed just a random
mutation of an early isolate that was carrying the true resistance SNP
(in another gene region) and that was selected by the antibiotic
pressure, thus transfering both the true resistance SNP and the
hitchhicking ones to the offspring. Unfortunately it is not easy to
cross-tabulate SNP in different genes because not all isolates have
been sequenced the same set of genes.

In order to check for fake/true SNP associated to resistance, I
thought I might use a PIC or PGLS approach (after estimating a
phylogenetic tree from the multiple alignment), in the same settings
as the original analysis, i.e. a model selection approach with both
feature and performance evaluation (well, since the coefficients of
PGLS/OLS are the same, it's just a matter of standard errors and
feature set selection), regressing the resistance class as a dependent
variable and using the SNP as covariates.

Is this a reasonable approach? Does it make sense to set up -for
instance- an AIC stepwise selection within a PGLS modeling?
I know that there is a way to check for phylogenetic signal and
therefore decide if the PGLS approach shall be employed. Is this the
way in which one decides for OLS vs. PGLS?

Last but not least, which is the most appropriate covariance matrix
calculation and PGLS implementation for this input-output set (i.e.
categorical variables, binary class)? The brunch function within
caper, or compar.gee within ape?

Thanks and apologies if some of the questions are silly.

M. Prosperi.

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


Re: [R-sig-phylo] PIC or PGLS for genome-wide SNP screening

2012-05-23 Thread Theodore Garland Jr
Is this the way in which one decides for OLS vs. PGLS?

If you have the same set of independent variables, then you just prefer the one 
(OLS or PGLS) with the higher likelihood.  So far as I am told by Joe 
Felsenstein, you cannot do a ln maximum likelihood ratio test because the 
number of parameters is the same, although this paper seems to suggest 
otherwise:
Mooers, A. O., S. M. Vamosi, and D. Schluter. 1999. Using phylogenies to test 
macroevolutionary hypotheses of trait evolution in cranes (Gruinae). American 
Naturalist 154:249-259.

For two models with the same set of independent variables, AIC does not add 
anything for you.

If you go to something like Regression with an OU process modeled for the 
residuals, then you do have an additional parameter being estimated and so you 
can do an ln maximum likelihood ratio test of that model versus OLS and versus 
PGLS.  For example, see:
Lavin, S. R., W. H. Karasov, A. R. Ives, K. M. Middleton, and T. Garland, Jr. 
2008. Morphometrics of the avian small intestine, compared with non-flying 
mammals: A phylogenetic perspective. Physiological and Biochemical Zoology 
81:526-550. [provides Matlab Regressionv2.m, released as part of the PHYSIG 
package]
Gartner, G. E. A., J. W. Hicks, P. R. Manzani, D. V. Andrade, A. S. Abe, T. 
Wang, S. M. Secor, and T. Garland, Jr. 2010. Phylogeny, ecology, and heart 
position in snakes. Physiological and Biochemical Zoology 83:43-54.

Cheers,
Ted

Theodore Garland, Jr.
Professor
Department of Biology
University of California, Riverside
Riverside, CA 92521
Office Phone:  (951) 827-3524
Wet Lab Phone:  (951) 827-5724
Dry Lab Phone:  (951) 827-4026
Home Phone:  (951) 328-0820
Facsimile:  (951) 827-4286 = Dept. office (not confidential)
Email:  tgarl...@ucr.edu
http://www.biology.ucr.edu/people/faculty/Garland.html

Experimental Evolution: Concepts, Methods, and Applications of Selection 
Experiments. 2009.
Edited by Theodore Garland, Jr. and Michael R. Rose
http://www.ucpress.edu/book.php?isbn=9780520261808
(PDFs of chapters are available from me or from the individual authors)


From: r-sig-phylo-boun...@r-project.org [r-sig-phylo-boun...@r-project.org] on 
behalf of Mattia Prosperi [ahn...@gmail.com]
Sent: Wednesday, May 23, 2012 8:05 AM
To: r-sig-phylo@r-project.org
Subject: [R-sig-phylo] PIC or PGLS for genome-wide SNP screening

Dear all,

I am working on a data set composed of bacterial genomic sequences (a
few genes) associated to phenotypic values (in-vitro resistance to
antibiotics, a numerical value discretised into a binary class). Of
note, the bacterial isolates were sampled non-uniformly at different
times and locations, thus with a possible sampling bias. The data set
is ~1,000 variables and ~1,000 observations.

I have been applying several methods for developing a model to predict
antibiotic resistance from the single nucleotide polymorphisms (SNP)
extracted from a multiple alignment, applying classical statistical
learning and feature selection methods.
Eventually, I found that a logistic regression with main effects,
where the variables were selected first by a univariable chi-square
screening and then by AIC stepwise, was as good as other more complex
and non-linear methods (such as random forests) by comparing different
loss function (AUROC, specificity, sensitivity) distributions  upon
multiple cross-validation runs. Also, the SNP sets selected by
different approaches were highly similar and consistent across several
bootstrap evaluations.

I found that a few relevant (even after Bonferroni correction) SNP
were located in gene regions that are not supposed to be related with
antibiotic resistance. I thought that this might be a consequence of
neutral mutations that became fixed in the population by chance after
a genetic bottleneck (e.g. antibiotic pressure).
I'd like to understand if such SNP that is associated to antibiotic
resistance (and actually not expected to be) is indeed just a random
mutation of an early isolate that was carrying the true resistance SNP
(in another gene region) and that was selected by the antibiotic
pressure, thus transfering both the true resistance SNP and the
hitchhicking ones to the offspring. Unfortunately it is not easy to
cross-tabulate SNP in different genes because not all isolates have
been sequenced the same set of genes.

In order to check for fake/true SNP associated to resistance, I
thought I might use a PIC or PGLS approach (after estimating a
phylogenetic tree from the multiple alignment), in the same settings
as the original analysis, i.e. a model selection approach with both
feature and performance evaluation (well, since the coefficients of
PGLS/OLS are the same, it's just a matter of standard errors and
feature set selection), regressing the resistance class as a dependent
variable and using the SNP as covariates.

Is this a reasonable approach? Does it make sense to set up -for
instance- an AIC stepwise selection 

[R-sig-phylo] General question about the impact of sampling bias in comparative and diversification analyses

2012-05-23 Thread Julien Lorion
Dear all,

I have a 5-gene phylogenetic tree, dated, quite well resolved, including 50 
species and covering 80~90% of the known diversity... I have also LASER 
diversification analyses and bayesian reconstruction of ancestral states for 5 
characters , both taking into account the impact of sampling biases...

I evaluate now the adaptive significance of 5 characters  with ape/caper (using 
Pagel's test, GEE and pGLS) AND their impact of diversification (using the Yule 
model with covariates). My results makes perfectly sense in terms of biology 
but I am worrying about their their relevance.  Indeed, for 2 of my characters, 
I have only 50% of data... In another hand, the distribution of data is well 
balanced among lineages... and I would say that corresponding patterns of 
evolution inferred from Bayesian analyses are pretty clear... 

I now the Beast team is working on an implementation of a Bayesian model for 
correlation analysis, that takes into account uncertainties (and thus missing 
data). But it is not ready yet. 

So I have a couple of -likely naive- questions: 
1. Is there a way to evaluate the impact of this kind of sampling bias on 
comparative analyses and diversification analyses ? 

2. My guess is that the biases will affect much more directly the Yule model 
with covariates than GEE, pGLS and Pagel's models... as removing taxa for which 
I don't have data from the tree will induce an additional bias on 
diversification rates. Am I right ?

So far, I have removed all missing data in analyses including those 2 
characters... My next question may sound crazy and/or desperate and I have a 
guess on the answer already :-), but... 
3. How about assuming that the patterns inferred in my reconstruction of 
ancestral states are correct (free from unsampled reversal or acquisition that 
may drastically challenge conclusions) and completing real data with most 
probable states sampled during Bayesian approach along tip branches ? and then 
compare with results obtained without missing data ?

4 Should I definitely forget about the two characters for which I have only 50% 
data? 

Thanks for any answer/comment/suggestion !

Regards

Julien



Julien Lorion
PhD, Post-doctoral fellow of the Japan Society for the Promotion of Science
Japan Agency for Marine-Earth Science and Technology (JAMSTEC)
Marine Ecosystems Research Department
2-15 Natsushima, Yokosuka 237-0061 Japan
Phone: +81-46-867-9570, Fax: +81-46-867-9525

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


[R-sig-phylo] problem with fitcontinuos function

2012-05-23 Thread Agus Camacho
Dear all,

fitcontinuous is giving me a warning generally related to non
correspondence between tree tips and data. However name.check and the tree
tips seems ok.

Anybody has a hint on this?

Thanks in advance.
Agus

 name.check(tree, tp)[1] OK BMfit - fitContinuous(tree, tp$temperature, 
 model=BM)Warning: no tip labels, order assumed to be the same as in the tree
Fitting  BM model: tree
Phylogenetic tree with 11 tips and 10 internal nodes.

Tip labels:
C.leiolepis, C.nicterus, C.sinebrachiatus, S.catimbau, N.ablephara,
P.erythrocercus, ...
Node labels:
1, 2, 3, 4, 5, 6, ...

Rooted; includes branch lengths.



-- 
Agustín Camacho Guerrero.
Doutorando em Zoologia.
Laboratório de Herpetologia, Departamento de Zoologia, Instituto de
Biociências, USP.
Rua do Matão, trav. 14, nº 321, Cidade Universitária,
São Paulo - SP, CEP: 05508-090, Brasil.

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


Re: [R-sig-phylo] problem with fitcontinuos function

2012-05-23 Thread Renata Brandt
Hello Agus,

I believe your problem will be solved if you set row names to the object
tp$temperature.

An example follows:
names(temperature)-row.names(tp)

Please let me know if it worked.
Cheers, Renata


-- 
Renata Brandt
Departamento de Biologia - FFCLRP
Universidade de São Paulo
Ribeirão Preto - SP - Brasil
(11) 92756353
(16) 82039533

 /..)
(  (
 )  )
 (  (
  )  )
_(  (_
  )  )
  \ )
\



On Wed, May 23, 2012 at 5:16 PM, Agus Camacho agus.cama...@gmail.comwrote:

 Dear all,

 fitcontinuous is giving me a warning generally related to non
 correspondence between tree tips and data. However name.check and the tree
 tips seems ok.

 Anybody has a hint on this?

 Thanks in advance.
 Agus

  name.check(tree, tp)[1] OK BMfit - fitContinuous(tree,
 tp$temperature, model=BM)Warning: no tip labels, order assumed to be the
 same as in the tree
 Fitting  BM model: tree
 Phylogenetic tree with 11 tips and 10 internal nodes.

 Tip labels:
C.leiolepis, C.nicterus, C.sinebrachiatus, S.catimbau, N.ablephara,
 P.erythrocercus, ...
Node labels:
1, 2, 3, 4, 5, 6, ...

 Rooted; includes branch lengths.



 --
 Agustín Camacho Guerrero.
 Doutorando em Zoologia.
 Laboratório de Herpetologia, Departamento de Zoologia, Instituto de
 Biociências, USP.
 Rua do Matão, trav. 14, nº 321, Cidade Universitária,
 São Paulo - SP, CEP: 05508-090, Brasil.

[[alternative HTML version deleted]]


 ___
 R-sig-phylo mailing list
 R-sig-phylo@r-project.org
 https://stat.ethz.ch/mailman/listinfo/r-sig-phylo



[[alternative HTML version deleted]]

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo


Re: [R-sig-phylo] problem with fitcontinuos function

2012-05-23 Thread Agus Camacho
Thanks to all for your fast answer, this list is amazing!

Graham's answer was perfectly satisfactory, once onnly that eliminated the
warning.

Thanks to all again.
Agus

2012/5/23 Graham Slater gsla...@ucla.edu

 This doesn't indicate that they don't match - rather it's telling you that
 there are no names associated with the tip data (as tp$temperature is just
 a vector of numerical values) and so it is assuming that those values are
 in the same order as the tip labels in the tree.

 Assuming the taxon names are the rownames of your dataframe tp, try

   fitContinuous(phy = tree, data = tp$temperature, model=BM,
 data.names = rownames(tp)

 that should work without a warning. If your tip data are in the same order
 as the tip labels, the model output won't change. If they're not.

 Graham


 On May 23, 2012, at 1:16 PM, Agus Camacho wrote:

  Dear all,
 
  fitcontinuous is giving me a warning generally related to non
  correspondence between tree tips and data. However name.check and the
 tree
  tips seems ok.
 
  Anybody has a hint on this?
 
  Thanks in advance.
  Agus
 
  name.check(tree, tp)[1] OK BMfit - fitContinuous(tree,
 tp$temperature, model=BM)Warning: no tip labels, order assumed to be the
 same as in the tree
  Fitting  BM model: tree
  Phylogenetic tree with 11 tips and 10 internal nodes.
 
  Tip labels:
C.leiolepis, C.nicterus, C.sinebrachiatus, S.catimbau, N.ablephara,
  P.erythrocercus, ...
Node labels:
1, 2, 3, 4, 5, 6, ...
 
  Rooted; includes branch lengths.
 
 
 
  --
  Agustín Camacho Guerrero.
  Doutorando em Zoologia.
  Laboratório de Herpetologia, Departamento de Zoologia, Instituto de
  Biociências, USP.
  Rua do Matão, trav. 14, nº 321, Cidade Universitária,
  São Paulo - SP, CEP: 05508-090, Brasil.
 
[[alternative HTML version deleted]]
 
  ___
  R-sig-phylo mailing list
  R-sig-phylo@r-project.org
  https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

 
 Graham Slater Ph. D.
 Department of Ecology and Evolutionary Biology
 University of California, Los Angeles
 610 Charles E. Young Drive South
 Los Angeles, CA 90095-1606

 (424) 442-4348
 gsla...@ucla.edu
 www.eeb.ucla.edu/gslater











-- 
Agustín Camacho Guerrero.
Doutorando em Zoologia.
Laboratório de Herpetologia, Departamento de Zoologia, Instituto de
Biociências, USP.
Rua do Matão, trav. 14, nº 321, Cidade Universitária,
São Paulo - SP, CEP: 05508-090, Brasil.

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo