Re: [R-sig-phylo] Model Selection and PGLS

Julien Clavel Sun, 04 Jul 2021 10:20:50 -0700


"There’s a way to impute phylogenetic signal back into a PGLS model? I am super 
surprised at that. I’ve talked to at least three different colleagues who use 
PGLS about this issue, and all of them had told me that there is no way to 
input phylogenetic signal back into the model for new data points and I should 
just go with the single regression line the model gives me (i.e., the 
regression line for the ancestral node). "

=> Not only is it possible, it is also needed to obtain the best predictions. 
The use of GLS in prediction dates back to at least the early 1960s (e.g. 
Goldberger 1962), and you can find it in any book on GLS. Theodore pointed to 
Garland & Ives 2000 for its use in a phylogenetic context; see also Martins & 
Hansen 1997 and probably others. This is closely related to the methods used 
for ancestral state reconstruction (e.g. Goolsby et al. 2017) and is readily 
implemented in at least PhyloNetworks, Rphylopars and mvMORPH.
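
For illustration (not from the thread), the GLS predictor in the setting of 
Goldberger 1962 can be written as below. The notation is chosen here: X and y 
are the design matrix and responses of the observed species, V is their 
covariance matrix under the evolutionary model, x_new holds the predictors of 
the new species, and c holds the covariances between the new species and the 
observed ones.

```latex
\hat{\beta} = (X^{\top} V^{-1} X)^{-1} X^{\top} V^{-1} y,
\qquad
\hat{y}_{\mathrm{new}} = x_{\mathrm{new}}^{\top}\hat{\beta}
  + c^{\top} V^{-1}\,(y - X\hat{\beta})
```

The first term is the fitted line; the second term is the phylogenetic 
correction, which vanishes only when c = 0, i.e. when the new species shares 
no covariance with any observed species.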

"I am not entirely sure what is meant here. Do you mean fitting both an OLS and 
BM model and comparing both models?"

=> I mean that using "OU" or "Pagel's lambda" as suggested in previous comments 
(or even more flexible or realistic models!) will probably be better than 
making a dichotomous decision between "no signal" and "BM", because these 
models can accommodate both scenarios as well as intermediate ones. This is 
also important for hypothesis testing; see for instance Revell 2010 and Clavel 
& Morlon 2020.
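
As a rough illustration of why such a model spans both extremes (not from the 
thread; the function name and the toy covariance matrix are invented here), 
Pagel's lambda rescales the off-diagonal entries of the BM covariance matrix 
and leaves the tip variances alone. This sketch only handles lambda between 0 
and 1:

```python
def pagel_lambda(V, lam):
    """Return a copy of covariance matrix V with off-diagonal entries scaled by lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("this sketch only handles 0 <= lambda <= 1")
    return [[v if i == j else lam * v for j, v in enumerate(row)]
            for i, row in enumerate(V)]


# Hypothetical BM covariance for the ultrametric tree ((A:1,B:1):1,(C:1,D:1):1):
# the diagonal is root-to-tip distance, off-diagonals are shared branch length.
V_bm = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]

for lam in (0.0, 0.5, 1.0):
    print(lam, pagel_lambda(V_bm, lam))
```

With lambda = 1 the matrix is the BM one; with lambda = 0 it is diagonal, which 
for an ultrametric tree like this one amounts to the OLS weighting. Values in 
between give intermediate levels of phylogenetic covariance.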

"though I am just taking the fitted values from the PGLS model, which I don't 
think include phylogenetic information"

=> Again, this is suboptimal (probably worse than OLS!) unless the 
phylogenetic model is used in the prediction (see the references above).
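
To make the contrast concrete, here is a minimal, dependency-free Python sketch 
(not from the thread, and not how any of the packages above implement it). The 
tree, the data values and all function names are invented for illustration. It 
fits a GLS line and returns both the fitted-line value and the prediction that 
also uses the new species' covariances with the observed species:

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b (b given as a list of rows) by Gauss-Jordan elimination."""
    n = len(a)
    m = [ra[:] + rb[:] for ra, rb in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))  # partial pivoting
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def gls_predict(X, y, V, x_new, c_new):
    """Return (fitted-line prediction, prediction with phylogenetic correction)."""
    y_col = [[v] for v in y]
    Xt = transpose(X)
    # GLS coefficients: (X' V^-1 X)^-1 X' V^-1 y
    beta = solve(matmul(Xt, solve(V, X)), matmul(Xt, solve(V, y_col)))
    resid = [[yi - fi[0]] for yi, fi in zip(y, matmul(X, beta))]
    line = matmul([x_new], beta)[0][0]
    # Correction: c' V^-1 (y - X beta)
    shift = matmul([c_new], solve(V, resid))[0][0]
    return line, line + shift


# Hypothetical BM covariance for observed species A, B, C, D on the tree
# ((A:1,B:1):1,(C:1,D:1):1), plus made-up predictor and response values.
V = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
X = [[1.0, 1.0], [1.0, 1.2], [1.0, 2.0], [1.0, 2.3]]  # intercept + predictor
y = [2.9, 3.1, 3.0, 3.4]

# New species E, assumed sister to A and splitting 1.5 units from the root.
x_new = [1.0, 1.1]
c_new = [1.5, 1.0, 0.0, 0.0]

line, blup = gls_predict(X, y, V, x_new, c_new)
print("fitted line only:", line)
print("with phylogenetic correction:", blup)
```

If c_new is all zeros (a species with no shared history with the sample), the 
two numbers coincide; the closer the new species sits to an observed one, the 
more that neighbour's residual pulls the prediction away from the line.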

Cheers,

Julien

From: Russell Engelman <neovenatori...@gmail.com>
Sent: Thursday, 1 July 2021 06:20
To: Julien Clavel <julien.cla...@hotmail.fr>
Cc: Theodore Garland <theodore.garl...@ucr.edu>; Cecile Ane 
<cecile....@wisc.edu>; mailman, r-sig-phylo <r-sig-phylo@r-project.org>
Subject: Re: [R-sig-phylo] Model Selection and PGLS 

Dear All,

> What you see is the large uncertainty in “ancestral” states, which is part of 
> the intercept here. The linear relationship that you overlaid on top of your 
> data is the relationship predicted at the root of the tree (as if such a thing 
> existed!). There is a lot of uncertainty about the intercept, but much less 
> uncertainty in the slope. It looks like the slope is not affected by the 
> inclusion or exclusion of monotremes. (for one possible reference on the 
> greater precision in the slope versus the intercept, there’s this: 
> http://dx.doi.org/10.1214/13-AOS1105 for the BM).

Yes, that sounds right from the other data I have. The line approximates what 
would be expected for the root of Mammalia, and the signal in the PGLS is more 
due to shifts in the y-intercept than shifts in slope, which in turn is 
supported by the anatomy of the proxy.

> My second cent is that the phylogenetic predictions should be stable. The 
> uncertainty in the intercept (and the large effect of including monotremes on 
> the intercept) should not affect predictions, so long as you know for which 
> species you want to make a prediction. If you want to make a prediction for a 
> species in a small clade “far” from monotremes, say, then the prediction is 
> probably quite stable, even if you include monotremes: this is because the 
> phylogenetic prediction should use the phylogenetic relationships for the 
> species to be predicted. A prediction that uses the linear relationship at the 
> root and ignores the placement of the species would be the worst-case scenario: 
> for a mammal species with a completely unknown placement within mammals.

This is what I'm a bit confused about. I was always told (and some of the PGLS 
literature I read, like Rohlf 2011 and Smaers and Rohlf 2016, seemingly 
implies) that it isn't possible to use phylogenetic information about the new 
data points to improve predictions. I'm not sure whether it's possible or not 
(see below).

> There are probably a number of software packages that do phylogenetic 
> prediction. I know of Rphylopars and PhyloNetworks.

I will take a look into those.

> I think that Cécile's and Theodore's point is important and too often 
> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is 
> not simply obtained from the fitted line but should incorporate information 
> from the (here, evolutionary) model.

     There’s a way to impute phylogenetic signal back into a PGLS model? I am 
super surprised at that. I’ve talked to at least three different colleagues who 
use PGLS about this issue, and all of them had told me that there is no way to 
input phylogenetic signal back into the model for new data points and I should 
just go with the single regression line the model gives me (i.e., the 
regression line for the ancestral node). 

     I tried looking around to see what previous researchers did when using 
PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and Evans 2012, 
Yapuncich 2017 thesis) and it looks like all of them just went with the 
best-fit line for the ancestral node, i.e., judging from their reported 
results, they give a simple trait~predictor equation that does not include 
phylogeny when calculating new data. Campione and Evans 2012 used PIC rather 
than PGLS, which I know are technically equivalent, but it doesn't seem like 
they included phylogenetic information when they predicted new data: they 
applied their equations to dinosaurs, but there are no dinosaurs in the tree 
they used. I know that it’s possible to incorporate phylogenetic signal into 
the new data using PVR, but PVR has been criticized for other reasons.

    This seems really, really concerning, because if there is a method of 
using phylogenetic covariance to adjust the position of new data points, it 
seems like a lot of workers don’t know it exists, to the point that even 
published papers overlook it. This was something I was hoping 
to highlight in a later paper on the data, but it sounds like people might have 
discussed it already. I remember talking with my colleagues a lot about "isn't 
there some way to incorporate phylogenetic information back into the model to 
improve accuracy of the prediction if we know where the taxon is positioned?" 
and they just thought there wasn't a way.

> Regarding the model comparison, I would simply avoid it (or limit it) by 
> fitting models flexible enough to accommodate cases between your BM and OLS 
> extremes, and summarize the results obtained across all the trees…

I am not entirely sure what is meant here. Do you mean fitting both an OLS and 
BM model and comparing both models? I am reporting both, but my concern is 
which of the models I report is the best one to use going forward, since the 
BM model is seemingly less accurate (though I am just taking the fitted values 
from the PGLS model, which I don't think include phylogenetic information). The 
two models I use produce dramatically different results; for example, the BM 
model produces body mass estimates that are 25% larger than those from OLS.

Right now PGLS is something I would avoid if I had the option (if for no other 
reason than to avoid putting all of the analyses in a single, overloaded 
manuscript [the manuscript is already about 90 pages] and deviating from the 
scope of the study), but I'm sure you know that most regression analyses 
nowadays require some sort of preliminary PCM to be acceptable.

Sincerely,
Russell

On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel <julien.cla...@hotmail.fr> wrote:
I think that Cécile's and Theodore's point is important and too often 
overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is 
not simply obtained from the fitted line but should incorporate information 
from the (here, evolutionary) model.

For multivariate linear models you can also do it with the “predict” function 
in mvMORPH, by specifying a tree that includes both the species used to build 
the model and the ones you want to predict (I think that Rphylopars can deal 
with multivariate phylogenetic regression too).

Regarding the model comparison, I would simply avoid it (or limit it) by 
fitting models flexible enough to accommodate cases between your BM and OLS 
extremes, and summarize the results obtained across all the trees…

Julien

From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of Theodore 
Garland <theodore.garl...@ucr.edu>
Sent: Wednesday, 30 June 2021 03:26
To: Cecile Ane <cecile....@wisc.edu>
Cc: mailman, r-sig-phylo <r-sig-phylo@r-project.org>; neovenatori...@gmail.com 
<neovenatori...@gmail.com>
Subject: Re: [R-sig-phylo] Model Selection and PGLS 

All true.  I would just add two things.  First, always graph your data and
do ordinary least-squares (OLS) analyses as a reality check.

Second, I think this is the original paper for phylogenetic prediction:
Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
present: confidence intervals for regression equations in phylogenetic
comparative methods. American Naturalist 155:346–364.
There, we talk about the Equivalency of the Independent-Contrasts and
Generalized Least Squares Approaches.

Cheers,
Ted

On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane <cecile....@wisc.edu> wrote:

> Hi Russell,
>
> What you see is the large uncertainty in “ancestral” states, which is part
> of the intercept here. The linear relationship that you overlaid on top of
> your data is the relationship predicted at the root of the tree (as if such
> a thing existed!). There is a lot of uncertainty about the intercept, but
> much less uncertainty in the slope. It looks like the slope is not affected
> by the inclusion or exclusion of monotremes. (for one possible reference on
> the greater precision in the slope versus the intercept, there’s this:
> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
>
> My second cent is that the phylogenetic predictions should be stable. The
> uncertainty in the intercept (and the large effect of including monotremes
> on the intercept) should not affect predictions, so long as you know for
> which species you want to make a prediction. If you want to make a prediction
> for a species in a small clade “far” from monotremes, say, then the
> prediction is probably quite stable, even if you include monotremes: this
> is because the phylogenetic prediction should use the phylogenetic
> relationships for the species to be predicted. A prediction that uses the
> linear relationship at the root and ignores the placement of the species
> would be the worst-case scenario: for a mammal species with a completely
> unknown placement within mammals.
>
> There are probably a number of software packages that do phylogenetic
> prediction. I know of Rphylopars and PhyloNetworks.
>
> my 2 cents…
> Cecile
>
> ---
> Cécile Ané, Professor (she/her)
> H. I. Romnes Faculty Fellow
> Departments of Statistics and of Botany
> University of Wisconsin - Madison
> http://www.stat.wisc.edu/~ane/
>
> CALS statistical consulting lab:
> https://calslab.cals.wisc.edu/stat-consulting/
>
>
>
> On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com wrote:
>
> Dear All,
>
> So this is the main problem I'm facing (see attached figure, which should
> be small enough to post). When I calculate the best-fit line under a
> Brownian model, this produces a best-fit line that more or less bypasses
> the distribution of the data altogether. I did some testing and found that
> this result was driven solely by the presence of Monotremata, resulting in
> the model heavily downweighting all of the phylogenetic variation within
> Theria in favor of the deep divergence between Monotremata and Theria.
> Excluding Monotremata produces a PGLS fit that's comparable enough to the
> OLS and OU model fit to be justifiable (though I can't just throw out
> Monotremata for the sake of throwing it out).
>
> I am planning to do a more theoretical investigation into the effect of
> Monotremata on the PGLS fit in a future study, but right now what I am
> trying to do is perform a study in which I use this data to construct a
> regression model that can be used to predict new data. Which is why I am
> trying to use AIC to potentially justify going with OLS or an OU model over
> a Brownian model. From a practical perspective the Brownian model is almost
> unusable because it produces systematically biased estimates with high
> error rates when applied to new data (error rate is roughly double that of
> both the OLS and OU model). This is especially the case because the data
> must be back-transformed into an arithmetic scale to be useable, and thus a
> seemingly minor difference in regression models results in a massive
> difference in predicted values. However, I need some objective test to show
> that OLS fits the data better than the Brownian model, hence why I was
> going with AIC. Overall, OLS does seem to outperform the Brownian model on
> average, but the variation in AIC is so high it is hard to interpret this.
>
> This is kind of why I am leery of assuming a null Brownian model. A
> Brownian model, if anything, does not seem to accurately model the
> relationship between variables.
>
> This is why I am having trouble figuring out how to do model selection.
> Going just by accuracy statistics like percent error or standard error of
> the estimate, OLS is better from a purely practical standpoint (it doesn't
> work for the monotreme taxa, but it turns out that estimate error in the
> monotremes is only decreased by 10% in a Brownian model when it
> overestimates mass by nearly 75%, so the improvement really isn't worth it
> and using this for monotremes isn't recommended in the first place), but
> the reviewers are expressing skepticism over the fact that the Brownian
> model produces less useable results. And I'm not entirely sure of the best
> way to go about the PGLS if using one of the birth-death trees isn't ideal;
> perhaps what Dr. Upham says about using the DNA tree might work better.
>
> Ironically, an OU model might be argued to better fit the data, despite
> the concerns that Dr. Bapst mentioned. Looking at the distribution of
> signal, even though signal is not random, it is more accurately described as
> most taxa hewing to a stable equilibrium with rapid, high-magnitude shifts
> at certain evolutionary nodes, rather than the covariation between the two
> traits evolving in a Brownian fashion. I did some experiments with a PSR
> curve and the results seem to favor an OU model or other models with uneven
> rates of evolution rather than a pure Brownian model.
>
> Of course, the broader issue I am facing is trying to deal with PGLS
> succinctly; the scope of the study isn't necessarily an in-depth comparison
> between different regression models, it's more looking at how this variable
> correlates with body mass for practical purposes (for which considering
> phylogeny is one part of that). It's definitely something to consider but I
> am trying to avoid manuscript bloat.
>
> Sincerely,
> Russell
>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>
