Re: [R-sig-phylo] Model Selection and PGLS

2021-06-30 Thread Russell Engelman
Dear All,

> What you see is the large uncertainty in “ancestral” states, which is part
> of the intercept here. The linear relationship that you overlaid on top of
> your data is the relationship predicted at the root of the tree (as if such
> a thing existed!). There is a lot of uncertainty about the intercept, but
> much less uncertainty in the slope. It looks like the slope is not affected
> by the inclusion or exclusion of monotremes. (for one possible reference on
> the greater precision in the slope versus the intercept, there’s this:
> http://dx.doi.org/10.1214/13-AOS1105 for the BM).


Yes, that sounds right from the other data I have. The line approximates
what would be expected for the root of Mammalia, and the signal in the PGLS
is more due to shifts in the y-intercept than shifts in slope, which in
turn is supported by the anatomy of the proxy.

> My second cent is that the phylogenetic predictions should be stable. The
> uncertainty in the intercept —and the large effect of including monotremes
> on the intercept— should not affect predictions, so long as you know for
> which species you want to make a prediction. If you want to make a prediction
> for a species in a small clade “far” from monotremes, say, then the
> prediction is probably quite stable, even if you include monotremes: this
> is because the phylogenetic prediction should use the phylogenetic
> relationships for the species to be predicted. A prediction that uses the
> linear relationship at the root and ignores the placement of the species
> would be the worst-case scenario: for a mammal species with a completely
> unknown placement within mammals.


This is what I'm a bit confused about. I was always told that it isn't
possible to include phylogenetic information for the new data points in
the prediction in order to improve it, and some of the PGLS literature I
have read (e.g., Rohlf 2011; Smaers and Rohlf 2016) seems to imply the
same. I'm not sure whether it's possible or not (see below).

> There are probably a number of software packages that do phylogenetic
> prediction. I know of Rphylopars and PhyloNetworks.


I will take a look into those.

> I think that Cécile's and Theodore's point is important and too often
> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
> not simply obtained from the fitted line but should incorporate
> information from the (evolutionary here) model.


There’s a way to put phylogenetic signal back into the predictions of a
PGLS model? I am super surprised at that. I’ve talked to at least three
different colleagues who use PGLS about this issue, and all of them told me
that there is no way to feed phylogenetic signal back into the model for
new data points and that I should just go with the single regression line
the model gives me (i.e., the regression line for the ancestral node).
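
A minimal sketch of the BLUP idea from the quoted passage, assuming a
Brownian-motion model on a known tree. The prediction for a new species is
the fitted line evaluated at its predictor value plus a correction
c' V^-1 (y - X b), where V holds the shared branch lengths among the sampled
species and c holds the shared branch lengths between the new species and
each sampled one. The four-species tree, the data values, and the function
names gls_fit and blup_predict are made up for illustration; this is not the
interface of Rphylopars or PhyloNetworks.

```python
import numpy as np

def gls_fit(X, y, V):
    """GLS coefficients b = (X' V^-1 X)^-1 X' V^-1 y."""
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)

def blup_predict(X, y, V, x_new, c_new):
    """Fitted line at x_new plus the phylogenetic correction c' V^-1 (y - X b)."""
    b = gls_fit(X, y, V)
    resid = y - X @ b
    return x_new @ b + c_new @ np.linalg.solve(V, resid)

# Hypothetical ultrametric tree ((A:1,B:1):1,(C:1,D:1):1), root-to-tip depth 2.
# Under BM the covariance between two tips is their shared path from the root.
V = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

# Made-up log predictor and log body mass for A, B, C, D.
x = np.array([1.0, 1.2, 2.0, 2.3])
y = np.array([2.1, 2.6, 3.0, 3.4])
X = np.column_stack([np.ones_like(x), x])  # intercept + slope design matrix

# New species E, assumed sister to A and splitting from A at depth 1.5:
# it shares 1.5 units of history with A, 1.0 with B, none with C or D.
x_new = np.array([1.0, 1.1])
c_new = np.array([1.5, 1.0, 0.0, 0.0])

root_line = x_new @ gls_fit(X, y, V)          # ignores where E sits in the tree
blup = blup_predict(X, y, V, x_new, c_new)    # uses E's placement

print("root-line prediction:", root_line)
print("BLUP prediction:     ", blup)
```

If c_new is all zeros (a lineage attached at the root, sharing no history
with the sample), the correction vanishes and the BLUP falls back to the
root-line prediction, which matches the worst-case scenario described in the
quoted text above.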

I tried looking around to see what previous researchers did when using PCMs
on body mass (Esteban-Trivigno and Köhler 2011, Campione and Evans 2012,
Yapuncich 2017 thesis), and it looks like all of them just went with the
best-fit line for the ancestral node; i.e., judging from their reported
results, they give a simple trait~predictor equation that does not include
phylogeny when calculating new data. Campione and Evans 2012 used PIC rather
than PGLS, which I know are technically equivalent, but it doesn't seem like
they included phylogenetic information when they predicted new data: they
used their equations on dinosaurs, but there are no dinosaurs in the tree
they used. I know that it’s possible to incorporate phylogenetic signal into
the new data using PVR, but PVR has been criticized for other reasons.

This is something that seems really, really concerning because if there
is a method of using phylogenetic covariance to adjust the position of new
data points it seems like a lot of workers don’t know these methods exist,
to the point that even published papers overlook it. This was something I
was hoping to highlight in a later paper on the data, but it sounds like
people might have discussed it already. I remember talking with my
colleagues a lot about "isn't there some way to incorporate phylogenetic
information back into the model to improve accuracy of the prediction if we
know where the taxon is positioned?" and they just thought there wasn't a
way.

> Regarding the model comparison, I would simply avoid it (or limit it) by
> fitting models flexible enough to accommodate between your BM and OLS case
> and summarize the results obtained across all the trees…
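
One possible reading of "models flexible enough to accommodate between your
BM and OLS case" is Pagel's lambda, which rescales the off-diagonal elements
of the BM covariance matrix: lambda = 1 gives BM, and lambda = 0 gives
independent tips (equivalent to OLS when the tree is ultrametric). That the
quoted author had lambda in mind is an assumption. The sketch below profiles
the Gaussian log-likelihood over a grid of lambda values; the tree, the data,
and the function names lambda_transform and profile_loglik are made up for
illustration.

```python
import numpy as np

def lambda_transform(V, lam):
    """Multiply off-diagonal covariances by lam, keeping the diagonal as is."""
    V_lam = lam * V                       # new array; V itself is not modified
    np.fill_diagonal(V_lam, np.diag(V))   # restore the original tip variances
    return V_lam

def profile_loglik(X, y, V):
    """Gaussian log-likelihood at the GLS estimates, with ML variance scale."""
    n = len(y)
    b = np.linalg.solve(X.T @ np.linalg.solve(V, X), X.T @ np.linalg.solve(V, y))
    r = y - X @ b
    sigma2 = (r @ np.linalg.solve(V, r)) / n
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * (n * np.log(2.0 * np.pi * sigma2) + logdet + n)

# Hypothetical ultrametric tree ((A:1,B:1):1,(C:1,D:1):1) under BM.
V = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

# Made-up log predictor and log body mass for A, B, C, D.
x = np.array([1.0, 1.2, 2.0, 2.3])
y = np.array([2.1, 2.6, 3.0, 3.4])
X = np.column_stack([np.ones_like(x), x])

# Evaluate the likelihood on a grid between the OLS-like and BM extremes.
grid = np.linspace(0.0, 1.0, 21)
logliks = np.array([profile_loglik(X, y, lambda_transform(V, lam)) for lam in grid])
best_lam = grid[np.argmax(logliks)]

print("lambda with highest log-likelihood on the grid:", best_lam)
```

With real data the same loop could be repeated over each candidate tree and
the resulting lambda values and coefficients summarized across trees, rather
than choosing between a pure OLS and a pure BM fit.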


I am not entirely sure what is meant here. Do you mean fitting both an OLS
and a BM model and comparing them? I am reporting both, but my concern
is which of the two is the best one to use going forward, since
the BM model is seemingly less accurate (though I am just taking the fitted
values from the PGLS model, which I don't think include phylogenetic
information). The two models I use produce dramatically different results,

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-30 Thread Theodore Garland
Russell,
Please read this paper:
https://pubmed.ncbi.nlm.nih.gov/10718731/
Cheers
Ted

