Re: [R-sig-phylo] Model Selection and PGLS

2021-07-02 Thread Simone Blomberg
I knew there would be a mistake somewhere. Here is the correction:

## C <- 1+(lambda-1)*(1-mat) ## Lambda correction WRONG!!
C <- (lambda*mat)+(1-lambda)*diag(diag(mat)) ## correct, I think..

I tested it on a toy example of a tree with 5 species and it works fine..

I hope there are no other mistakes! See other comments below.

Cheers,

Simone.

On 3/7/21 8:40 am, Russell Engelman wrote:
> Dear All,
>
> Okay, so I tried the code that Dr. Blomberg provided, and it doesn't 
> seem to work. I rewrote the code slightly in order for it to try and 
> predict the body mass of the taxa used to calculate the line, rather 
> than some new taxon, because I need to do that in order to calculate 
> prediction error and be able to compare it to previous studies. I got 
> a lambda of about 0.9, which is approximately the same as what I got 
> previously. So on the surface it seems like the method should work.
>
> The problem comes when I try to apply this to the data itself to 
> calculate mu. This is the modified code I used if that helps, though I 
> don't think I can send the data or tree to make it replicable because 
> the message wouldn't go through last time.
>
> tree.trimmed <- drop.tip(trees[[1]], which(is.na (b$bm)))
> b.trimmed <- b[complete.cases(b$bm),]
> fit <- gls(log(bm)~I(log(ocw)^(2/3)), corPagel(0.5, phy=tree.trimmed, 
> form=~species),
>            data=b.trimmed)
> (lambda <- as.numeric(fit$modelStruct))
> model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=b)
> preds <- predict(fit,newdata=b)
> mat <- vcv.phylo(trees[[1]])
> C <- 1+(lambda-1)*(1-mat) ## Lambda correction
> XnoMean <- scale(model.matFULL, scale=FALSE)
> mu <- C %*% solve(C) %*%  XnoMean ## Replaced CCompletedata and cihmat 
> with C, since both seem to be subsets of that data. This is the only 
> place I could think things might go wrong.

C is the full variance-covariance matrix, made from the tree that 
includes both the known and unknown species. CCompletedata is the 
variance-covariance matrix WITHOUT the unknown species. This corresponds 
to C in Ted and Tony's paper.  cihmat is the matrix of relationships of 
the unknown species to the known species, one row for each unknown 
species. So you cannot replace cihmat with C.

> mu2<-(mu[match(b$species,row.names(mu)),])[,2] ## Have to re-sort mu 
> based on the data sheet. Mu is reported in phylogenetic order, whereas 
> predict.gls will always report the data in alphabetical order of the 
> row names even if a phylogeny is present.
predict.gls reports the predicted values in the order that is present in 
the newdata argument. You shouldn't have to re-order anything in my 
example code. YMMV however if you adapt my code.
> exp(preds+mu2) ## "correct" prediction accounting for bias due to 
> correlations with other species. Exp() is because data were 
> log-transformed.
>
> So when I check the results of this equation the predicted values are 
> actually /less /accurate than the fitted values reported from gls 
> without correction. Normally the accuracy is about +/- 64% of the 
> actual value, but adding the correction here results in the data 
> being +/- 84% of the actual value. What's more, I took a look at the 
> values of mu and they don't seem to make sense when compared on the 
> phylogeny. Namely, if mu were being calculated right, one would expect 
> monotremes to have a very negative value compared to all other mammals 
> given the deep divergence in trait states between monotremes and all 
> other taxa in the figure I previously attached. But when I check the 
> mu values, I get a value of 0.35 for /Zaglossus/, -0.58 for 
> /Tachyglossus/, and 0.59 for /Ornithorhynchus/. Which is not what 
> would be expected if mu were modelling a phylogenetic correction 
> factor (if nothing else, the range of mu should probably not span 
> zero). So it suggests to me I'm not doing something right. I know that 
> PGLS doesn't always result in better fits than OLS, but I'm surprised 
> adding signal back in made the results worse when the fit was so poor 
> when it was fitted through the ancestral node.
>
> Sincerely,
> Russell

-- 
Simone Blomberg, BSc (Hons), PhD, MAppStat (she/her)
Senior Lecturer and Consultant Statistician
School of Biological Sciences
The University of Queensland
St. Lucia Queensland 4072
Australia
T: +61 7 3365 2506
email: S.Blomberg1_at_uq.edu.au
Twitter: @simoneb66
UQ ALLY Supporting the diversity of sexuality and gender at UQ.

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

Basically, I'm not interested in doing research
and I never have been. I'm interested in
understanding, which is quite a different thing.
And often to understand something you have to
work it out for yourself because no one else
has done it. - David Blackwell


[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-02 Thread Russell Engelman
Dear All,

Okay, so I tried the code that Dr. Blomberg provided, and it doesn't seem
to work. I rewrote the code slightly in order for it to try and predict the
body mass of the taxa used to calculate the line, rather than some new
taxon, because I need to do that in order to calculate prediction error and
be able to compare it to previous studies. I got a lambda of about 0.9,
which is approximately the same as what I got previously. So on the surface
it seems like the method should work.

The problem comes when I try to apply this to the data itself to calculate
mu. This is the modified code I used if that helps, though I don't think I
can send the data or tree to make it replicable because the message
wouldn't go through last time.

tree.trimmed <- drop.tip(trees[[1]], which(is.na(b$bm)))
b.trimmed <- b[complete.cases(b$bm),]
fit <- gls(log(bm)~I(log(ocw)^(2/3)), corPagel(0.5, phy=tree.trimmed,
form=~species),
   data=b.trimmed)
(lambda <- as.numeric(fit$modelStruct))
model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=b)
preds <- predict(fit,newdata=b)
mat <- vcv.phylo(trees[[1]])
C <- 1+(lambda-1)*(1-mat) ## Lambda correction
XnoMean <- scale(model.matFULL, scale=FALSE)
mu <- C %*% solve(C) %*%  XnoMean ## Replaced CCompletedata and cihmat with
C, since both seem to be subsets of that data. This is the only place I
could think things might go wrong.
mu2<-(mu[match(b$species,row.names(mu)),])[,2] ## Have to re-sort mu based
on the data sheet. Mu is reported in phylogenetic order, whereas
predict.gls will always report the data in alphabetical order of the row
names even if a phylogeny is present.
exp(preds+mu2) ## "correct" prediction accounting for bias due to
correlations with other species. Exp() is because data were log-transformed.

So when I check the results of this equation the predicted values are
actually *less *accurate than the fitted values reported from gls without
correction. Normally the accuracy is about +/- 64% of the actual value, but
adding the correction here results in the data being +/- 84% of the actual
value. What's more, I took a look at the values of mu and they don't seem
to make sense when compared on the phylogeny. Namely, if mu were being
calculated right, one would expect monotremes to have a very negative value
compared to all other mammals given the deep divergence in trait states
between monotremes and all other taxa in the figure I previously attached.
But when I check the mu values, I get a value of 0.35 for *Zaglossus*,
-0.58 for *Tachyglossus*, and 0.59 for *Ornithorhynchus*. Which is not what
would be expected if mu were modelling a phylogenetic correction factor (if
nothing else, the range of mu should probably not span zero). So it
suggests to me I'm not doing something right. I know that PGLS doesn't
always result in better fits than OLS, but I'm surprised adding signal back
in made the results worse when the fit was so poor when it was fitted
through the ancestral node.

Sincerely,
Russell

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Model Selection and PGLS

2021-07-02 Thread Chris Organ
Hi Russell,

And, for a fully PGLS and Bayesian model, see this:
https://pubmed.ncbi.nlm.nih.gov/17344851/

Best, Chris

On Thu, Jul 1, 2021 at 1:10 AM Theodore Garland
 wrote:
>
> Russell,
> Please read this paper:
> https://pubmed.ncbi.nlm.nih.gov/10718731/
> Cheers
> Ted
>
>
> On Wed, Jun 30, 2021, 9:21 PM Russell Engelman 
> wrote:
>
> > Dear All,
> >
> > What you see is the large uncertainty in “ancestral” states, which is part
> >> of the intercept here. The linear relationship that you overlaid on top of
> >> your data is the relationship predicted at the root of the tree (as if such
> >> a thing existed!). There is a lot of uncertainty about the intercept, but
> >> much less uncertainty in the slope. It looks like the slope is not affected
> >> by the inclusion or exclusion of monotremes. (for one possible reference on
> >> the greater precision in the slope versus the intercept, there’s this:
> >> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
> >
> >
> > Yes, that sounds right from the other data I have. The line approximates
> > what would be expected for the root of Mammalia, and the signal in the PGLS
> > is more due to shifts in the y-intercept than shifts in slope, which in
> > turn is supported by the anatomy of the proxy.
> >
> > My second cent is that the phylogenetic predictions should be stable. The
> >> uncertainty in the intercept —and the large effect of including monotremes
> >> on the intercept— should not affect predictions, so long as you know for
> >> which species you want to make a prediction. If you want to make prediction
> >> for a species in a small clade “far” from monotremes, say, then the
> >> prediction is probably quite stable, even if you include monotremes: this
> >> is because the phylogenetic prediction should use the phylogenetic
> >> relationships for the species to be predicted. A prediction that uses the
> >> linear relationship at the root and ignores the placement of the species
> >> would be the worst-case scenario: for a mammal species with a completely
> >> unknown placement within mammals.
> >
> >
> > This is what I'm a bit confused about. I was always told (and it seemingly
> > implies this in some of the PGLS literature I read like Rohlf 2011 and
> > Smaers and Rohlf 2016) that it isn't possible to include phylogenetic data
> > from the new data points into the prediction in order to improve
> > predictions. I'm a little confused as to whether it's possible or not (see
> > below).
> >
> > There’s probably a number of software that do phylogenetic prediction. I
> >> know of Rphylopars and PhyloNetworks.
> >
> >
> > I will take a look into those.
> >
> > I think that Cécile' and Theodore' point is important and too often
> >> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
> >> not simply obtained from the fitted line but should incorporates
> >> information from the (evolutionary here) model.
> >
> >
> >  There’s a way to impute phylogenetic signal back into a PGLS model? I
> > am super surprised at that. I’ve talked to at least three different
> > colleagues who use PGLS about this issue, and all of them had told me that
> > there is no way to input phylogenetic signal back into the model for new
> > data points and I should just go with the single regression line the model
> > gives me (i.e., the regression line for the ancestral node).
> >
> >  I tried looking around to see what previous researchers used when
> > using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
> > Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just went
> > with the best fit line with the ancestral node, i.e., looking at their
> > reported results they give a simple trait~predictor equation that does not
> > include phylogeny when calculating new data. Campion and Evans 2012 used
> > PIC versus PGLS, which I know are technically equivalent but it doesn't
> > seem like they included phylogenetic information when they predicted new
> > data: they used their equations on dinosaurs but there are no dinosaurs in
> > the tree they used. I know that it’s possible to incorporate phylogenetic
> > signal into the new data using PVR but PVR has been criticized for other
> > reasons.
> >
> > This is something that seems really, really concerning because if
> > there is a method of using phylogenetic covariance to adjust the position
> > of new data points it seems like a lot of workers don’t know these methods
> > exist, to the point that even published papers overlook it. This was
> > something I was hoping to highlight in a later paper on the data, but it
> > sounds like people might have discussed it already. I remember talking with
> > my colleagues a lot about "isn't there some way to incorporate phylogenetic
> > information back into the model to improve accuracy of the prediction if we
> > know where the taxon is positioned?" and they just thought there wasn't a
> > way.
> >
> > Regarding the model