correction: the last sentence should have read
I wonder how that would work in this case. I think these are
important questions going forward.
On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell <mwpenn...@gmail.com> wrote:
Hey all,
This has been a really fantastic discussion. Mark, you make some
really excellent points in response to my earlier comments. I think
you are correct in this.
The question that arises out of Jarrod and Dan's simulations (which
I have just run) is whether a model selection criteria would be
able to distinguish MK from the threshold model that Felsenstein
(and Wright before him) put forth? And how do we best assess model
adequacy? Carl Boettiger and company (2012: Evolution) suggested a
Phylogenetic Monte Carlo approach for continuous characters. I
wonder how that would before I think these are important questions
going forward.
thanks again,
matt
On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky <drabo...@umich.edu> wrote:
Hi all-
A couple of points. I am actually less concerned about the Type I
error rates I gave in that previous message for the equal rates
markov process, even though I think they are real (e.g., I can
corroborate them using Diversitree). I don't think it is an issue
of ascertainment bias, but I think Mark may be right about the LRT
being inappropriate with few events on the tree and this may well
explain the matter. This is probably worth exploring further.
However, I am much more concerned about Jarrod's second model (with
underlying continuous latent variable). This seems to be a serious
problem, and if you simulate under the latent model, I think he is
right that Type I error rates are really, really high. The model is
reasonable: there is a continuous trait that influences the
probability of observing a particular tip state. In a practical
sense, this probably means the following: (i) some clades in the
tree will essentially be fixed for the character in question. (ii)
Other clades will appear to have high lability of the character.
The clades that are fixed will be those clades where the underlying
threshold character (e.g., mean clade value) drifts towards -Inf
(or +Inf). Regardless of whether we think about this latent model,
this at least leads to an interesting - and probably quite relevant
- form of model misspecification. The model essentially induces
some extra heterogeneity in rates, such that some clades will
appear to be switching quickly and others slowly. However, it is
still a symmetric model of sorts.
You can simulate data easily under this model and verify (using
whatever software) that it is a problem. I'm attaching code that
does this. You can play around with 3 parameters: (i) the number of
taxa in the analysis (set to 100); (ii) the expected variance of
the continuous latency factor (from roots to tip); and (iii) the
root state. These parameters are NTAXA, tipvar, and root in the
code below.
I'm keen to see what others think, but it looks to me like you can
simulate very reasonable-looking datasets and obtain extremely
strong support for an asymmetric model - even though the model is
quasi-symmetric. So, if these hold, then I think this is a serious
issue - nothing we routinely do in the analysis of discrete
characters is designed to detect this sort of model misspecification.
## A single simulation
library(diversitree);
library(geiger);
library(mvtnorm);
NTAXA <- 101;
# Generate the tree:
x <- birthdeath.tree(b=1, d=0, taxa.stop=NTAXA);
x <- drop.tip(x, x$tip.label[x$edge[,2][x$edge.length==0][1]]);
vv <- vcv.phylo(x); # get phylogenetic vcv matrix
# Now we set the expected variance at the tips:
# e.g., the value we want for the diagonal of the vcv matrix
# If this is = 1, you'll have a "phylogenetic" standard
normal distribution
#
tipvar <- 2;
sf <- tipvar/max(vv); #get scale factor for vcv matrix
vmat <- vv*sf; # scale matrix
root <- 0; #root state: this assumes the root is equally likely to
give either state
mu <- rep(root, nrow(vmat)); #vector of means
# Simulate continuous, and then discrete, chars from
# the corresponding mvn and binomial distributions
chars <- rmvnorm(1, mean=mu, sigma=vmat);
states <- rbinom(length(chars), 1, prob=plogis(chars));
names(states) <- x$tip.label;
# Look at the data...
plot(x, show.tip.label=F);
tiplabels(pch=21, bg = c('black', 'white')[(states+1)], col='black', cex=1);
#### Using Diverstree for model fitting:
lfx <- make.mk2(x, states); # The asymmetric likelihood function
lfxcon <- constrain(lfx, formulae = list(q01 ~ q10)); #constraining
q01 ~ q10
# Estimation...
l2 <- find.mle(lfx, runif(2, 0, 5))$lnLik;
l1 <- find.mle(lfxcon, runif(1, 0, 5))$lnLik;
#likelihood ratio test
lrt <- -2*l1 + 2*l2;
1 - pchisq(lrt, df=1);
### End sim
Cheers,
~Dan
On Aug 16, 2012, at 9:22 PM, Mark Holder wrote:
> Hi all,
> <apologies for the long email>
>
> I'm a bit more concerned with Dan's elevated Type-1 error rates
than Jarrod's example.
>
> With respect to Jarrod's simulations, I have a few thoughts:
> 1. I don't understand the claim (in the original email)
that "its fairly straightforward to prove that asymmetric
transition rates cannot be identified using data collected on the
tips of a phylogeny" It seems like this is something that is
routinely done in phylogenetics, and proofs of identifiability of
GTR exist (demonstrating that this indeed feasible and not some
computational artifact).
>
> 2. I think that Dan's original response is correct. We
should not expect to reject the null only 5% of the time if we
simulate a bias in the states of the tips. Simulating data without
respect to a phylogeny and then analyzing under Mk or "ARD" should
be like a simulation in which the rate of character evolution is
essentially infinite. The Mk model predicts the states to be
equally frequent, and so it is not problematic for a deviation from
50:50 to favor the ARD model over the Mk model. In fact, the
simulation model is equivalent the ARD model with a rate of
evolution approaching infinity, so we should prefer the ARD model.
>
> 3. (in reference to Matt's post) In the ARD model, it is
possible for the MLE of the equilibrium state frequency of the less
frequent state to be >0.5. Presumably, this is a rare occurrence,
but I don't agree with the characterization of the ratio of
parameters converging to the ratio of states.
>
> Consider a clade with lots of tips in state 1 but tiny
branch lengths. If this clade is found in the context of a tree
with a few other branches that are long and lead to tips with state
0, then you can get an MLE of the state frequency for state 0 being
> 0.5. Most of the tips will have state 1, but because they are
easily explained by one transition to 1 you can still infer that
the less frequent state has a higher equilibrium frequency.
>
> Perhaps, I'm mis-reading what Matt is referring to when he
discusses an analysis of a tree with "a lot of tips (ie.
approaching the limit)." I do agree that if you simulate a very
large tree under ARD (with the frequencies not equal to 0.5), then
the frequency of the states at the tips will converge to the
equilibrium state frequencies.
>
>
>
> With respect to Dan's results:
>
> The Type I error rate of 0.12 troubles me. Have you tried
exporting the data and seeing if other software agrees with the
likelihood ratios returned by ace() ? I looked at the code for ace
and nothing looked amiss to me (though my R skills are virtually
non-existent).
>
> If the result is corroborated by other software, then my best
guesses would be:
> 1. ascertainment bias (the simulation model clearly
excludes constant patterns, but I don't believe that the inference
model in ace does any correction for this), and/or
> 2. the assumption that you can use the chi-square as the
null distribution for the LRT probably breaks down when you have
very few events on the tree. In some sense the number of events is
our measure of the amount of data, and when we have very few events
on the tree the asymptotic behavior of the LRT under the null is
probably not going to help us.
>
> In the limiting case, when rates of character change are so low
that you never see homoplasy, then I think the LRT of the the two
models should get close to 1 on virtually any realization
(conditional on starting in state 1 and having exactly 1 change on
the tree, both model make the same predictions about the data; so
in this realm the data should not prefer one model over the other).
So, I'm not sure how the small data explanation would explain your
observation of an excess of large LRT statistics.
>
>
> those are my 2 cents.
>
> all the best,
> Mark
>
>
_____________________
Dan Rabosky
Assistant Professor
Dept of Ecology and Evolutionary Biology
& Museum of Zoology
University of Michigan
drabo...@umich.edu