Re: [R-sig-phylo] asymmetric transitions

Jarrod Hadfield Fri, 17 Aug 2012 03:31:55 -0700

Hi,

Thanks for the Allman & Rhodes paper, it is very nice. For me at least it confirms my suspicions, but made me realise that claims of asymmetric transition rates are only suspicious if you are unprepared to make some (strong?) assumptions. If anyone disagrees with what I have written below, then please tell me and I will try again to understand this stuff:

Identifiability is achieved because the pdf for the root state is the stationary distribution (denoted by sigma in Allman & Rhodes: see example 1). This is, I believe, the default in newer versions of Mesquite, although in older versions the distribution is 0.5/0.5 as in ace.

If the pdf of the root state is defined by an additional parameter, this leaves a single parameter to describe the rate of transitions, and asymmetrical transition rates are non-identifiable. It seems to me there is a choice to be made between a) assuming that the same process operating after the root also held before the root, and talking about asymmetric transition rates, or b) not making this assumption and admitting that the rates of transition from 0->1 and 1->0 are not separable. I don't think the data can be used to distinguish between these viewpoints, and so it's a matter of personal choice which interpretation/model is used.
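
To make the role of the root distribution concrete, here is a minimal base-R sketch (not part of the original message) of the two-state model. The function names `stationary2` and `tprob2` and the rate values are illustrative assumptions. It computes the stationary distribution implied by q01 and q10, and the transition probabilities along a branch of length t from the standard closed form for a two-state chain.

```r
# Stationary distribution of a two-state continuous-time Markov chain
# with rates q01 (0 -> 1) and q10 (1 -> 0).
stationary2 <- function(q01, q10) {
  c("0" = q10, "1" = q01) / (q01 + q10)
}

# Transition probability matrix P(t), closed form for two states.
# Rows index the starting state, columns the ending state.
tprob2 <- function(q01, q10, t) {
  p <- unname(stationary2(q01, q10))
  decay <- exp(-(q01 + q10) * t)
  P <- rbind(p, p) + decay * rbind(c(p[2], -p[2]), c(-p[1], p[1]))
  dimnames(P) <- list(from = c("0", "1"), to = c("0", "1"))
  P
}

# Illustrative (hypothetical) rates and branch length
q01 <- 0.2
q10 <- 0.6
p <- stationary2(q01, q10)
P <- tprob2(q01, q10, t = 1.5)

# If the root state is drawn from p, every tip is marginally distributed as p
p %*% P
```

When the root is drawn from the stationary distribution, the marginal distribution at every tip is that same distribution, which is what ties the observed tip frequencies to the ratio of q01 to q10. If the root distribution is instead a free parameter, that link is lost.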


Cheers,

Jarrod








Quoting Mark Holder <mthol...@ku.edu> on Thu, 16 Aug 2012 23:41:45 -0500:

Hi,
I agree that model testing between ARD vs MK models is going to be misleading when the process is really described by a threshold model (and sorry for ignoring that set of simulations by Jarrod previously; somehow I misfiled that email and didn't see it).
The threshold model has nice ways of dealing with correlations among characters. However, when it is applied as the underlying model for a single binary character (as in Jarrod's sims), the threshold model is similar to the single-site version of the covarion model (Tuffley and Steel's version).
I don't think the models are identical, but they are quite similar. I suspect that if you generated a data set under one of the models, it would be quite hard to determine which was the generating model. Instead of just having an "on" and "off" state (as in the covarion model), the threshold model has a continuum (the further the underlying continuous trait is from the boundary, the more "off" the observable binary trait is). Allman and Rhodes (2009, ref below) proved some results on the identifiability of generalizations of covarion processes. They considered models with more hidden rate categories (not just a rate of zero and a rate of evolution when in the "on" state). I believe that their results were that the number of hidden rate categories that you can identify cannot exceed the number of observable states. So it may be hard to get much richer than the Tuffley+Steel covarion when you have a binary character.
Which is a long way of saying that it might be worth looking at the covarion model variants for the types of data that Jarrod is interested in. Implementations of the covarion model for two states are quite fast and tractable. Testing Mk+covarion vs ARD+covarion may indeed be a more robust way of detecting asymmetry in rates of character transitions compared to Mk vs ARD.


Thanks for pointing out the Boettiger et al paper, Matt.


all the best,
Mark
[1] E. S. Allman and J. A. Rhodes, “The Identifiability of Covarion Models in Phylogenetics,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 1, pp. 76–88, Jan. 2009.
On Aug 16, 2012, at 10:07 PM, Matt Pennell wrote:
correction: the last sentence should have read
I wonder how that would work in this case. I think these are important questions going forward.
On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell <mwpenn...@gmail.com> wrote:
Hey all,
This has been a really fantastic discussion. Mark, you make some really excellent points in response to my earlier comments. I think you are correct in this.
The question that arises out of Jarrod and Dan's simulations (which I have just run) is whether a model selection criterion would be able to distinguish MK from the threshold model that Felsenstein (and Wright before him) put forth? And how do we best assess model adequacy? Carl Boettiger and company (2012: Evolution) suggested a Phylogenetic Monte Carlo approach for continuous characters. I wonder how that would before I think these are important questions going forward.
thanks again,
matt



On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky <drabo...@umich.edu> wrote:

Hi all-
A couple of points. I am actually less concerned about the Type I error rates I gave in that previous message for the equal-rates Markov process, even though I think they are real (e.g., I can corroborate them using Diversitree). I don't think it is an issue of ascertainment bias, but I think Mark may be right about the LRT being inappropriate with few events on the tree and this may well explain the matter. This is probably worth exploring further.
However, I am much more concerned about Jarrod's second model (with underlying continuous latent variable). This seems to be a serious problem, and if you simulate under the latent model, I think he is right that Type I error rates are really, really high. The model is reasonable: there is a continuous trait that influences the probability of observing a particular tip state. In a practical sense, this probably means the following: (i) some clades in the tree will essentially be fixed for the character in question. (ii) Other clades will appear to have high lability of the character. The clades that are fixed will be those clades where the underlying threshold character (e.g., mean clade value) drifts towards -Inf (or +Inf). Regardless of whether we think about this latent model, this at least leads to an interesting - and probably quite relevant - form of model misspecification. The model essentially induces some extra heterogeneity in rates, such that some clades will appear to be switching quickly and others slowly. However, it is still a symmetric model of sorts.
You can simulate data easily under this model and verify (using whatever software) that it is a problem. I'm attaching code that does this. You can play around with 3 parameters: (i) the number of taxa in the analysis (set to 100); (ii) the expected variance of the continuous latency factor (from root to tip); and (iii) the root state. These parameters are NTAXA, tipvar, and root in the code below.
I'm keen to see what others think, but it looks to me like you can simulate very reasonable-looking datasets and obtain extremely strong support for an asymmetric model - even though the model is quasi-symmetric. So, if these hold, then I think this is a serious issue - nothing we routinely do in the analysis of discrete characters is designed to detect this sort of model misspecification.
## A single simulation

library(diversitree);
library(geiger);
library(mvtnorm);

NTAXA <- 101;

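# Generate the tree (pure-birth, stopped when NTAXA tips exist):
x <- birthdeath.tree(b=1, d=0, taxa.stop=NTAXA);
# Drop one zero-length tip so that NTAXA - 1 = 100 taxa remain
x <- drop.tip(x, x$tip.label[x$edge[,2][x$edge.length==0][1]]);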


vv <- vcv.phylo(x); # get phylogenetic vcv matrix

# Now we set the expected variance at the tips:
#       e.g., the value we want for the diagonal of the vcv matrix
# If this is = 1, you'll have a "phylogenetic" standard normal distribution
#

tipvar <- 2;
sf <- tipvar/max(vv); #get scale factor for vcv matrix

vmat <- vv*sf; # scale matrix
root <- 0; # root state: this assumes the root is equally likely to give either state
mu <- rep(root, nrow(vmat));  #vector of means


# Simulate continuous, and then discrete, chars from
#               the corresponding mvn and binomial distributions
chars <- rmvnorm(1, mean=mu, sigma=vmat);
states <- rbinom(length(chars), 1, prob=plogis(chars));
names(states) <- x$tip.label;

# Look at the data...
plot(x, show.tip.label=FALSE);
tiplabels(pch=21, bg = c('black', 'white')[(states+1)], col='black', cex=1);

#### Using diversitree for model fitting:
lfx <- make.mk2(x, states); # The asymmetric likelihood function
lfxcon <- constrain(lfx, formulae = list(q01 ~ q10)); # constraining q01 ~ q10
# Estimation...
l2 <- find.mle(lfx, runif(2, 0, 5))$lnLik;
l1 <- find.mle(lfxcon, runif(1, 0, 5))$lnLik;

#likelihood ratio test
lrt <- -2*l1 + 2*l2;
1 - pchisq(lrt, df=1);


### End sim
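
As a small check on the symmetry of the generating model above, here is a base-R sketch (added for illustration, not part of the original code) of the marginal probability that a tip is in state 1. It assumes an ultrametric tree, so that every tip liability has variance tipvar, and the helper name `tip_prob1` is hypothetical. With root = 0 the two states should be equally likely at any tip, even though the LRT above can strongly favour the asymmetric model.

```r
# Marginal probability that a tip is in state 1 under the latent model:
# liability z ~ Normal(root, tipvar), and P(state = 1 | z) = plogis(z).
tip_prob1 <- function(root, tipvar) {
  integrand <- function(z) plogis(z) * dnorm(z, mean = root, sd = sqrt(tipvar))
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

tip_prob1(root = 0, tipvar = 2)  # symmetric case: close to 0.5
tip_prob1(root = 1, tipvar = 2)  # shifting the root favours state 1
```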

Cheers,
~Dan




On Aug 16, 2012, at 9:22 PM, Mark Holder wrote:

> Hi all,
>       <apologies for the long email>
>
> I'm a bit more concerned with Dan's elevated Type-1 error rates than Jarrod's example.
>
> With respect to Jarrod's simulations, I have a few thoughts:
> 1. I don't understand the claim (in the original email) that "its fairly straightforward to prove that asymmetric transition rates cannot be identified using data collected on the tips of a phylogeny". It seems like this is something that is routinely done in phylogenetics, and proofs of identifiability of GTR exist (demonstrating that this is indeed feasible and not some computational artifact).
>
> 2. I think that Dan's original response is correct. We should not expect to reject the null only 5% of the time if we simulate a bias in the states of the tips. Simulating data without respect to a phylogeny and then analyzing under Mk or "ARD" should be like a simulation in which the rate of character evolution is essentially infinite. The Mk model predicts the states to be equally frequent, and so it is not problematic for a deviation from 50:50 to favor the ARD model over the Mk model. In fact, the simulation model is equivalent to the ARD model with a rate of evolution approaching infinity, so we should prefer the ARD model.
>
> 3. (in reference to Matt's post) In the ARD model, it is possible for the MLE of the equilibrium state frequency of the less frequent state to be > 0.5. Presumably, this is a rare occurrence, but I don't agree with the characterization of the ratio of parameters converging to the ratio of states.
>
> Consider a clade with lots of tips in state 1 but tiny branch lengths. If this clade is found in the context of a tree with a few other branches that are long and lead to tips with state 0, then you can get an MLE of the state frequency for state 0 being > 0.5. Most of the tips will have state 1, but because they are easily explained by one transition to 1 you can still infer that the less frequent state has a higher equilibrium frequency.
>
> Perhaps I'm mis-reading what Matt is referring to when he discusses an analysis of a tree with "a lot of tips (ie. approaching the limit)." I do agree that if you simulate a very large tree under ARD (with the frequencies not equal to 0.5), then the frequency of the states at the tips will converge to the equilibrium state frequencies.
>
>
>
> With respect to Dan's results:
>
> The Type I error rate of 0.12 troubles me. Have you tried exporting the data and seeing if other software agrees with the likelihood ratios returned by ace()? I looked at the code for ace and nothing looked amiss to me (though my R skills are virtually non-existent).
>
> If the result is corroborated by other software, then my best guesses would be:
> 1. ascertainment bias (the simulation model clearly excludes constant patterns, but I don't believe that the inference model in ace does any correction for this), and/or
> 2. the assumption that you can use the chi-square as the null distribution for the LRT probably breaks down when you have very few events on the tree. In some sense the number of events is our measure of the amount of data, and when we have very few events on the tree the asymptotic behavior of the LRT under the null is probably not going to help us.
>
> In the limiting case, when rates of character change are so low that you never see homoplasy, I think the likelihood ratio of the two models should get close to 1 on virtually any realization (conditional on starting in state 1 and having exactly 1 change on the tree, both models make the same predictions about the data; so in this realm the data should not prefer one model over the other). So, I'm not sure how the small data explanation would explain your observation of an excess of large LRT statistics.
>
>
> those are my 2 cents.
>
> all the best,
> Mark
>
>

_____________________
Dan Rabosky
Assistant Professor
Dept of Ecology and Evolutionary Biology
& Museum of Zoology
University of Michigan
drabo...@umich.edu
Mark Holder

mthol...@ku.edu
http://phylo.bio.ku.edu/mark-holder

==============================================
Department of Ecology and Evolutionary Biology
University of Kansas
6031 Haworth Hall
1200 Sunnyside Avenue
Lawrence, Kansas 66045

lab phone:  785.864.5789

fax (shared): 785.864.5860
==============================================




--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
