I've been working with some very large trees (tens to hundreds of thousands of tips) where the speed of APE's getMRCA function has been prohibitively slow. I therefore R-ified an MRCA function I developed in Java for use with the Open Tree of Life tree (~3 million tips). I don't pretend to be an algorithm guy, but this seems to work.
Given a tree `phy` and a set of query taxa `tips`, here is the algorithm (note that the terminology has the root at the top of the tree): 1) For the first tip, drill up through the tree, recording its entire lineage (parental nodes) to the root in a vector. This seems wasteful (i.e., if the MRCA is not the root) but in practice is actually fast. Call this vector `pars`. 2) For each subsequent tip, traverse its lineage up in the tree until encountering an ancestor in `pars`. 3) Return the highest indexed node in `pars` that was traversed by the other lineages. A convenient "early exit" of this approach is that if any of the MRCAs are the root then we can stop immediately. Another benefit with this is that the query nodes do not necessarily have to be terminal taxa: MRCAs can be found among terminal and internal nodes. So, how does it fare? Pretty good! Here are some results on a 100K tip tree (the tree I am actually working with has 350K tips). # set up example set.seed(13); phy <- rtree(100000); node <- 100003; # make sure MRCA is not the root tt <- unlist(Descendants(phy, 100003)); set.seed(13); tips <- sample(tt, 50); # APE system.time(nd.ape <- getMRCA(phy, tips)); nd.ape; user system elapsed 14.459 0.066 14.482 [1] 100003 # ALT system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt; user system elapsed 2.557 0.534 3.063 [1] 100003 Ok, that was for a shallow MRCA node (i.e. close to the root). How does it fare for recent MRCAs? Here is a timing for sister tips: # APE system.time(nd.ape <- getMRCA(phy, c(13,14))); nd.ape; user system elapsed 14.715 0.063 14.727 [1] 100027 # ALT system.time(nd.alt <- alt_mrca(phy, c(13,14))); nd.alt; user system elapsed 0.047 0.015 0.061 [1] 100027 Way faster! Getting the entire lineage of the first tip does not seem to hurt us. We can also look how the timing scales as the size of the query set changes (I've made sure that the MRCA is identical in every case): (APE is blue, alt_mrca is red). So the APE version seems insensitive to the number of tips query. I suppose there must be a point where the timings cross and the alt_mrca function takes more time, but I have not found it. Besides, if the query set becomes large enough then the MRCA is likely the root itself, and in that case we may exit early: # APE with just 2 tips system.time(nd.ape <- getMRCA(phy, c(1,100000))); nd.ape; user system elapsed 14.627 0.055 14.642 [1] 100001 # ALT with a random 500 tips system.time(nd.alt <- alt_mrca(phy, sample(1:100000, 500))); nd.alt; user system elapsed 0.292 0.058 0.346 [1] 100001 Not bad. The code for the alternate MRCA is below. I've considered lapply-ing this (i.e. across query nodes), but that would negate the potential early exit (as all tip-to-root lineages would have to be traversed). Note that in all of these timings above the APE getMRCA function uses C code but the alternative is pure R. I imagine the alt_mrca could be made even speedier if coded in C. HTH. Joseph. alt_mrca <- function (phy, tips) { rootnd <- length(phy$tip.label) + 1; pars <- numeric(); # parents in lineage of first queried tip mrcaind <- 1; # index in pars of the mrca node tnd <- NULL; # node ids of tips if ("character" %in% class(tips)) { tnd <- match(tips, phy$tip.label); } else { tnd <- tips; } # get entire lineage for first tip nd <- tnd[1]; repeat { nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent pars <- c(pars, nd); if (nd == rootnd) { break; } } # traverse lineages for remaining tips, stop if hit common ancestor for (i in 2:length(tnd)) { done <- FALSE; cnd <- tnd[i]; while (!done) { cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate parent if (cpar %in% pars) { if (cpar == rootnd) { # early exit! if the mrca of any set of taxa is the root then we are done return(rootnd); } idx <- which(pars == cpar); if (idx > mrcaind) { mrcaind <- idx; } done <- TRUE; } else { # keep going! cnd <- cpar; } } } return(pars[mrcaind]); } ________________________________________ Joseph W. Brown Post-doctoral Researcher, Smith Laboratory University of Michigan Department of Ecology & Evolutionary Biology Room 2071, Kraus Natural Sciences Building Ann Arbor MI 48109-1079 josep...@umich.edu
_______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/