w00t! I am aware that `which` is inefficient, but wasn't sure how to get around it in this instance. Thanks!
It would be great to have this in the new version! Joseph. ________________________________________ Joseph W. Brown Post-doctoral Researcher, Smith Laboratory University of Michigan Department of Ecology & Evolutionary Biology Room 2071, Kraus Natural Sciences Building Ann Arbor MI 48109-1079 josep...@umich.edu > On 7 Jun, 2017, at 19:06, Klaus Schliep <klaus.schl...@gmail.com> wrote: > > Hi Joseph, > > just a few changes on your code and it is actually really fast. Having > multiple calls to which() is often slow. > Replacing which() with a simple lookup table will speed your code up quite a > bit as you can see below. > This is actually how the function Ancestors() in phangorn works. > But let's start with the output from profiling your code and you can see that > alt_mrca() spends most of the time in which(). > > Rprof(tmp <- tempfile()) > nd.alt <- alt_mrca(phy, tips) > Rprof() > summaryRprof(tmp) > unlink(tmp) > > $by.self > self.time self.pct total.time total.pct > "which" 1.56 100 1.56 100 > > $by.total > total.time total.pct self.time self.pct > "which" 1.56 100 1.56 100 > "alt_mrca" 1.56 100 0.00 0 > > $sample.interval > [1] 0.02 > > $sampling.time > [1] 1.56 > > Now here are some timings in comparison this version with one where I > replaced calls to which() with a lookup table: > > # ALT > system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt; > user system elapsed > 1.552 0.004 1.555 > [1] 100003 > > > # ALT FAST > system.time(nd.alt <- alt_mrca_fast(phy, tips)); nd.alt; > user system elapsed > 0.004 0.000 0.004 > [1] 100003 > > Not too bad for some pure R code. > And now you really need to start thinking of another excuse for a coffee > break. > > Cheers, > Klaus > > PS: Emmanuel, just another function which could get into next version. > PPS: The plot should better have a logarithmic scale for time. > PPPS: Here is the improved code, just 2 additional lines. > > > alt_mrca_fast <- function (phy, tips) { > rootnd <- length(phy$tip.label) + 1; > pars <- numeric(); # parents in lineage of first queried tip > mrcaind <- 1; # index in pars of the mrca node > tnd <- NULL; # node ids of tips > if ("character" %in% class(tips)) { > tnd <- match(tips, phy$tip.label); > } else { > tnd <- tips; > } > > # build a lookup table to get parents faster > pvec <- integer(max(phy$edge)) > pvec[phy$edge[,2]] = phy$edge[,1] > > # get entire lineage for first tip > nd <- tnd[1]; > repeat { > # nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent > nd <- pvec[nd] > pars <- c(pars, nd); > if (nd == rootnd) { > break; > } > } > > > # traverse lineages for remaining tips, stop if hit common ancestor > for (i in 2:length(tnd)) { > done <- FALSE; > cnd <- tnd[i]; > while (!done) { > # cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate > parent > cpar <- pvec[cnd] > if (cpar %in% pars) { > if (cpar == rootnd) { > # early exit! if the mrca of any set of taxa is the root > then we are done > return(rootnd); > } > idx <- which(pars == cpar); > if (idx > mrcaind) { > mrcaind <- idx; > } > done <- TRUE; > } else { > # keep going! > cnd <- cpar; > } > } > } > return(pars[mrcaind]); > } > > > > > On Wed, Jun 7, 2017 at 4:24 PM, Joseph W. Brown <josep...@umich.edu > <mailto:josep...@umich.edu>> wrote: > I've been working with some very large trees (tens to hundreds of thousands > of tips) where the speed of APE's getMRCA function has been prohibitively > slow. I therefore R-ified an MRCA function I developed in Java for use with > the Open Tree of Life tree (~3 million tips). I don't pretend to be an > algorithm guy, but this seems to work. > > Given a tree `phy` and a set of query taxa `tips`, here is the algorithm > (note that the terminology has the root at the top of the tree): > 1) For the first tip, drill up through the tree, recording its entire lineage > (parental nodes) to the root in a vector. This seems wasteful (i.e., if the > MRCA is not the root) but in practice is actually fast. Call this vector > `pars`. > > Even computing all ancestors is actually not so slow: > system.time(tmp <- Ancestors(phy, 1:100000)) > user system elapsed > 0.084 0.000 0.085 > You need to traverse the whole tree only once, but you may run out of memory > for larger trees. > > > 2) For each subsequent tip, traverse its lineage up in the tree until > encountering an ancestor in `pars`. > 3) Return the highest indexed node in `pars` that was traversed by the other > lineages. > > A convenient "early exit" of this approach is that if any of the MRCAs are > the root then we can stop immediately. Another benefit with this is that the > query nodes do not necessarily have to be terminal taxa: MRCAs can be found > among terminal and internal nodes. > > So, how does it fare? Pretty good! Here are some results on a 100K tip tree > (the tree I am actually working with has 350K tips). > > # set up example > set.seed(13); > phy <- rtree(100000); > node <- 100003; # make sure MRCA is not the root > tt <- unlist(Descendants(phy, 100003)); > set.seed(13); > tips <- sample(tt, 50); > > # APE > system.time(nd.ape <- getMRCA(phy, tips)); nd.ape; > user system elapsed > 14.459 0.066 14.482 > [1] 100003 > > # ALT > system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt; > user system elapsed > 2.557 0.534 3.063 > [1] 100003 > > Ok, that was for a shallow MRCA node (i.e. close to the root). How does it > fare for recent MRCAs? Here is a timing for sister tips: > > # APE > system.time(nd.ape <- getMRCA(phy, c(13,14))); nd.ape; > user system elapsed > 14.715 0.063 14.727 > [1] 100027 > > # ALT > system.time(nd.alt <- alt_mrca(phy, c(13,14))); nd.alt; > user system elapsed > 0.047 0.015 0.061 > [1] 100027 > > Way faster! Getting the entire lineage of the first tip does not seem to hurt > us. > > We can also look how the timing scales as the size of the query set changes > (I've made sure that the MRCA is identical in every case): > <Timing_tips.png> > (APE is blue, alt_mrca is red). So the APE version seems insensitive to the > number of tips query. I suppose there must be a point where the timings cross > and the alt_mrca function takes more time, but I have not found it. Besides, > if the query set becomes large enough then the MRCA is likely the root > itself, and in that case we may exit early: > > # APE with just 2 tips > system.time(nd.ape <- getMRCA(phy, c(1,100000))); nd.ape; > user system elapsed > 14.627 0.055 14.642 > [1] 100001 > > # ALT with a random 500 tips > system.time(nd.alt <- alt_mrca(phy, sample(1:100000, 500))); nd.alt; > user system elapsed > 0.292 0.058 0.346 > [1] 100001 > > Not bad. The code for the alternate MRCA is below. I've considered lapply-ing > this (i.e. across query nodes), but that would negate the potential early > exit (as all tip-to-root lineages would have to be traversed). Note that in > all of these timings above the APE getMRCA function uses C code but the > alternative is pure R. I imagine the alt_mrca could be made even speedier if > coded in C. > > HTH. > Joseph. > > alt_mrca <- function (phy, tips) { > rootnd <- length(phy$tip.label) + 1; > pars <- numeric(); # parents in lineage of first queried tip > mrcaind <- 1; # index in pars of the mrca node > tnd <- NULL; # node ids of tips > if ("character" %in% class(tips)) { > tnd <- match(tips, phy$tip.label); > } else { > tnd <- tips; > } > > # get entire lineage for first tip > nd <- tnd[1]; > repeat { > nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent > pars <- c(pars, nd); > if (nd == rootnd) { > break; > } > } > > # traverse lineages for remaining tips, stop if hit common ancestor > for (i in 2:length(tnd)) { > done <- FALSE; > cnd <- tnd[i]; > while (!done) { > cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate > parent > if (cpar %in% pars) { > if (cpar == rootnd) { > # early exit! if the mrca of any set of taxa is the root > then we are done > return(rootnd); > } > idx <- which(pars == cpar); > if (idx > mrcaind) { > mrcaind <- idx; > } > done <- TRUE; > } else { > # keep going! > cnd <- cpar; > } > } > } > return(pars[mrcaind]); > } > > ________________________________________ > Joseph W. Brown > Post-doctoral Researcher, Smith Laboratory > University of Michigan > Department of Ecology & Evolutionary Biology > Room 2071, Kraus Natural Sciences Building > Ann Arbor MI 48109-1079 > josep...@umich.edu <mailto:josep...@umich.edu> > > > > > _______________________________________________ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > <mailto:R-sig-phylo@r-project.org> > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > <https://stat.ethz.ch/mailman/listinfo/r-sig-phylo> > Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/ > <http://www.mail-archive.com/r-sig-phylo@r-project.org/> > > > > -- > Klaus Schliep > Postdoctoral Fellow > Revell Lab, University of Massachusetts Boston > http://www.phangorn.org/ <http://www.phangorn.org/> [[alternative HTML version deleted]] _______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/