w00t!

I am aware that `which` is inefficient, but wasn't sure how to get around it in 
this instance. Thanks!

It would be great to have this in the new version!

Joseph.
________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



> On 7 Jun, 2017, at 19:06, Klaus Schliep <klaus.schl...@gmail.com> wrote:
> 
> Hi Joseph, 
> 
> just a few changes on your code and it is actually really fast. Having 
> multiple calls to which() is often slow.  
> Replacing which() with a simple lookup table will speed your code up quite a 
> bit as you can see below. 
> This is actually how the function Ancestors() in phangorn works. 
> But let's start with the output from profiling your code and you can see that 
> alt_mrca() spends most of the time in which(). 
> 
> Rprof(tmp <- tempfile())
> nd.alt <- alt_mrca(phy, tips)
> Rprof()
> summaryRprof(tmp)
> unlink(tmp)
> 
> $by.self
>         self.time self.pct total.time total.pct
> "which"      1.56      100       1.56       100
> 
> $by.total
>            total.time total.pct self.time self.pct
> "which"          1.56       100      1.56      100
> "alt_mrca"       1.56       100      0.00        0
> 
> $sample.interval
> [1] 0.02
> 
> $sampling.time
> [1] 1.56
> 
> Now here are some timings in comparison this version with one where I 
> replaced calls to which() with a lookup table:
> 
> # ALT
> system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
>    user  system elapsed 
>   1.552   0.004   1.555 
> [1] 100003
> > 
> # ALT FAST
> system.time(nd.alt <- alt_mrca_fast(phy, tips)); nd.alt;
>    user  system elapsed 
>   0.004   0.000   0.004 
> [1] 100003 
> 
> Not too bad for some pure R code. 
> And now you really need to start thinking of another excuse for a coffee 
> break. 
> 
> Cheers, 
> Klaus
> 
> PS: Emmanuel, just another function which could get into next version. 
> PPS: The plot should better have a logarithmic scale for time.
> PPPS: Here is the improved code, just 2 additional lines.
> 
> 
> alt_mrca_fast <- function (phy, tips) {
>     rootnd <- length(phy$tip.label) + 1;
>     pars <- numeric(); # parents in lineage of first queried tip
>     mrcaind <- 1;      # index in pars of the mrca node
>     tnd <- NULL;       # node ids of tips
>     if ("character" %in% class(tips)) {
>         tnd <- match(tips, phy$tip.label);
>     } else {
>         tnd <- tips;
>     }
>     
>     # build a lookup table to get parents faster
>     pvec <- integer(max(phy$edge))
>     pvec[phy$edge[,2]] = phy$edge[,1]
>     
>     # get entire lineage for first tip
>     nd <- tnd[1];
>     repeat {
> #        nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent
>         nd <- pvec[nd]
>         pars <- c(pars, nd);
>         if (nd == rootnd) {
>             break;
>         }
>     }
>     
>     
>     # traverse lineages for remaining tips, stop if hit common ancestor
>     for (i in 2:length(tnd)) {
>         done <- FALSE;
>         cnd <- tnd[i];
>         while (!done) {
> #            cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate 
> parent
>             cpar <- pvec[cnd]
>             if (cpar %in% pars) {
>                 if (cpar == rootnd) {
>                     # early exit! if the mrca of any set of taxa is the root 
> then we are done
>                     return(rootnd);
>                 }
>                 idx <- which(pars == cpar);
>                 if (idx > mrcaind) {
>                     mrcaind <- idx;
>                 }
>                 done <- TRUE;
>             } else {
>                 # keep going!
>                 cnd <- cpar;
>             }
>         }
>     }
>     return(pars[mrcaind]);
> }
> 
> 
> 
> 
> On Wed, Jun 7, 2017 at 4:24 PM, Joseph W. Brown <josep...@umich.edu 
> <mailto:josep...@umich.edu>> wrote:
> I've been working with some very large trees (tens to hundreds of thousands 
> of tips) where the speed of APE's getMRCA function has been prohibitively 
> slow. I therefore R-ified an MRCA function I developed in Java for use with 
> the Open Tree of Life tree (~3 million tips). I don't pretend to be an 
> algorithm guy, but this seems to work.
> 
> Given a tree `phy` and a set of query taxa `tips`, here is the algorithm 
> (note that the terminology has the root at the top of the tree):
> 1) For the first tip, drill up through the tree, recording its entire lineage 
> (parental nodes) to the root in a vector. This seems wasteful (i.e., if the 
> MRCA is not the root) but in practice is actually fast. Call this vector 
> `pars`.
> 
> Even computing all ancestors is actually not so slow:
> system.time(tmp <- Ancestors(phy, 1:100000))
>    user  system elapsed 
>   0.084   0.000   0.085 
> You need to traverse the whole tree only once, but you may run out of memory 
> for larger trees.
> 
> 
> 2) For each subsequent tip, traverse its lineage up in the tree until 
> encountering an ancestor in `pars`. 
> 3) Return the highest indexed node in `pars` that was traversed by the other 
> lineages.
> 
> A convenient "early exit" of this approach is that if any of the MRCAs are 
> the root then we can stop immediately. Another benefit with this is that the 
> query nodes do not necessarily have to be terminal taxa: MRCAs can be found 
> among terminal and internal nodes.
> 
> So, how does it fare? Pretty good! Here are some results on a 100K tip tree 
> (the tree I am actually working with has 350K tips).
> 
> # set up example
> set.seed(13);
> phy <- rtree(100000);
> node <- 100003; # make sure MRCA is not the root
> tt <- unlist(Descendants(phy, 100003));
> set.seed(13);
> tips <- sample(tt, 50);
> 
> # APE
> system.time(nd.ape <- getMRCA(phy, tips)); nd.ape;
>    user  system elapsed 
>  14.459   0.066  14.482 
> [1] 100003
> 
> # ALT
> system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
>    user  system elapsed 
>   2.557   0.534   3.063 
> [1] 100003
> 
> Ok, that was for a shallow MRCA node (i.e. close to the root). How does it 
> fare for recent MRCAs? Here is a timing for sister tips:
> 
> # APE
> system.time(nd.ape <- getMRCA(phy, c(13,14))); nd.ape;
>    user  system elapsed 
>  14.715   0.063  14.727 
> [1] 100027
> 
> # ALT
> system.time(nd.alt <- alt_mrca(phy, c(13,14))); nd.alt;
>    user  system elapsed 
>   0.047   0.015   0.061
> [1] 100027
> 
> Way faster! Getting the entire lineage of the first tip does not seem to hurt 
> us.
> 
> We can also look how the timing scales as the size of the query set changes 
> (I've made sure that the MRCA is identical in every case):
> <Timing_tips.png>
> (APE is blue, alt_mrca is red). So the APE version seems insensitive to the 
> number of tips query. I suppose there must be a point where the timings cross 
> and the alt_mrca function takes more time, but I have not found it. Besides, 
> if the query set becomes large enough then the MRCA is likely the root 
> itself, and in that case we may exit early:
> 
> # APE with just 2 tips
> system.time(nd.ape <- getMRCA(phy, c(1,100000))); nd.ape;
>    user  system elapsed 
>  14.627   0.055  14.642 
> [1] 100001
> 
> # ALT with a random 500 tips
> system.time(nd.alt <- alt_mrca(phy, sample(1:100000, 500))); nd.alt;
>    user  system elapsed 
>   0.292   0.058   0.346 
> [1] 100001
> 
> Not bad. The code for the alternate MRCA is below. I've considered lapply-ing 
> this (i.e. across query nodes), but that would negate the potential early 
> exit (as all tip-to-root lineages would have to be traversed). Note that in 
> all of these timings above the APE getMRCA function uses C code but the 
> alternative is pure R. I imagine the alt_mrca could be made even speedier if 
> coded in C.
> 
> HTH.
> Joseph.
> 
> alt_mrca <- function (phy, tips) {
>     rootnd <- length(phy$tip.label) + 1;
>     pars <- numeric(); # parents in lineage of first queried tip
>     mrcaind <- 1;      # index in pars of the mrca node
>     tnd <- NULL;       # node ids of tips
>     if ("character" %in% class(tips)) {
>         tnd <- match(tips, phy$tip.label);
>     } else {
>         tnd <- tips;
>     }
>     
>     # get entire lineage for first tip
>     nd <- tnd[1];
>     repeat {
>         nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent
>         pars <- c(pars, nd);
>         if (nd == rootnd) {
>             break;
>         }
>     }
>     
>     # traverse lineages for remaining tips, stop if hit common ancestor
>     for (i in 2:length(tnd)) {
>         done <- FALSE;
>         cnd <- tnd[i];
>         while (!done) {
>             cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate 
> parent
>             if (cpar %in% pars) {
>                 if (cpar == rootnd) {
>                     # early exit! if the mrca of any set of taxa is the root 
> then we are done
>                     return(rootnd);
>                 }
>                 idx <- which(pars == cpar);
>                 if (idx > mrcaind) {
>                     mrcaind <- idx;
>                 }
>                 done <- TRUE;
>             } else {
>                 # keep going!
>                 cnd <- cpar;
>             }
>         }
>     }
>     return(pars[mrcaind]);
> }
> 
> ________________________________________
> Joseph W. Brown
> Post-doctoral Researcher, Smith Laboratory
> University of Michigan
> Department of Ecology & Evolutionary Biology
> Room 2071, Kraus Natural Sciences Building
> Ann Arbor MI 48109-1079
> josep...@umich.edu <mailto:josep...@umich.edu>
> 
> 
> 
> 
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org 
> <mailto:R-sig-phylo@r-project.org>
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo 
> <https://stat.ethz.ch/mailman/listinfo/r-sig-phylo>
> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/ 
> <http://www.mail-archive.com/r-sig-phylo@r-project.org/>
> 
> 
> 
> -- 
> Klaus Schliep
> Postdoctoral Fellow
> Revell Lab, University of Massachusetts Boston
> http://www.phangorn.org/ <http://www.phangorn.org/>

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to