Re: [R-sig-phylo] identifying phylogenetically equivalent nodes

Eliot Miller Thu, 18 Feb 2021 12:18:13 -0800

Assuming trees are rooted and fully bifurcating, I think the following will
work. I tried a few tests and it seemed to work. I traded a few emails back
and forth with Jake to make sure I understood what he wanted. Guess what I
was supposed to be working on wasn't holding my attention...


Best,
Eliot

library(geiger)

first <- sim.bdtree(n=50)
second <- sim.bdtree(n=1000)
third <- sim.bdtree(n=2000)

tableSetup <- function(tree1, tree2)
{
  output <-
data.frame(matrix(nrow=min(c(length(tree1$tip.label),length(tree2$tip.label)))-1,
ncol=2))
  names(output) <- c("tree1", "tree2")
  output
}

#least inclusive=if there are multiple nodes in tree 2 which could
potentially be mapped to
#a node in tree 1, do you want to keep the smallest or the largest one? for
example:
#((A,B),Z) vs (((A,B),Z),C), where C doesn't occur in tree 1. Do you want
the matched node
#from tree 2 to be A,B,Z or A,B,Z,C.
equivCut <- function(tree1, tree2, results, least.inclusive)
{
  #identify nodes that appear more than once
  dups <- unique(results[,2][duplicated(results[,2])])

  #if there aren't any, return the original results
  if(length(dups)==0)
  {
    return(results)
  }

  #loop through, add details on size of clade to data frame
  for(i in 1:length(dups))
  {
    #set aside all the results that pertain to that node
    setAside <- results[results[,2]==dups[i],]

    #go through each node and figure out how many taxa descend from that
node
    temp <- data.frame(matrix(nrow=dim(setAside)[1], ncol=2))
    names(temp) <- c("node","no.taxa")
    for(j in 1:dim(setAside)[1])
    {
      temp[j,"node"] <- setAside[j,"tree1"]
      temp[j,"no.taxa"] <- length(extract.clade(tree1,
setAside[j,"tree1"])$tip.label)
    }

    #now you can decide which of these to keep
    if(least.inclusive)
    {
      keep <- temp$node[temp$no.taxa==min(temp$no.taxa)]
      toDrop <- setdiff(temp$node, keep)
      results <- results[!(results$tree1 %in% toDrop),]
    }

    else
    {
      keep <- temp$node[temp$no.taxa==max(temp$no.taxa)]
      toDrop <- setdiff(temp$node, keep)
      results <- results[!(results$tree1 %in% toDrop),]
    }
  }

  results
}

idMatches <- function(tree1, tree2, least.inclusive=TRUE)
{
  #set the results table up
  results <- tableSetup(tree1, tree2)

  #define tree1 nodes here
  nodes1 <- (length(tree1$tip.label)+1):max(tree1$edge)

  for(i in 1:length(nodes1))
  {
    #set the correct row in the tree1 column to the node in question
    results[i,1] <- nodes1[i]

    #find the descendants of this node
    tips1 <- tips(tree1, nodes1[i])

    #drop these to just tips that are in tree 2
    tips1 <- tips1[tips1 %in% tree2$tip.label]

    #if there's nothing left, set to no match and move on
    if(length(tips1)==0)
    {
      results[i,2] <- "no.match"
      next()
    }

    #find the MRCA of those tips in tree 2
    #if there's only a single taxon left, pull the node it descends from
(getMRCA will fail)
    if(length(tips1==1))
    {
      edge2 <- which.edge(tree2, tips1)
      mrca2 <- tree2$edge[edge2,][1]
    }
    else
    {
      mrca2 <- getMRCA(tree2, tips1)
    }

    #find the descendants of that node
    tips2 <- tips(tree2, mrca2)

    #drop to just taxa that are in tree1
    tips2 <- tips2[tips2 %in% tree1$tip.label]

    #if these tips are equivalent, the nodes match
    if(setequal(tips1, tips2))
    {
      results[i,2] <- mrca2
    }

    else
    {
      results[i,2] <- "no.match"
    }
  }

  #consider what you want to do next. do you want to, e.g., add a row for
  #every not-yet-mentioned tree2 node and add "no.match" to the first
column,
  #or do you want to return as is, or do you want to only return matches?
for
  #now, drop to only matches and re-class both as integer so can demonstrate
  #equality no matter which tree is first or second
  results <- results[results[,2] != "no.match",]
  results[,2] <- as.integer(results[,2])

  #run the equivCut function
  results <- equivCut(tree1, tree2, results, least.inclusive)
  results
}

#always returns equivalent no matter which tree is first if you set it to
least inclusive = TRUE
system.time(test <- idMatches(first,second))
system.time(test2 <- idMatches(second,first))
setequal(test[,1], test2[,2])
setequal(test[,2], test2[,1])

#always returns equivalent no matter which tree is first if you set it to
least inclusive = TRUE
system.time(test3 <- idMatches(second, third))
system.time(test4 <- idMatches(third, second))
setequal(test3[,1], test4[,2])
setequal(test3[,2], test4[,1])

#here's how you might compare a set of trees to a single reference one
ref <- sim.bdtree(n=100)
target1 <- sim.bdtree(n=78)
target2 <- sim.bdtree(n=86)

vs1 <- idMatches(ref,target1)
vs2 <- idMatches(ref,target2)
names(vs1) <- c("ref","target1")
names(vs2) <- c("ref","target2")

final <- data.frame(ref=unique(c(vs1$ref, vs2$ref)))
final <- merge(final, vs1, all.x=TRUE)
final <- merge(final, vs2, all.x=TRUE)
final

On Thu, Feb 18, 2021 at 8:56 AM Emmanuel Paradis <emmanuel.para...@ird.fr>
wrote:

> Hi Jacob,
>
> ape::makeNodeLabel(phy, method = "md5sum") returns 'phy' with node labels
> that depend on the tips descendant from each node. For instance:
>
> tr3 <- makeNodeLabel(rtree(3), m = "m")
> tr4 <- makeNodeLabel(rtree(4), m = "m")
> any(tr3$node.label %in% tr4$node.label)
>
> If you repeat these 3 commands several times, you should have ~20% of
> TRUE. In your case, match() should make more sense.
>
> Also, I suppose your trees are rooted. If they are unrooted, you should
> consider using splits (or root them).
>
> Best,
>
> Emmanuel
>
> ----- Le 18 Fév 21, à 0:59, Jacob Berv jakeberv.r.sig.ph...@gmail.com a
> écrit :
> > Dear R-sig-phylo,
> >
> > Over the weekend, I asked Liam Revell if he had a solution to use
> matchNodes for
> > a particular problem I’m trying to solve—finding all phylogenetically
> > equivalent nodes when comparing trees that have uneven taxon samples and
> > different topologies. Liam was kind enough to take some time to write a
> blog
> > post about this, and got me started with some code
> >
> >
> http://blog.phytools.org/2021/02/on-matching-nodes-between-trees-using.html
> >
> > On it’s face this seems like a simple problem, but I’m running into some
> issues
> > and thought I would reach out to the broader group. The code linked
> above seems
> > to work, but only for comparing trees that start out as topologically
> > identical. For my purposes, I’m trying to match nodes from a given a
> reference,
> > to nodes in and across several hundred gene trees that differ in
> topology and
> > taxon sample relative to the reference.
> >
> > Here is a function definition based on Liam’s example
> >
> > #function to match nodes from consensus
> > #to individual gene trees with uneven sampling
> > #derived from Liam Revell's example-- need to
> > testmatch_phylo_nodes<-function(t1, t2){
> >  ## step one drop tips
> >  t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label))
> >  t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label))
> >
> >  ## step two match nodes "descendants"
> >  M<-matchNodes(t1p,t2p)
> >
> >  ## step two match nodes "distances"
> >  M1<-matchNodes(t1,t1p,"distances")
> >  M2<-matchNodes(t2,t2p,"distances")
> >
> >  ## final step, reconcile
> >  MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right")))
> >
> >  for(i in 1:nrow(MM)){
> >    MM[i,1]<-M1[i,1]
> >    nn<-M[which(M[,1]==M1[i,2]),2]
> >    if(length(nn)>0){
> >        MM[i,2]<-M2[which(M2[,2]==nn),1]
> >    }
> >  }
> >  return(MM)
> > }
> >
> >
> > When t1 and t2 are trees that have topological conflicts, this function
> returns
> > an error:
> >
> > Error in MM[i, 2] <- M2[which(M2[, 2] == nn), 1] :
> >  replacement has length zero
> >
> > I think(?) this happens because a particular node doesn’t exist in one
> or the
> > other trees, and it returns integer(0) at that line — but I’m not sure I
> really
> > understand what is going on here.
> >
> >
> > I modified Liam’s code slightly to get it to run without error in the
> described
> > case, by making it conditional on that particular line:
> >
> >
> > Modified version
> >
> > #function to match nodes from consensus
> > #to individual gene trees with uneven sampling
> > #derived from Liam Revell's example-- need to test
> > match_phylo_nodes<-function(t1, t2){
> >       ## step one drop tips
> >       t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label))
> >       t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label))
> >
> >       ## step two match nodes "descendants"
> >       M<-matchNodes(t1p,t2p)
> >
> >       ## step two match nodes "distances"
> >       M1<-matchNodes(t1,t1p,"distances")
> >       M2<-matchNodes(t2,t2p,"distances")
> >
> >       ## final step, reconcile
> >       MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right")))
> >
> >       for(i in 1:nrow(MM)){
> >               MM[i,1]<-M1[i,1]
> >       nn<-M[which(M[,1]==M1[i,2]),2]
> >    if(length(nn)>0){
> >       if(length(which(M2[,2]==nn))>0){
> >               MM[i,2]<-M2[which(M2[,2]==nn),1]
> >       }
> >    } else {
> >    }
> > }
> > return(MM)
> > }
> >
> >
> > I’ve been experimenting with this and some downstream code for the last
> few
> > days, but I’ve run into some weird inconsistent results (not easily
> summarized)
> > that make me think that this function is not working as intended.
> >
> > I was wondering — have any of you dealt with a similar problem? In
> principle
> > this seems like it should be similar to concordance analysis, but I care
> less
> > about identifying the proportion of nodes that exist in gene trees given
> a
> > reference, and instead I need the actual node numbers in a given gene
> tree that
> > are phylogenetically equivalent to particular nodes in a reference.
> Happy to
> > try to hack away at something…
> >
> >
> > Best,
> > Jake Berv
> >
> >
> >
> >
> >
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-phylo mailing list - R-sig-phylo@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> > Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] identifying phylogenetically equivalent nodes

Reply via email to