Assuming trees are rooted and fully bifurcating, I think the following will work. I tried a few tests and it seemed to work. I traded a few emails back and forth with Jake to make sure I understood what he wanted. Guess what I was supposed to be working on wasn't holding my attention...
Best, Eliot library(geiger) first <- sim.bdtree(n=50) second <- sim.bdtree(n=1000) third <- sim.bdtree(n=2000) tableSetup <- function(tree1, tree2) { output <- data.frame(matrix(nrow=min(c(length(tree1$tip.label),length(tree2$tip.label)))-1, ncol=2)) names(output) <- c("tree1", "tree2") output } #least inclusive=if there are multiple nodes in tree 2 which could potentially be mapped to #a node in tree 1, do you want to keep the smallest or the largest one? for example: #((A,B),Z) vs (((A,B),Z),C), where C doesn't occur in tree 1. Do you want the matched node #from tree 2 to be A,B,Z or A,B,Z,C. equivCut <- function(tree1, tree2, results, least.inclusive) { #identify nodes that appear more than once dups <- unique(results[,2][duplicated(results[,2])]) #if there aren't any, return the original results if(length(dups)==0) { return(results) } #loop through, add details on size of clade to data frame for(i in 1:length(dups)) { #set aside all the results that pertain to that node setAside <- results[results[,2]==dups[i],] #go through each node and figure out how many taxa descend from that node temp <- data.frame(matrix(nrow=dim(setAside)[1], ncol=2)) names(temp) <- c("node","no.taxa") for(j in 1:dim(setAside)[1]) { temp[j,"node"] <- setAside[j,"tree1"] temp[j,"no.taxa"] <- length(extract.clade(tree1, setAside[j,"tree1"])$tip.label) } #now you can decide which of these to keep if(least.inclusive) { keep <- temp$node[temp$no.taxa==min(temp$no.taxa)] toDrop <- setdiff(temp$node, keep) results <- results[!(results$tree1 %in% toDrop),] } else { keep <- temp$node[temp$no.taxa==max(temp$no.taxa)] toDrop <- setdiff(temp$node, keep) results <- results[!(results$tree1 %in% toDrop),] } } results } idMatches <- function(tree1, tree2, least.inclusive=TRUE) { #set the results table up results <- tableSetup(tree1, tree2) #define tree1 nodes here nodes1 <- (length(tree1$tip.label)+1):max(tree1$edge) for(i in 1:length(nodes1)) { #set the correct row in the tree1 column to the node in question results[i,1] <- nodes1[i] #find the descendants of this node tips1 <- tips(tree1, nodes1[i]) #drop these to just tips that are in tree 2 tips1 <- tips1[tips1 %in% tree2$tip.label] #if there's nothing left, set to no match and move on if(length(tips1)==0) { results[i,2] <- "no.match" next() } #find the MRCA of those tips in tree 2 #if there's only a single taxon left, pull the node it descends from (getMRCA will fail) if(length(tips1==1)) { edge2 <- which.edge(tree2, tips1) mrca2 <- tree2$edge[edge2,][1] } else { mrca2 <- getMRCA(tree2, tips1) } #find the descendants of that node tips2 <- tips(tree2, mrca2) #drop to just taxa that are in tree1 tips2 <- tips2[tips2 %in% tree1$tip.label] #if these tips are equivalent, the nodes match if(setequal(tips1, tips2)) { results[i,2] <- mrca2 } else { results[i,2] <- "no.match" } } #consider what you want to do next. do you want to, e.g., add a row for #every not-yet-mentioned tree2 node and add "no.match" to the first column, #or do you want to return as is, or do you want to only return matches? for #now, drop to only matches and re-class both as integer so can demonstrate #equality no matter which tree is first or second results <- results[results[,2] != "no.match",] results[,2] <- as.integer(results[,2]) #run the equivCut function results <- equivCut(tree1, tree2, results, least.inclusive) results } #always returns equivalent no matter which tree is first if you set it to least inclusive = TRUE system.time(test <- idMatches(first,second)) system.time(test2 <- idMatches(second,first)) setequal(test[,1], test2[,2]) setequal(test[,2], test2[,1]) #always returns equivalent no matter which tree is first if you set it to least inclusive = TRUE system.time(test3 <- idMatches(second, third)) system.time(test4 <- idMatches(third, second)) setequal(test3[,1], test4[,2]) setequal(test3[,2], test4[,1]) #here's how you might compare a set of trees to a single reference one ref <- sim.bdtree(n=100) target1 <- sim.bdtree(n=78) target2 <- sim.bdtree(n=86) vs1 <- idMatches(ref,target1) vs2 <- idMatches(ref,target2) names(vs1) <- c("ref","target1") names(vs2) <- c("ref","target2") final <- data.frame(ref=unique(c(vs1$ref, vs2$ref))) final <- merge(final, vs1, all.x=TRUE) final <- merge(final, vs2, all.x=TRUE) final On Thu, Feb 18, 2021 at 8:56 AM Emmanuel Paradis <emmanuel.para...@ird.fr> wrote: > Hi Jacob, > > ape::makeNodeLabel(phy, method = "md5sum") returns 'phy' with node labels > that depend on the tips descendant from each node. For instance: > > tr3 <- makeNodeLabel(rtree(3), m = "m") > tr4 <- makeNodeLabel(rtree(4), m = "m") > any(tr3$node.label %in% tr4$node.label) > > If you repeat these 3 commands several times, you should have ~20% of > TRUE. In your case, match() should make more sense. > > Also, I suppose your trees are rooted. If they are unrooted, you should > consider using splits (or root them). > > Best, > > Emmanuel > > ----- Le 18 Fév 21, à 0:59, Jacob Berv jakeberv.r.sig.ph...@gmail.com a > écrit : > > Dear R-sig-phylo, > > > > Over the weekend, I asked Liam Revell if he had a solution to use > matchNodes for > > a particular problem I’m trying to solve—finding all phylogenetically > > equivalent nodes when comparing trees that have uneven taxon samples and > > different topologies. Liam was kind enough to take some time to write a > blog > > post about this, and got me started with some code > > > > > http://blog.phytools.org/2021/02/on-matching-nodes-between-trees-using.html > > > > On it’s face this seems like a simple problem, but I’m running into some > issues > > and thought I would reach out to the broader group. The code linked > above seems > > to work, but only for comparing trees that start out as topologically > > identical. For my purposes, I’m trying to match nodes from a given a > reference, > > to nodes in and across several hundred gene trees that differ in > topology and > > taxon sample relative to the reference. > > > > Here is a function definition based on Liam’s example > > > > #function to match nodes from consensus > > #to individual gene trees with uneven sampling > > #derived from Liam Revell's example-- need to > > testmatch_phylo_nodes<-function(t1, t2){ > > ## step one drop tips > > t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label)) > > t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label)) > > > > ## step two match nodes "descendants" > > M<-matchNodes(t1p,t2p) > > > > ## step two match nodes "distances" > > M1<-matchNodes(t1,t1p,"distances") > > M2<-matchNodes(t2,t2p,"distances") > > > > ## final step, reconcile > > MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right"))) > > > > for(i in 1:nrow(MM)){ > > MM[i,1]<-M1[i,1] > > nn<-M[which(M[,1]==M1[i,2]),2] > > if(length(nn)>0){ > > MM[i,2]<-M2[which(M2[,2]==nn),1] > > } > > } > > return(MM) > > } > > > > > > When t1 and t2 are trees that have topological conflicts, this function > returns > > an error: > > > > Error in MM[i, 2] <- M2[which(M2[, 2] == nn), 1] : > > replacement has length zero > > > > I think(?) this happens because a particular node doesn’t exist in one > or the > > other trees, and it returns integer(0) at that line — but I’m not sure I > really > > understand what is going on here. > > > > > > I modified Liam’s code slightly to get it to run without error in the > described > > case, by making it conditional on that particular line: > > > > > > Modified version > > > > #function to match nodes from consensus > > #to individual gene trees with uneven sampling > > #derived from Liam Revell's example-- need to test > > match_phylo_nodes<-function(t1, t2){ > > ## step one drop tips > > t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label)) > > t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label)) > > > > ## step two match nodes "descendants" > > M<-matchNodes(t1p,t2p) > > > > ## step two match nodes "distances" > > M1<-matchNodes(t1,t1p,"distances") > > M2<-matchNodes(t2,t2p,"distances") > > > > ## final step, reconcile > > MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right"))) > > > > for(i in 1:nrow(MM)){ > > MM[i,1]<-M1[i,1] > > nn<-M[which(M[,1]==M1[i,2]),2] > > if(length(nn)>0){ > > if(length(which(M2[,2]==nn))>0){ > > MM[i,2]<-M2[which(M2[,2]==nn),1] > > } > > } else { > > } > > } > > return(MM) > > } > > > > > > I’ve been experimenting with this and some downstream code for the last > few > > days, but I’ve run into some weird inconsistent results (not easily > summarized) > > that make me think that this function is not working as intended. > > > > I was wondering — have any of you dealt with a similar problem? In > principle > > this seems like it should be similar to concordance analysis, but I care > less > > about identifying the proportion of nodes that exist in gene trees given > a > > reference, and instead I need the actual node numbers in a given gene > tree that > > are phylogenetically equivalent to particular nodes in a reference. > Happy to > > try to hack away at something… > > > > > > Best, > > Jake Berv > > > > > > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > R-sig-phylo mailing list - R-sig-phylo@r-project.org > > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > > Searchable archive at > http://www.mail-archive.com/r-sig-phylo@r-project.org/ > > _______________________________________________ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at > http://www.mail-archive.com/r-sig-phylo@r-project.org/ > [[alternative HTML version deleted]] _______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/