Re: [R-sig-phylo] identifying phylogenetically equivalent nodes
Assuming trees are rooted and fully bifurcating, I think the following will work. I tried a few tests and it seemed to work. I traded a few emails back and forth with Jake to make sure I understood what he wanted. Guess what I was supposed to be working on wasn't holding my attention... Best, Eliot library(geiger) first <- sim.bdtree(n=50) second <- sim.bdtree(n=1000) third <- sim.bdtree(n=2000) tableSetup <- function(tree1, tree2) { output <- data.frame(matrix(nrow=min(c(length(tree1$tip.label),length(tree2$tip.label)))-1, ncol=2)) names(output) <- c("tree1", "tree2") output } #least inclusive=if there are multiple nodes in tree 2 which could potentially be mapped to #a node in tree 1, do you want to keep the smallest or the largest one? for example: #((A,B),Z) vs (((A,B),Z),C), where C doesn't occur in tree 1. Do you want the matched node #from tree 2 to be A,B,Z or A,B,Z,C. equivCut <- function(tree1, tree2, results, least.inclusive) { #identify nodes that appear more than once dups <- unique(results[,2][duplicated(results[,2])]) #if there aren't any, return the original results if(length(dups)==0) { return(results) } #loop through, add details on size of clade to data frame for(i in 1:length(dups)) { #set aside all the results that pertain to that node setAside <- results[results[,2]==dups[i],] #go through each node and figure out how many taxa descend from that node temp <- data.frame(matrix(nrow=dim(setAside)[1], ncol=2)) names(temp) <- c("node","no.taxa") for(j in 1:dim(setAside)[1]) { temp[j,"node"] <- setAside[j,"tree1"] temp[j,"no.taxa"] <- length(extract.clade(tree1, setAside[j,"tree1"])$tip.label) } #now you can decide which of these to keep if(least.inclusive) { keep <- temp$node[temp$no.taxa==min(temp$no.taxa)] toDrop <- setdiff(temp$node, keep) results <- results[!(results$tree1 %in% toDrop),] } else { keep <- temp$node[temp$no.taxa==max(temp$no.taxa)] toDrop <- setdiff(temp$node, keep) results <- results[!(results$tree1 %in% toDrop),] } } results } idMatches <- function(tree1, tree2, least.inclusive=TRUE) { #set the results table up results <- tableSetup(tree1, tree2) #define tree1 nodes here nodes1 <- (length(tree1$tip.label)+1):max(tree1$edge) for(i in 1:length(nodes1)) { #set the correct row in the tree1 column to the node in question results[i,1] <- nodes1[i] #find the descendants of this node tips1 <- tips(tree1, nodes1[i]) #drop these to just tips that are in tree 2 tips1 <- tips1[tips1 %in% tree2$tip.label] #if there's nothing left, set to no match and move on if(length(tips1)==0) { results[i,2] <- "no.match" next() } #find the MRCA of those tips in tree 2 #if there's only a single taxon left, pull the node it descends from (getMRCA will fail) if(length(tips1==1)) { edge2 <- which.edge(tree2, tips1) mrca2 <- tree2$edge[edge2,][1] } else { mrca2 <- getMRCA(tree2, tips1) } #find the descendants of that node tips2 <- tips(tree2, mrca2) #drop to just taxa that are in tree1 tips2 <- tips2[tips2 %in% tree1$tip.label] #if these tips are equivalent, the nodes match if(setequal(tips1, tips2)) { results[i,2] <- mrca2 } else { results[i,2] <- "no.match" } } #consider what you want to do next. do you want to, e.g., add a row for #every not-yet-mentioned tree2 node and add "no.match" to the first column, #or do you want to return as is, or do you want to only return matches? for #now, drop to only matches and re-class both as integer so can demonstrate #equality no matter which tree is first or second results <- results[results[,2] != "no.match",] results[,2] <- as.integer(results[,2]) #run the equivCut function results <- equivCut(tree1, tree2, results, least.inclusive) results } #always returns equivalent no matter which tree is first if you set it to least inclusive = TRUE system.time(test <- idMatches(first,second)) system.time(test2 <- idMatches(second,first)) setequal(test[,1], test2[,2]) setequal(test[,2], test2[,1]) #always returns equivalent no matter which tree is first if you set it to least inclusive = TRUE system.time(test3 <- idMatches(second, third)) system.time(test4 <- idMatches(third, second)) setequal(test3[,1], test4[,2]) setequal(test3[,2], test4[,1]) #here's how you might compare a set of trees to a single reference one ref <- sim.bdtree(n=100) target1 <- sim.bdtree(n=78) target2 <- sim.bdtree(n=86) vs1 <- idMatches(ref,target1) vs2 <- idMatches(ref,target2) names(vs1) <- c("ref","target1") names(vs2) <- c("ref","target2") final <- data.frame(ref=unique(c(vs1$ref, vs2$ref))) final <- merge(final, vs1, all.x=TRUE) final <- merge(final, vs2, all.x=TRUE) final On Thu, Feb 18, 2021 at 8:56 AM Emmanuel Paradis wrote: > Hi
Re: [R-sig-phylo] identifying phylogenetically equivalent nodes
Hi Jacob, ape::makeNodeLabel(phy, method = "md5sum") returns 'phy' with node labels that depend on the tips descendant from each node. For instance: tr3 <- makeNodeLabel(rtree(3), m = "m") tr4 <- makeNodeLabel(rtree(4), m = "m") any(tr3$node.label %in% tr4$node.label) If you repeat these 3 commands several times, you should have ~20% of TRUE. In your case, match() should make more sense. Also, I suppose your trees are rooted. If they are unrooted, you should consider using splits (or root them). Best, Emmanuel - Le 18 Fév 21, à 0:59, Jacob Berv jakeberv.r.sig.ph...@gmail.com a écrit : > Dear R-sig-phylo, > > Over the weekend, I asked Liam Revell if he had a solution to use matchNodes > for > a particular problem I’m trying to solve—finding all phylogenetically > equivalent nodes when comparing trees that have uneven taxon samples and > different topologies. Liam was kind enough to take some time to write a blog > post about this, and got me started with some code > > http://blog.phytools.org/2021/02/on-matching-nodes-between-trees-using.html > > On it’s face this seems like a simple problem, but I’m running into some > issues > and thought I would reach out to the broader group. The code linked above > seems > to work, but only for comparing trees that start out as topologically > identical. For my purposes, I’m trying to match nodes from a given a > reference, > to nodes in and across several hundred gene trees that differ in topology and > taxon sample relative to the reference. > > Here is a function definition based on Liam’s example > > #function to match nodes from consensus > #to individual gene trees with uneven sampling > #derived from Liam Revell's example-- need to > testmatch_phylo_nodes<-function(t1, t2){ > ## step one drop tips > t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label)) > t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label)) > > ## step two match nodes "descendants" > M<-matchNodes(t1p,t2p) > > ## step two match nodes "distances" > M1<-matchNodes(t1,t1p,"distances") > M2<-matchNodes(t2,t2p,"distances") > > ## final step, reconcile > MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right"))) > > for(i in 1:nrow(MM)){ >MM[i,1]<-M1[i,1] >nn<-M[which(M[,1]==M1[i,2]),2] >if(length(nn)>0){ >MM[i,2]<-M2[which(M2[,2]==nn),1] >} > } > return(MM) > } > > > When t1 and t2 are trees that have topological conflicts, this function > returns > an error: > > Error in MM[i, 2] <- M2[which(M2[, 2] == nn), 1] : > replacement has length zero > > I think(?) this happens because a particular node doesn’t exist in one or the > other trees, and it returns integer(0) at that line — but I’m not sure I > really > understand what is going on here. > > > I modified Liam’s code slightly to get it to run without error in the > described > case, by making it conditional on that particular line: > > > Modified version > > #function to match nodes from consensus > #to individual gene trees with uneven sampling > #derived from Liam Revell's example-- need to test > match_phylo_nodes<-function(t1, t2){ > ## step one drop tips > t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label)) > t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label)) > > ## step two match nodes "descendants" > M<-matchNodes(t1p,t2p) > > ## step two match nodes "distances" > M1<-matchNodes(t1,t1p,"distances") > M2<-matchNodes(t2,t2p,"distances") > > ## final step, reconcile > MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right"))) > > for(i in 1:nrow(MM)){ > MM[i,1]<-M1[i,1] > nn<-M[which(M[,1]==M1[i,2]),2] >if(length(nn)>0){ > if(length(which(M2[,2]==nn))>0){ > MM[i,2]<-M2[which(M2[,2]==nn),1] > } >} else { >} > } > return(MM) > } > > > I’ve been experimenting with this and some downstream code for the last few > days, but I’ve run into some weird inconsistent results (not easily > summarized) > that make me think that this function is not working as intended. > > I was wondering — have any of you dealt with a similar problem? In principle > this seems like it should be similar to concordance analysis, but I care less > about identifying the proportion of nodes that exist in gene trees given a > reference, and instead I need the actual node numbers in a given gene tree > that > are phylogenetically equivalent to particular nodes in a reference. Happy to > try to hack away at something… > > > Best, > Jake Berv > > > > > > [[alternative HTML version deleted]] > > ___ > R-sig-phylo mailing list - R-sig-phylo@r-project.org > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/ ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz
[R-sig-phylo] identifying phylogenetically equivalent nodes
Dear R-sig-phylo, Over the weekend, I asked Liam Revell if he had a solution to use matchNodes for a particular problem I’m trying to solve—finding all phylogenetically equivalent nodes when comparing trees that have uneven taxon samples and different topologies. Liam was kind enough to take some time to write a blog post about this, and got me started with some code http://blog.phytools.org/2021/02/on-matching-nodes-between-trees-using.html On it’s face this seems like a simple problem, but I’m running into some issues and thought I would reach out to the broader group. The code linked above seems to work, but only for comparing trees that start out as topologically identical. For my purposes, I’m trying to match nodes from a given a reference, to nodes in and across several hundred gene trees that differ in topology and taxon sample relative to the reference. Here is a function definition based on Liam’s example #function to match nodes from consensus #to individual gene trees with uneven sampling #derived from Liam Revell's example-- need to testmatch_phylo_nodes<-function(t1, t2){ ## step one drop tips t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label)) t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label)) ## step two match nodes "descendants" M<-matchNodes(t1p,t2p) ## step two match nodes "distances" M1<-matchNodes(t1,t1p,"distances") M2<-matchNodes(t2,t2p,"distances") ## final step, reconcile MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right"))) for(i in 1:nrow(MM)){ MM[i,1]<-M1[i,1] nn<-M[which(M[,1]==M1[i,2]),2] if(length(nn)>0){ MM[i,2]<-M2[which(M2[,2]==nn),1] } } return(MM) } When t1 and t2 are trees that have topological conflicts, this function returns an error: Error in MM[i, 2] <- M2[which(M2[, 2] == nn), 1] : replacement has length zero I think(?) this happens because a particular node doesn’t exist in one or the other trees, and it returns integer(0) at that line — but I’m not sure I really understand what is going on here. I modified Liam’s code slightly to get it to run without error in the described case, by making it conditional on that particular line: Modified version #function to match nodes from consensus #to individual gene trees with uneven sampling #derived from Liam Revell's example-- need to test match_phylo_nodes<-function(t1, t2){ ## step one drop tips t1p<-drop.tip(t1,setdiff(t1$tip.label, t2$tip.label)) t2p<-drop.tip(t2,setdiff(t2 $tip.label, t1$tip.label)) ## step two match nodes "descendants" M<-matchNodes(t1p,t2p) ## step two match nodes "distances" M1<-matchNodes(t1,t1p,"distances") M2<-matchNodes(t2,t2p,"distances") ## final step, reconcile MM<-matrix(NA,t1$Nnode,2,dimnames=list(NULL,c("left","right"))) for(i in 1:nrow(MM)){ MM[i,1]<-M1[i,1] nn<-M[which(M[,1]==M1[i,2]),2] if(length(nn)>0){ if(length(which(M2[,2]==nn))>0){ MM[i,2]<-M2[which(M2[,2]==nn),1] } } else { } } return(MM) } I’ve been experimenting with this and some downstream code for the last few days, but I’ve run into some weird inconsistent results (not easily summarized) that make me think that this function is not working as intended. I was wondering — have any of you dealt with a similar problem? In principle this seems like it should be similar to concordance analysis, but I care less about identifying the proportion of nodes that exist in gene trees given a reference, and instead I need the actual node numbers in a given gene tree that are phylogenetically equivalent to particular nodes in a reference. Happy to try to hack away at something… Best, Jake Berv [[alternative HTML version deleted]] ___ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/