I have found that ".compressTipLabel" can take quite a while for larger tree distributions. By vectorizing the for loop, I've gotten substantial speed improvements. For really big distributions, multithreading helps further (though it does come with a memory cost).
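For reference, typical usage of the function below would look like this (a sketch only; the 'trees' object and core count are stand-ins):

library(ape)
set.seed(42)
trees <- rmtree(1000, 100)             # stand-in tree distribution
trees <- manageTipLabels(trees)        # single-core, vectorized
# or, with the multicore package installed (not on Windows or in a GUI):
trees <- manageTipLabels(trees, mc=TRUE, numCores=4)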
The code itself is below. HTH. Joseph.

# rearrange tip labels so they are consistent with some ordering,
# either first tree or from 'refTree'
# taken from ape:::.compressTipLabel
# changes: 1) vectorize for loop, 2) add multicore option
manageTipLabels <- function (phy, refTree=NULL, mc=FALSE, numCores=NULL) {
    if (mc) check.multicore();
    cat("\nManaging tip label ordering across trees...");
    if (!is.null(attr(phy, "TipLabel"))) return(phy);
    if (is.null(refTree)) ref <- phy[[1]]$tip.label else ref <- refTree$tip.label;
    if (any(table(ref) != 1)) stop("some tip labels are duplicated in the reference tree");
    n <- length(ref);
    relabel <- function (y) {
        label <- y$tip.label;
        if (!identical(label, ref)) {
            if (length(label) != length(ref)) stop("a tree has a different number of tips than the reference");
            ilab <- match(label, ref);
            if (any(is.na(ilab))) stop("a tree has tip labels not present in the reference");
            ie <- match(1:n, y$edge[, 2]);
            y$edge[ie, 2] <- ilab;
        }
        y$tip.label <- NULL;
        y;
    }
    # numCores=NULL leaves the core count to multicore's default
    if (mc) phy <- mclapply(phy, relabel, mc.cores=numCores) else phy <- lapply(phy, relabel);
    attr(phy, "TipLabel") <- ref;
    class(phy) <- "multiPhylo";
    cat(" done.\n\n");
    return(phy);
}

# Windows or GUI == no good
check.multicore <- function () {
    if (Sys.info()["sysname"] == "Windows") { # no soup for you!
        stop("\"mc\" argument requested, but package \"multicore\" is not available for Windows. Stopping.\n");
    }
    if (!is.na(Sys.getenv()["R_GUI_APP_VERSION"])) { # or you!
        stop("\"mc\" argument requested, but package \"multicore\" cannot be run in a GUI environment. Stopping.\n");
    }
    tmp <- rownames(installed.packages());
    if ("multicore" %in% tmp) {
        require(multicore);
        return(TRUE);
    } else {
        x <- paste("\nWhoops. \"mc\" argument requested, but package \"multicore\" is not installed. Install with:\n\n",
            "\tinstall.packages(\"multicore\", INSTALL_opts=\"--byte-compile\");\n\n",
            "Stopping.\n", sep="");
        stop(x);
    }
}

On 10 Dec, 2012, at 06:17, Klaus Schliep <klaus.schl...@gmail.com> wrote:

> Hi José,
>
> When R code is too slow, parallelisation is pretty much the last
> thing on my agenda. It often cannot be achieved easily, and it is
> hardware dependent (fork, openMP, MPI, GPGPU). mclapply is the
> exception, as it is easy to use, but it does not work on Windows or
> in a GUI environment.
> I always start by profiling the code to identify the functions in
> which most of the computational time is spent. R has really good
> profiling tools (for memory and speed) to do this; have a look at
> the example in summaryRprof.
> Then try to speed up the R code (e.g. avoid loops, vectorize, or
> just compare functions from different packages); if that is not
> enough, use some C/FORTRAN code, and only after this would I think
> of parallelisation.
>
> For your problem, I agree with the other comments that
> parallelisation is likely overkill. I assume it is enough to use the
> S3 class "phylo" (ape) and some functions from phangorn (e.g.
> Descendants and mrca.phylo below) instead of the S4 class "phylo4"
> from phylobase. These functions do less error checking and are not
> as comfortable to use, but they are sometimes substantially faster;
> see the time comparison below.
>
> Additionally, you may also change the order in which you loop over
> trees and nodes. This way you can avoid multiple calls to
> Descendants, but this will be more work. Computing the list of
> Descendants for a vector of nodes takes pretty much the same time as
> for one single node and can be done with one call. So your code may
> be even an order of magnitude faster!!
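To illustrate Klaus's last point, a minimal sketch (the tree and the node numbers are purely illustrative):

library(ape)
library(phangorn)
set.seed(1)
tr <- rtree(100)      # tips 1:100, internal nodes 101:199
nodes <- 102:150      # an illustrative set of internal nodes
# one call per node (slow):
desc.loop <- lapply(nodes, function(n) Descendants(tr, n, type="tips")[[1]])
# a single vectorized call (fast):
desc.vec <- Descendants(tr, nodes, type="tips")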
>
> If you give it a try, make sure that the trees all have the same
> ordering of taxon labels. The function ".compressTipLabel" can be
> very useful in this respect.
> Here is a function as a replacement for MRCA (it will be available
> in the next phangorn release):
>
> mrca.phylo <- function(x, node){
>     anc <- Ancestors(x, node, type = "all")
>     res <- Reduce(intersect, anc)[1]
>     res
> }
>
> And here are some time comparisons:
>> library(phylobase)
>> library(phangorn)
>> set.seed(123)
>> tree = rtree(1000)
>> tree4 = as(tree, "phylo4")
>> system.time(Descendants(tree, 1002))
>    user  system elapsed
>       0       0       0
>> system.time(descendants(tree4, 1002))
>    user  system elapsed
>   0.012   0.000   0.015
>> system.time(Descendants(tree, 1002:1111))
>    user  system elapsed
>   0.000   0.000   0.001
>> system.time(descendants(tree4, 1002:1111))
>    user  system elapsed
>   0.676   0.000   0.678
>> system.time(MRCA(tree4, 1:5))
>    user  system elapsed
>   0.296   0.000   0.297
>> system.time(mrca.phylo(tree, 1:5))
>    user  system elapsed
>   0.000   0.000   0.002
>
> Regards,
> Klaus
>
> On 12/10/12, José Hidasi <hidasin...@gmail.com> wrote:
>> Thank you very much for the suggestions.
>>
>> I will take a look at these packages and tools, especially Xgrid
>> and mclapply(). They will surely be a great help in minimizing the
>> time of the analysis I am doing. So in this case, I think the best
>> way is to divide the analysis across cores, with multithreaded
>> loops.
>>
>> Thanks again for the ideas; they were of great help.
>>
>> Best,
>> José
>>
>> On Sun, Dec 9, 2012 at 4:15 AM, Brian O'Meara <bome...@utk.edu> wrote:
>>
>>> I agree with Daniel that going parallel is probably overkill in
>>> this case. However, if you do want to get into parallelization
>>> with R, a good place to start is the CRAN task view on high
>>> performance and parallel computing:
>>> http://cran.r-project.org/web/views/HighPerformanceComputing.html .
>>> Like all task views, it provides an overview of the packages in
>>> that domain and an easy way to install all of them at once (see
>>> http://cran.r-project.org/web/views/ ). The built-in parallel
>>> package (since R 2.14) makes it easy to use all your cores via
>>> mclapply(). If you tend to think in for loops (as most of us do)
>>> rather than apply functions, the foreach package combined with
>>> doMC is another easy way; there's a good vignette for this at
>>> http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf .
>>>
>>> Best,
>>> Brian
>>>
>>> _______________________________________
>>> Brian O'Meara
>>> Assistant Professor
>>> Dept. of Ecology & Evolutionary Biology
>>> U. of Tennessee, Knoxville
>>> http://www.brianomeara.info
>>>
>>> Students wanted: Applications due Dec. 15, annually
>>> Postdoc collaborators wanted: Check NIMBioS' website
>>> Calendar: http://www.brianomeara.info/calendars/omeara
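For concreteness, a minimal sketch of the two approaches Brian mentions (the per-tree function, tree sample, and core count are all stand-ins):

library(ape)
library(parallel)                       # built in since R 2.14
trees <- rmtree(100, 50)                # stand-in tree sample
f <- function(tr) sum(tr$edge.length)   # stand-in per-tree computation
res.mc <- mclapply(trees, f, mc.cores = 4)

# the same loop with foreach + doMC:
library(foreach)
library(doMC)
registerDoMC(4)
res.fe <- foreach(tr = trees) %dopar% f(tr)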
>>>
>>> On Sat, Dec 8, 2012 at 7:01 AM, Daniel Barker <d...@st-andrews.ac.uk> wrote:
>>>
>>>> Dear Jose,
>>>>
>>>> Is this a problem in practice? The calculation of branch lengths
>>>> Brian describes does not sound very time-consuming to run. If
>>>> you're dealing with an enormous tree or an enormous sample, it
>>>> would take some time -- but presumably, the greater bottleneck
>>>> would be obtaining the sample of trees in the first place.
>>>>
>>>> Anything can become time-consuming if repeated, e.g. if you're
>>>> dealing with thousands of separate samples of trees. If things
>>>> are taking too long in that situation, the best solution would be
>>>> to run serial processes, not multithreaded ones -- but many of
>>>> them, submitted to a batch queuing system running on the
>>>> computer. Examples include SLURM, GridEngine, LSF, Unix 'batch'
>>>> (comes as part of OS X, but I've never managed to set up the load
>>>> threshold appropriately on OS X), or perhaps Xgrid (I don't know
>>>> it).
>>>>
>>>> Except in very unusual situations, this kind of 'task farming',
>>>> if properly set up, will always be at least as efficient as
>>>> 'true' parallelisation like multithreading -- often faster. It
>>>> also doesn't require any special programming, and parallel
>>>> programming can be fairly painful.
>>>>
>>>> An even simpler approach is to divide the task into n
>>>> approximately equally time-consuming scripts (if you have n CPU
>>>> cores) and launch them all at the same time -- which uses all
>>>> your cores and doesn't even require a batch queuing system.
>>>>
>>>> Best wishes,
>>>>
>>>> Daniel
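Daniel's chunking idea, sketched in R (the chunk count, tree sample, and file names are hypothetical; splitIndices() is from the parallel package):

library(ape)
library(parallel)
trees <- rmtree(1000, 50)                 # stand-in tree sample
n <- 4                                    # number of chunks / cores
chunks <- splitIndices(length(trees), n)
# write each chunk to its own file; each file becomes one serial job
for (i in seq_len(n))
    write.tree(trees[chunks[[i]]], file = sprintf("chunk%02d.tre", i))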
>>>>
>>>> On 08/12/2012 02:43, "José Hidasi" <hidasin...@gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> I want to create a consensus tree with branch lengths (Brian
>>>>> O'Meara's function in the post "[R-sig-phylo] Why no branch
>>>>> lengths on consensus trees?") using a Mac workstation. However,
>>>>> if I simply run the function, R will not use all cores for the
>>>>> analysis. I would like to know if there is a function, or any
>>>>> other way, to divide the analysis among the cores, or to use all
>>>>> cores when running it.
>>>>>
>>>>> I know this is not the best forum for a question like this
>>>>> (using multiple cores), but I imagined someone here might have a
>>>>> solution, as some of you work with large databases.
>>>>>
>>>>> Best,
>>>>> José Hidasi
>>>>>
>>>>> This is Brian O'Meara's function, which I am using to create the
>>>>> consensus tree, from the post "[R-sig-phylo] Why no branch
>>>>> lengths on consensus trees?":
>>>>>
>>>>> "I have a function to create a consensus tree with branch
>>>>> lengths. You feed it a given topology (often a consensus
>>>>> topology, made with ape), then a list of trees, and tell it what
>>>>> you want the branch lengths to represent. It could be the
>>>>> proportion of input trees with that edge (good for summarizing
>>>>> bootstrap or Bayes proportions) or the mean, median, or sd of
>>>>> branch lengths for those trees that have that edge. Consensus
>>>>> branch lengths in units of proportion of matching trees have
>>>>> obvious utility. As Daniel says, it is more difficult to see a
>>>>> use case for average branch lengths across a set of trees, but
>>>>> you could imagine doing something like taking the ratogram
>>>>> output from r8s on a set of trees and summarizing the rate
>>>>> average and rate sd on a given, "best", tree as two sets of
>>>>> branch lengths on that tree.
>>>>>
>>>>> I've put the function source at
>>>>> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/consensusBrlen.R?revision=110&root=omearalab .
>>>>> You can source the file for the function (consensusBrlen()) and
>>>>> the other functions it needs. It also uses phylobase. Note that
>>>>> this is alpha-quality code -- it's been checked a bit, but
>>>>> verify it's doing what you want.
>>>>>
>>>>> Here's an example of how to use it:
>>>>>
>>>>> library(ape)
>>>>> library(phylobase)
>>>>> phy.a <- rcoal(15)
>>>>> phy.b <- phy.a
>>>>> phy.b$edge.length <- phy.b$edge.length +
>>>>>     runif(length(phy.b$edge.length), 0, 0.1)
>>>>> phy.c <- rcoal(15)
>>>>> phy.list <- list(phy.a, phy.b, phy.c)
>>>>> phy.consensus <- consensusBrlen(phy.a, list(phy.a, phy.b, phy.c),
>>>>>     type="mean_brlen")"
>>>>>
>>>>> --
>>>>> José Hidasi Neto
>>>>> Graduated in Biological Sciences - Universidade Federal de Goiás (UFG)
>>>>> Master's candidate in Ecology and Evolution - Community Ecology
>>>>> and Functioning Lab - UFG
>>>>> Lattes: http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4293841A0
>>>>
>>>> --
>>>> Daniel Barker
>>>> http://bio.st-andrews.ac.uk/staff/db60.htm
>>>> The University of St Andrews is a charity registered in Scotland : No SC013532
>>
>> --
>> José Hidasi Neto
>> Graduated in Biological Sciences - Universidade Federal de Goiás (UFG)
>> Master's candidate in Ecology and Evolution - Community Ecology and
>> Functioning Lab - UFG
>> Lattes: http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4293841A0
>> Website: http://rfunctions.blogspot.com.br/
>
> --
> Klaus Schliep
> Phylogenomics Lab at the University of Vigo, Spain

_____________________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu
_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/