I have found that ".compressTipLabel" can take quite a while for larger tree 
distributions. By vectorizing the for loop I got substantial speed 
improvements, and for really big distributions multithreading definitely 
helps (though it does come with a memory cost).

Code is below. HTH.
Joseph.


# rearrange tip labels so they are consistent with some ordering, taken either
# from the first tree or from 'refTree'
# adapted from ape:::.compressTipLabel
# changes: 1) vectorize the for loop, 2) add a multicore option
manageTipLabels <- function (phy, refTree=NULL, mc=FALSE, numCores=NULL) {
        if (mc) check.multicore();
        
        cat("\nManaging tip label ordering across trees...");
        
        if (!is.null(attr(phy, "TipLabel"))) return(phy); # already compressed
        
        if (is.null(refTree)) ref <- phy[[1]]$tip.label
        else ref <- refTree$tip.label;
        
        if (any(table(ref) != 1)) stop("some tip labels are duplicated in the reference tree");
        
        n <- length(ref);
        
        # reorder a single tree's tip labels to match 'ref'
        relabel <- function (y) {
                label <- y$tip.label;
                if (!identical(label, ref)) {
                        if (length(label) != length(ref)) stop("a tree has a different number of tips");
                        ilab <- match(label, ref);
                        if (any(is.na(ilab))) stop("a tree has different tip labels");
                        # vectorized version of ape's original for loop:
                        # locate the edge ending at each tip, then renumber all tips in one call
                        ie <- match(1:n, y$edge[, 2]);
                        y$edge[ie, 2] <- ilab;
                }
                y$tip.label <- NULL;
                y;
        }
        
        if (mc) phy <- mclapply(phy, relabel, mc.cores=numCores)
        else phy <- lapply(phy, relabel);
        
        attr(phy, "TipLabel") <- ref;
        class(phy) <- "multiPhylo";
        
        cat(" done.\n\n");
        return(phy);
}

# Windows or GUI == no good for "multicore"
check.multicore <- function ()  {
        if (Sys.info()["sysname"] == "Windows") { # no soup for you!
                stop("\"mc\" argument requested, but package \"multicore\" is not available for Windows. Stopping.\n");
        }
        if (nzchar(Sys.getenv("R_GUI_APP_VERSION"))) { # or you!
                stop("\"mc\" argument requested, but package \"multicore\" cannot be run in a GUI environment. Stopping.\n");
        }
        if ("multicore" %in% rownames(installed.packages())) {
                require(multicore);
                return(TRUE);
        } else {
                stop(paste("\nWhoops. \"mc\" argument requested, but package \"multicore\" is not installed. Install with:\n\n",
                        "\tinstall.packages(\"multicore\", INSTALL_opts=\"--byte-compile\");\n\n",
                        "Stopping.\n", sep=""));
        }
}
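
A minimal usage sketch, assuming an ape-style multiPhylo distribution (the
tree count, tip count, and core number below are arbitrary placeholders):

# usage sketch: simulate a tree distribution, then harmonize tip ordering
library(ape);

trees <- rmtree(N=100, n=64);        # 100 random 64-tip trees (arbitrary sizes)
trees <- manageTipLabels(trees);     # serial; ordering taken from the first tree
# on a non-Windows, non-GUI session, the multicore route:
# trees <- manageTipLabels(trees, mc=TRUE, numCores=4);
attr(trees, "TipLabel");             # the shared tip label ordering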


On 10 Dec, 2012, at 06:17, Klaus Schliep <klaus.schl...@gmail.com> wrote:

> Hi José,
> 
> when R code is too slow, parallelisation is pretty much the last
> thing on my agenda. Parallelisation often cannot easily be
> achieved, and it is hardware dependent (fork, openMP, MPI, GPGPU).
> mclapply is the exception, as it is easy to use, but it does not
> work on Windows or in a GUI environment.
> I always start by profiling the code to identify in which functions
> most of the computational time is spent. R has really good profiling
> tools (for memory and speed) for this; have a look at the example in
> summaryRprof.
> Then try to speed up the R code (e.g. avoid loops, vectorize, or just
> compare functions from different packages); if that is not enough, use
> some C/FORTRAN code, and only after that would I think of parallelisation.
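> 
> A minimal profiling sketch (the output file name and the toy workload
> here are arbitrary stand-ins):
> 
> library(phangorn)                     # also attaches ape
> Rprof("prof.out")                     # start the sampling profiler
> for (i in 1:50) {
>     tr <- rtree(500)                  # toy workload: random trees ...
>     d  <- Descendants(tr, 502:600)    # ... plus some Descendants calls
> }
> Rprof(NULL)                           # stop profiling
> summaryRprof("prof.out")$by.self      # where the time actually goes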
> 
> For your problem, I agree with the other comments that parallelisation
> is likely overkill. I assume it is enough to use the S3 class
> "phylo" (ape) and some functions from phangorn (e.g. Descendants and
> the mrca.phylo below) instead of the S4 class "phylo4" from phylobase.
> These functions do less error checking and are not as comfortable to
> use, but they are sometimes substantially faster; see the time
> comparison below.
> 
> Additionally, you may also change the order in which you loop over
> trees and nodes. This way you can avoid multiple calls to Descendants,
> though it will be more work. Computing the descendants of a whole
> vector of nodes on a tree takes pretty much the same time as for a
> single node and can be done with one call (see the sketch below). So
> your code may even be an order of magnitude faster!
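> 
> Roughly the restructured loop (a sketch only; 'trees' stands for an
> assumed multiPhylo object, and the node vector is arbitrary):
> 
> nodes <- 1002:1111                    # hypothetical nodes of interest
> for (tr in trees) {                   # outer loop over trees ...
>     desc <- Descendants(tr, nodes)    # ... one Descendants call per tree
>     # then work through desc[[i]] in the inner loop over nodes
> }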
> 
> If you give it a try, make sure that the trees all have the same
> ordering of taxon labels. The function ".compressTipLabel" can be very
> useful in this respect.
> Here is a function to use as a replacement for MRCA (it will be
> available in the next phangorn release):
> 
> mrca.phylo <- function(x, node){
>    # ancestors of each node, ordered from immediate parent up to the root
>    anc <- Ancestors(x, node, type = "all")
>    # the first node shared by all ancestor paths is the MRCA
>    res <- Reduce(intersect, anc)[1]
>    res
> }
> 
> 
> And here are some time comparisons:
>> library(phylobase)
>> library(phangorn)
>> set.seed(123)
>> tree  = rtree(1000)
>> tree4 = as(tree, "phylo4")
>> system.time(Descendants(tree, 1002))
>   user  system elapsed
>      0       0       0
>> system.time(descendants(tree4, 1002))
>   user  system elapsed
>  0.012   0.000   0.015
>> system.time(Descendants(tree, 1002:1111))
>   user  system elapsed
>  0.000   0.000   0.001
>> system.time(descendants(tree4, 1002:1111))
>   user  system elapsed
>  0.676   0.000   0.678
>> system.time(MRCA(tree4, 1:5))
>   user  system elapsed
>  0.296   0.000   0.297
>> system.time(mrca.phylo(tree, 1:5))
>   user  system elapsed
>  0.000   0.000   0.002
> 
> 
> 
> Regards,
> Klaus
> 
> 
> On 12/10/12, José Hidasi <hidasin...@gmail.com> wrote:
>> Thank you very much for the suggestions,
>> 
>> I will take a look at this software and these packages, especially Xgrid
>> and mclapply(). They will surely be a great help in minimizing the time of
>> the analysis I am doing. So in this case, I think the best way is to divide
>> the analysis across the cores, with multithreaded loops.
>> 
>> Thanks again for the ideas, they were of great help.
>> 
>> Best,
>> José
>> 
>> 
>> On Sun, Dec 9, 2012 at 4:15 AM, Brian O'Meara <bome...@utk.edu> wrote:
>> 
>>> I agree with Daniel that going in parallel is probably overkill in this
>>> case. However, if you do want to get into parallelization with R, a good
>>> place to start is the CRAN task view on high performance and parallel
>>> computing:
>>> http://cran.r-project.org/web/views/HighPerformanceComputing.html . Like
>>> all task views, it provides an overview of the packages in that domain
>>> and
>>> an easy way to install all of them at once (see
>>> http://cran.r-project.org/web/views/ ). The built-in parallel package
>>> (since R 2.14) makes it easy to use all your cores via mclapply(). If
>>> you tend to think in for loops (as most of us do) rather than apply
>>> functions, the foreach package combined with doMC is another easy way;
>>> there's a good vignette for this at
>>> http://cran.r-project.org/web/packages/doMC/vignettes/gettingstartedMC.pdf
>>> and a small sketch below.
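>>> 
>>> A minimal foreach/doMC sketch (the core count and the workload are
>>> arbitrary stand-ins):
>>> 
>>> library(doMC)                # also loads foreach
>>> registerDoMC(cores=4)        # arbitrary core count
>>> results <- foreach(i=1:8) %dopar% {
>>>     sqrt(i)                  # stand-in for real per-tree work
>>> }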
>>> 
>>> Best,
>>> Brian
>>> 
>>> 
>>> _______________________________________
>>> Brian O'Meara
>>> Assistant Professor
>>> Dept. of Ecology & Evolutionary Biology
>>> U. of Tennessee, Knoxville
>>> http://www.brianomeara.info
>>> 
>>> Students wanted: Applications due Dec. 15, annually
>>> Postdoc collaborators wanted: Check NIMBioS' website
>>> Calendar: http://www.brianomeara.info/calendars/omeara
>>> 
>>> 
>>> On Sat, Dec 8, 2012 at 7:01 AM, Daniel Barker
>>> <d...@st-andrews.ac.uk>wrote:
>>> 
>>>> Dear Jose,
>>>> 
>>>> Is this a problem in practice? The calculation of branch lengths Brian
>>>> describes does not sound very time-consuming to run. If you're dealing
>>>> with an enormous tree or an enormous sample, it would take some time - but
>>>> presumably the greater bottleneck would be obtaining the sample of trees
>>>> in the first place.
>>>> 
>>>> Anything can become time-consuming if repeated, e.g. if you're dealing
>>>> with thousands of separate samples of trees. If things are taking too
>>>> long
>>>> in that situation, the best solution would be to run serial processes,
>>>> not
>>>> multithreaded - but many of them, submitted to a batch queuing system
>>>> running on the computer. Examples include SLURM, GridEngine, LSF, Unix
>>>> 'batch' (comes as part of OS X but I've never managed to set up the load
>>>> threshold appropriately on OS X), or perhaps Xgrid (I don't know it).
>>>> 
>>>> Except in very unusual situations, this kind of 'task farming', if
>>>> properly set up, will always be at least as efficient as 'true'
>>>> parallelisation like multithreading - often faster. It also doesn't
>>>> require any special programming, and parallel programming can be fairly
>>>> painful.
>>>> 
>>>> An even simpler approach is to divide the task into n approximately
>>>> equally time-consuming scripts (if you have n CPU cores) and launch them
>>>> all at the same time - which uses all your cores and doesn't even require
>>>> a batch queuing system (see the sketch below).
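>>>> 
>>>> In R, the splitting itself might look something like this (the chunk
>>>> count and file names are arbitrary; 'trees' stands for an assumed
>>>> multiPhylo object):
>>>> 
>>>> n <- 4                                         # one chunk per core
>>>> idx <- seq_along(trees)
>>>> chunks <- split(idx, cut(idx, n, labels=FALSE))
>>>> for (i in seq_len(n))
>>>>     saveRDS(trees[chunks[[i]]], sprintf("chunk%02d.rds", i))
>>>> # each independent script then starts with readRDS("chunkXX.rds")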
>>>> 
>>>> 
>>>> Best wishes,
>>>> 
>>>> Daniel
>>>> 
>>>> On 08/12/2012 02:43, "José Hidasi" <hidasin...@gmail.com> wrote:
>>>> 
>>>>> Dear all,
>>>>> 
>>>>> I want to create a consensus tree with branch lengths (using Brian
>>>>> O'Meara's function from the post "[R-sig-phylo] Why no branch lengths
>>>>> on consensus trees?") on a Mac workstation. However, if I simply run
>>>>> the function, R will not use all the cores for the analysis. I would
>>>>> like to know if there is a function or any other way to divide the
>>>>> analysis among the cores, or to use all the cores when running the
>>>>> program.
>>>>> 
>>>>> I know this is not the best forum to ask something like this (using
>>>>> multiple cores), but I imagined that someone might have a solution,
>>>>> as some of you work with large databases.
>>>>> 
>>>>> Best,
>>>>> José Hidasi
>>>>> 
>>>>> 
>>>>> 
>>>>> This is Brian O'Meara's function, which I am using to create the
>>>>> consensus tree, from the post "[R-sig-phylo] Why no branch lengths on
>>>>> consensus trees?":
>>>>> 
>>>>> "I have a function to create a consensus tree with branch lengths. You
>>>>> feed it a given topology (often a consensus topology, made with ape),
>>>> then
>>>>> a list of trees, and tell it what you want the branch lengths to
>>>>> represent. It could be the proportion of input trees with that edge
>>>>> (good
>>>>> for summarizing bootstrap or Bayes proportions) or the mean, median, or
>>>> sd
>>>>> of branch lengths for those trees that have that edge. Consensus branch
>>>>> lengths in units of proportion of matching trees has obvious utility.
>>>>> As Daniel says, the average branch lengths across a set of trees is
>>>>> more difficult to see a use case for, but you could imagine doing
>>>>> something
>>>>> like taking the ratogram output from r8s on a set of trees and
>>>> summarizing
>>>>> the rate average and rate sd on a given, "best", tree as two sets of
>>>>> branch
>>>>> lengths on that tree.
>>>>> 
>>>>> I've put the function source at
>>>>> 
>>>> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/R/consensusBrl
>>>>> en.R?revision=110&root=omearalab
>>>>> .
>>>>> You can source the file for the function (consensusBrlen() ) and
>>>>> other
>>>>> functions it needs. It also uses phylobase. Note that this is
>>>>> alpha-quality
>>>>> code -- it's been checked a bit, but verify it's doing what you want.
>>>>> 
>>>>> Here's an example of how to use it:
>>>>> 
>>>>> library(ape)
>>>>> library(phylobase)
>>>>> 
>>>>> phy.a <- rcoal(15)
>>>>> phy.b <- phy.a
>>>>> phy.b$edge.length <- phy.b$edge.length + runif(length(phy.b$edge.length), 0, 0.1)
>>>>> phy.c <- rcoal(15)
>>>>> phy.list <- list(phy.a, phy.b, phy.c)
>>>>> phy.consensus <- consensusBrlen(phy.a, phy.list, type="mean_brlen")"
>>>>> 
>>>>> --
>>>>> José Hidasi Neto
>>>>> Graduated in Biological Sciences - Universidade Federal de Goiás (UFG)
>>>>> Master's candidate in Ecology and Evolution - Community Ecology and
>>>>> Functioning Lab - UFG
>>>>> Lattes:
>>>>> http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4293841A0
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Daniel Barker
>>>> http://bio.st-andrews.ac.uk/staff/db60.htm
>>>> The University of St Andrews is a charity registered in Scotland : No
>>>> SC013532
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> --
>> José Hidasi Neto
>> Graduated in Biological Sciences - Universidade Federal de Goiás (UFG)
>> Master's candidate in Ecology and Evolution - Community Ecology and
>> Functioning Lab - UFG
>> Lattes:
>> http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4293841A0
>> Website: http://rfunctions.blogspot.com.br/
>> 
>> 
>> 
> 
> 
> -- 
> Klaus Schliep
> Phylogenomics Lab at the University of Vigo, Spain
> 

_____________________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
