Re: [R-sig-phylo] cleaning code in ape

Klaus Schliep Mon, 09 Feb 2026 07:47:36 -0800

Hi Emmanuel and Brian,

On Sun, Feb 8, 2026 at 5:22 PM Emmanuel Paradis <[email protected]>
wrote:


> Hi Brian,
>
> Thank you for your message. You point to the importance of
> maintenance, which is, I think, the second most important feature of
> open-source software (the first one being that it must do "something
> useful"). Coincidentally, I came across several papers in some
> bioinformatics journals reporting packages, programs, web resources, and
> others, published in the 2000s and 2010s: most of the URLs are defunct,
> or the packages are orphaned or have not been updated for some years. It
> would be interesting to evaluate this more precisely, but it seems that
> quite a lot of free software is "lost" after 10 years (it's a rough guess
> of course).
>
> Back to my first message, I find it difficult to decide what is "never used"
> in ape and could be safely (and usefully) removed. Maybe the best approach
> to this question is to open an issue on GH asking if somebody uses this or
> that feature before deciding to remove it (a message to r-sig-phylo could
> be useful too).
>

Reverse dependency checks can show which ape functions are called from
other packages, but of course this does not tell you which functions end
users call in their own scripts.
The r-universe site of ape (https://emmanuelparadis.r-universe.dev/ape)
mentions that it is used in over 18000 scripts. Following the link (
https://github.com/search?q=library%28ape%29&type=code) you get all the
scripts on GitHub that call library(ape). Might it be worth counting
how often each function is called in them?
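
As a rough sketch of what such counting could look like (base R only): parse
the script text and tabulate the function-call tokens. The name `count_calls`,
the toy `scripts` vector and the short `ape_funs` list are made up for
illustration; in practice the list of names could come from
`getNamespaceExports("ape")`. Static counting like this misses calls made via
do.call() and cannot tell apart functions with the same name from different
packages.

```r
# Count function calls in R source code using the parser's token table.
# `code` is a character vector of script sources; `funs` optionally
# restricts the count to a set of function names.
count_calls <- function(code, funs = NULL) {
  lines <- unlist(strsplit(code, "\n", fixed = TRUE))
  exprs <- parse(text = lines, keep.source = TRUE)
  pd <- utils::getParseData(exprs)
  calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]
  if (!is.null(funs)) calls <- calls[calls %in% funs]
  table(calls)
}

# Toy input (hypothetical scripts, not real search results)
scripts <- c(
  a = 'library(ape)\ntr <- read.tree("x.nwk")\nplot(tr)',
  b = 'tr <- ape::read.tree("y.nwk")\ntr2 <- nj(dist.dna(x))'
)
ape_funs <- c("read.tree", "dist.dna", "nj")  # illustrative subset only
counts <- count_calls(scripts, ape_funs)
print(counts)
```

The scripts are only parsed, never evaluated, so neither ape nor the input
files need to be present.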


> About compressing "DNAbin" objects, I made some tests with a simple scheme
> (storing only the polymorphic sites with the first sequence). This gives
> nice results with shallow divergences (the woodmouse data are a good
> example). Also this is not expected to require drastic changes in the
> internal code. I tried other schemes but the results were not so good.
> I've just posted some code on GH:
>
> https://github.com/emmanuelparadis/DNAbin_compression
>

Are you compressing sites here? In phangorn I compress phyDat objects by
storing only the unique sites (columns) of an alignment, together with the
position and weight of each site pattern. The compression is high if
there are few taxa but long sequences (e.g. the yeast data set in phangorn).
More importantly, this saves computation in ML or MP, where you
frequently loop over all sites. Here every site counts. There is a function
`phangorn:::compress.phyDat` to do this.
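
To illustrate the idea of site-pattern compression (a minimal base R sketch on
a plain character matrix, not the phangorn implementation; `compress_sites`
and the toy alignment are made up for this example):

```r
# Compress an alignment (taxa x sites character matrix) to its unique
# site patterns, keeping a weight per pattern and a site -> pattern index.
compress_sites <- function(aln) {
  key <- apply(aln, 2, paste, collapse = "")  # one string per column
  first <- !duplicated(key)                   # first occurrence of each pattern
  index <- match(key, key[first])             # pattern number of every site
  list(patterns = aln[, first, drop = FALSE],
       weight = tabulate(index, nbins = sum(first)),
       index = index)
}

aln <- rbind(t1 = c("a", "c", "a", "g", "a"),
             t2 = c("a", "c", "a", "t", "a"),
             t3 = c("a", "t", "a", "t", "a"))
cz <- compress_sites(aln)

# The original alignment can be restored from patterns and index
restored <- cz$patterns[, cz$index, drop = FALSE]
```

A per-site quantity such as a log-likelihood then only has to be computed once
per pattern and summed as sum(cz$weight * value_per_pattern).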

If you have alignments with many taxa (and maybe amino acids), the
compression will be very low. In this case storing differences to a
reference sequence might be much better. If one has thousands of similar
virus sequences (e.g. Covid), you might end up with many distinct site
patterns but only a few nucleotide changes between any two sequences. One
might also store an index of leading and trailing gaps/ambiguous states.
So it depends on the alignment which compression works best.
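
For comparison, storing differences to a reference could be sketched like this
(base R on a character matrix; the function names and the toy data are
hypothetical, and this is not the scheme in the DNAbin_compression repository):

```r
# Encode an alignment as one reference sequence plus, for every sequence,
# the positions and states where it differs from the reference.
diff_encode <- function(aln, ref = 1L) {
  r <- aln[ref, ]
  diffs <- lapply(seq_len(nrow(aln)), function(i) {
    pos <- which(aln[i, ] != r)
    list(pos = pos, state = aln[i, pos])
  })
  names(diffs) <- rownames(aln)
  list(ref = r, diffs = diffs)
}

# Rebuild the full character matrix from the encoded object.
diff_decode <- function(x) {
  t(vapply(x$diffs, function(d) {
    s <- x$ref
    s[d$pos] <- d$state
    s
  }, character(length(x$ref))))
}

aln <- rbind(s1 = strsplit("acgtacgtac", "")[[1]],
             s2 = strsplit("acgtacgtaa", "")[[1]],
             s3 = strsplit("aggtacgtac", "")[[1]])
enc <- diff_encode(aln)
dec <- diff_decode(enc)

# Number of states stored besides the reference sequence
n_stored <- sum(vapply(enc$diffs, function(d) length(d$pos), integer(1)))
```

Here only two states are stored in addition to the 10-character reference,
whereas site-pattern compression would gain nothing on the two variable sites.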

Additionally I added a function to give a summary about data, constant
sites, parsimony informative sites, unique site patterns, etc.
phangorn::glance(as.phyDat(woodmouse))
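
What such a summary counts can be sketched in base R as follows (an
illustration only, not the code behind glance(); `site_summary` is a made-up
name, and gaps and ambiguity codes are treated as ordinary states here):

```r
# Summarise a taxa x sites character matrix: number of sites, constant
# sites, parsimony informative sites and unique site patterns.
site_summary <- function(aln) {
  tabs <- lapply(seq_len(ncol(aln)), function(j) table(aln[, j]))
  constant <- vapply(tabs, function(tb) length(tb) == 1L, logical(1))
  # informative: at least two states, each present in at least two taxa
  informative <- vapply(tabs, function(tb) sum(tb >= 2L) >= 2L, logical(1))
  patterns <- unique(apply(aln, 2, paste, collapse = ""))
  c(sites = ncol(aln),
    constant = sum(constant),
    parsimony_informative = sum(informative),
    unique_patterns = length(patterns))
}

aln <- rbind(t1 = c("a", "a", "a", "a", "a"),
             t2 = c("a", "a", "c", "a", "a"),
             t3 = c("a", "c", "g", "a", "c"),
             t4 = c("a", "c", "t", "c", "c"))
res <- site_summary(aln)
print(res)
```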

Best,
Klaus

> Thanks for the references about phylogenetic compressing. I'll have a look
> at them.
>
> Best,
>
> Emmanuel
>
> ----- On 30 Jan 26, at 16:48, omeara brian [email protected] wrote:
>
> > Thanks for your work on this (and DECADES of work maintaining and
> > extending ape, plus all the work just keeping it on the CRAN treadmill).
> >
> > Some ideas:
> >
> > Ané and Sanderson (2005) could be a relevant paper for compression that
> > incorporates the benefits that come from phylogenetic relatedness:
> > https://doi.org/10.1080/10635150590905984 . Appendix A gets into the
> > algorithm itself.
> >
> > A different approach for handling large data, especially with files on
> > disk, is used by Arrow: https://arrow.apache.org/ . There's an associated
> > R package if you're willing to add on dependencies:
> > https://arrow.apache.org/docs/r/ .
> >
> > If I remember correctly, the EcoJulia team (https://github.com/EcoJulia)
> > or the JuliaPhylo team (https://juliaphylo.github.io/JuliaPhyloWebsite/)
> > is working on another way to store phylogenetic networks on disk. If
> > you're working on tree storage and I/O already, maybe it could be worth
> > coordinating with them on this in ape (and networks in general are cool).
> > If I remember correctly, there are already a couple of competing "eNewick"
> > formats and perhaps others, but I haven't dug into that in a bit. And, of
> > course, the required link whenever someone talks about standards:
> > https://xkcd.com/927/ .
> >
> > Thank you again,
> > Brian
> >
> > From: R-sig-phylo <[email protected]> on behalf of Emmanuel
> > Paradis <[email protected]>
> > Date: Friday, January 30, 2026 at 06:54:02
> > To: mailinglist R <[email protected]>
> > Subject: Re: [R-sig-phylo] cleaning code in ape
> >
> > Dear all,
> >
> > Thanks for the suggestions so far! Here are two things I have had in
> > mind for some time:
> >
> > 1) Compression of data objects (on the same model as the sparse matrices
> > in package Matrix and others). For instance, if you do:
> >
> > library(ape)
> > data(woodmouse)
> > alview(woodmouse[, seg.sites(woodmouse, strict=TRUE)])
> >
> > you can see that it's possible to store only the sites which are
> > different compared to the 1st sequence. That would compress the data by
> > more than 3 times, and the object could be analysed without uncompressing
> > it (base frequencies, distances, ...). There may be a way to do it in a
> > "smart" way (compressing the sequences sequentially depending on their
> > similarity).
> >
> > Something similar might be feasible for trees.
> >
> > 2) Sequential access to large files (with caching). In some situations,
> > it might be interesting to screen the sequences (and possibly drop some
> > of them) before alignment (I'm thinking about users working with viral
> > sequences). Biostrings (in Bioconductor) has this kind of functionality
> > but maybe it'd be nice to have this in ape too (and now that we've
> > re-introduced the function mafft() in ape that makes sense since MAFFT
> > performs well with big alignments).
> >
> > The same could be useful for tree files too (e.g., if someone has run a
> > very long MCMC run).
> >
> > Best,
> >
> > Emmanuel
> >
> > ----- On 29 Jan 26, at 10:24, Vojtěch Zeisek [email protected] wrote:
> >
> >> Hello
> >>
> >> On Tuesday, 27 January 2026 at 17:14:22 Central European Standard Time,
> >> you wrote:
> >>> Dear all,
> >>> here are a few explanations of the suggested changes and some
> >>> subjective comments.
> >>
> >> Thank You for the comments.
> >>
> >>> > this is a perfect idea. :-) I'd love to see one or two things:
> >>> > 1) Support for parallelization whenever possible (various distances,
> >>> > work with multi.whatever objects, ...) to speed things up.
> >>>
> >>> This is generally a good idea. The problem is that parallelization
> >>> depends on the hardware (cluster, multicore machine) of the user, the
> >>> operating system (usually easy on Linux, tricky on OSX and Windows) and
> >>> additionally whether you run R inside a GUI or from the console.
> >>> Additionally some matrix algebra code might already be parallelized.
> >>> This depends on the BLAS library you use, so if you use parallelization
> >>> on top you might slow down your computer.
> >>> This is why I started using the future and future.apply packages
> >>> in phangorn instead of mclapply. This puts the user in control to
> >>> choose the parallelization framework to use and I don't need to
> >>> check the operating system, number of cores, GUI etc.
> >>> Some low-level OpenMP stuff in C/C++ code might still be nice.
> >>
> >> Yeah, I know the parallelization is a difficult topic, and I don't know
> >> much about macOS and Windows. I have generally had a good experience
> >> with future.apply. In any case I think we agree that we should add
> >> parallelization whenever possible. :-)
> >> IMHO all functions handling multiPhylo, or producing any sort of matrix,
> >> indices etc. should have parallelization support.
> >>
> >>> > 2) Removal of all the *.multiPhylo functions, i.e. IMHO the
> >>> > best would be if every relevant function would support
> >>> > phylo as well as multiPhylo objects. Now it's a bit confusing
> >>> > when to use which function...
> >>>
> >>> I think there might be a misconception here. We introduced a lot of
> >>> generic functions to ape, e.g. root, is.rooted, is.ultrametric etc.
> >>> Now while there exist is.rooted.phylo() and is.rooted.multiPhylo(),
> >>> users should only need to use is.rooted(x) and don't need to care about
> >>> the phylo or multiPhylo versions. Maybe it is more a problem with the
> >>> documentation?
> >>
> >> That's a good point. Yeah, it might be a rather confusing feature of
> >> the documentation. From my experience, even more confusing is that you
> >> can run just plot() to plot a phylo object, but to get the respective
> >> help you must use ?plot.phylo...
> >> Sincerely,
> >> V.
> >>
> >>> > On Mon, Jan 26, 2026 at 9:31 AM Emmanuel Paradis wrote:
> >>> > > Dear all,
> >>> > > We are in the process of reviewing the "old" code in ape (some
> >>> > > written in 2001). Here are a few things that came out recently:
> >>> > > 1) During a recent discussion, we wondered if the option "..." of
> >>> > > read.tree() is useful; it is passed internally to scan(). A review
> >>> > > of the CRAN packages suggests this option is unused so it could
> >>> > > be removed, at least without breaking those packages. There
> >>> > > may be other bits of code that can be removed safely in other
> >>> > > functions.
> >>> > > 2) Printing of objects could be improved.
> >>> > > 3) I've (re)introduced a function mafft() in ape. A function with
> >>> > > the same name was formerly in ips which is now orphaned on CRAN.
> >>> > > 4) A review of the man pages (help) would be useful. For instance,
> >>> > > in ?read.tree one can read: "If there are two root edges (e.g.,
> >>> > > "(((A:1,B:1):10):10);"), then the tree is not read and an error
> >>> > > message is issued." [1] which is wrong since all types of Newick
> >>> > > tree can be read. There are certainly similar outdated statements
> >>> > > in the 300 pages of the manual.
> >>> > > 5) Klaus suggests having more functions return their "return
> >>> > > value" invisibly to make it easier to use pipe operators (|> or
> >>> > > %>%).
> >>> > > Any thoughts, ideas, or comments are welcome.
> >>> > > Best,
> >>> > > Emmanuel
> >>> > > [1] In version 5.8-1 currently on CRAN; now fixed on GitHub.
> >> --
> >> Vojtěch Zeisek
> >> https://trapa.cz/en/
> >>
> >> Department of Botany, Faculty of Science
> >> Charles University, Prague, Czech Republic
> >> https://botany.natur.cuni.cz/
> >>
> >> Institute of Botany, Czech Academy of Sciences
> >> Průhonice, Czech Republic
> >> https://www.ibot.cas.cz/en/
> >> Computing cluster
> >> https://sorbus.ibot.cas.cz/en/start
> >>
> >> _______________________________________________
> >> R-sig-phylo mailing list - [email protected]
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> >> Searchable archive at http://www.mail-archive.com/[email protected]/


-- 
Klaus Schliep

Senior Scientist
Institute of Molecular Biotechnology
TU Graz
https://www.imbt.tugraz.at
