Re: [R-sig-phylo] cleaning code in ape

Emmanuel Paradis Sun, 08 Feb 2026 08:22:26 -0800

Hi Brian,

Thank you for your message. You point out the importance of maintenance, 
which is, I think, the second most important feature of open-source software 
(the first one being that it must do "something useful"). Coincidentally, I 
came across several papers in some bioinformatics journals reporting packages, 
programs, web resources, and others, published in the 2000s and 2010s: most of 
the URLs are defunct, or the packages are orphaned or have not been updated for 
some years. It would be interesting to evaluate this more precisely, but it 
seems that quite a lot of free software is "lost" after 10 years (it's a rough 
guess of course).


Back to my first message, I find it difficult to decide what is "never used" in 
ape and could be safely (and usefully) removed. Maybe the best approach to this 
question is to open an issue on GH asking if somebody uses this or that feature 
before deciding to remove it (a message to r-sig-phylo could be useful too).

About compressing "DNAbin" objects, I made some tests with a simple scheme 
(storing only the polymorphic sites along with the first sequence). This gives 
nice results with shallow divergences (the woodmouse data are a good example). 
Also, this is not expected to require drastic changes in the internal code. I 
tried other schemes but the results were not so good. I've just posted some 
code on GH:

https://github.com/emmanuelparadis/DNAbin_compression
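
For illustration, here is a minimal sketch of the "polymorphic sites plus first sequence" idea on a toy character matrix. It is not the code in the repository above: compress_vs_first() and decompress_vs_first() are hypothetical names, the data are made up, and a real "DNAbin" object stores raw bytes rather than characters.

```r
# Toy alignment (rows = sequences, columns = sites); hypothetical data.
aln <- rbind(
  s1 = c("a", "c", "g", "t", "a", "c"),
  s2 = c("a", "c", "g", "t", "a", "t"),
  s3 = c("a", "t", "g", "t", "a", "c")
)

# Hypothetical helper: keep the first sequence in full, plus the columns
# where at least one sequence differs from the first one.
compress_vs_first <- function(x) {
  poly <- which(apply(x, 2, function(site) any(site != site[1])))
  list(ref = x[1, ],
       sites = poly,
       variants = x[, poly, drop = FALSE],
       labels = rownames(x))
}

# Hypothetical helper: rebuild the full matrix by repeating the first
# sequence on every row and writing the stored columns back in place.
decompress_vs_first <- function(z) {
  x <- matrix(z$ref, nrow = length(z$labels), ncol = length(z$ref),
              byrow = TRUE, dimnames = list(z$labels, NULL))
  x[, z$sites] <- z$variants
  x
}

z <- compress_vs_first(aln)
z$sites                        # positions of the stored sites
back <- decompress_vs_first(z)
all(back == aln)               # the round trip should be lossless
```

With shallow divergences most columns are dropped, which is where the saving comes from; with deep divergences nearly every site is stored and the scheme gains little.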

Thanks for the references about phylogenetic compression. I'll have a look at 
them.

Best,

Emmanuel

----- On 30 Jan 26, at 16:48, omeara brian [email protected] wrote:

> Thanks for your work on this (and DECADES of work maintaining and extending
> ape, plus all the work just keeping it on the CRAN treadmill).
> 
> Some ideas:
> 
> Ané and Sanderson (2005) could be a relevant paper for compression that
> incorporates the benefits that come from phylogenetic relatedness:
> https://doi.org/10.1080/10635150590905984 . Appendix A gets into the algorithm
> itself.
> 
> A different approach for handling large data, especially with files on disk,
> is used by Arrow: https://arrow.apache.org/ . There's an associated R package
> if you're willing to add on dependencies: https://arrow.apache.org/docs/r/ .
> 
> If I remember correctly, the EcoJulia team (https://github.com/EcoJulia) or
> the JuliaPhylo team (https://juliaphylo.github.io/JuliaPhyloWebsite/) is
> working on another way to store phylogenetic networks on disk. If you're
> working on tree storage and io already, maybe it could be worth coordinating
> with them on this in ape (and networks in general are cool). If I remember
> correctly, there are already a couple of competing "eNewick" formats and
> perhaps others, but I haven't dug into that in a bit. And, of course, the
> required link whenever someone talks about standards: https://xkcd.com/927/ .
> 
> Thank you again,
> Brian
> 
> From: R-sig-phylo <[email protected]> on behalf of Emmanuel
> Paradis <[email protected]>
> Date: Friday, January 30, 2026 at 06:54:02
> To: mailinglist R <[email protected]>
> Subject: Re: [R-sig-phylo] cleaning code in ape
> 
> Dear all,
> 
> Thanks for the suggestions so far! Here are two things I have had in mind for
> some time:
> 
> 1) Compression of data objects (on the same model as the sparse matrices in
> package Matrix and others). For instance, if you do:
> 
> library(ape)
> data(woodmouse)
> alview(woodmouse[, seg.sites(woodmouse, strict=TRUE)])
> 
> you can see that it's possible to store only the sites which are different
> compared to the 1st sequence. That would compress the data by more than 3
> times, and the object could be analysed without uncompressing it (base
> frequencies, distances, ...) There may be a way to do it in a "smart" way
> (compressing the sequences sequentially depending on their similarity).
> 
> Something similar might be feasible for trees.
> 
> 2) Sequential access to large files (with caching). In some situations, it
> might be interesting to screen the sequences (and possibly drop some of them)
> before alignment (I'm thinking about users working with viral sequences).
> Biostrings (in Bioconductor) has this kind of functionality but maybe it'd be
> nice to have this in ape too (and now that we've re-introduced the function
> mafft() in ape that makes sense since MAFFT performs well with big
> alignments).
> 
> The same could be useful for tree files too (e.g., if someone has run a very
> long MCMC run).
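
As a rough illustration of the screening step in point 2, the sketch below reads a FASTA file a few lines at a time through an open connection and keeps only the labels of sequences that are long enough. It uses base R only; screen_fasta() is a hypothetical name (it is not a function in ape or Biostrings), the data are made up, and caching is left out entirely.

```r
# Write a small FASTA file to a temporary location (toy data).
fasta <- tempfile(fileext = ".fas")
writeLines(c(">seq1", "ACGTACGT",
             ">seq2", "ACG",
             ">seq3", "ACGTAC", "GTAA"), fasta)

# Hypothetical helper: read `chunk` lines at a time and return the labels
# of the sequences with at least `min_len` characters.
screen_fasta <- function(path, min_len, chunk = 3L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  keep <- character()
  label <- NULL
  len <- 0L
  # Record the sequence read so far if it passes the length filter.
  record <- function() {
    if (!is.null(label) && len >= min_len) keep <<- c(keep, label)
  }
  repeat {
    lines <- readLines(con, n = chunk)  # continues where the last read stopped
    if (length(lines) == 0L) break
    for (ln in lines) {
      if (startsWith(ln, ">")) {
        record()                        # close the previous sequence
        label <- sub("^>", "", ln)
        len <- 0L
      } else {
        len <- len + nchar(ln)          # sequences may span several lines
      }
    }
  }
  record()                              # do not forget the last sequence
  keep
}

long_enough <- screen_fasta(fasta, min_len = 5)
long_enough
```

Only one chunk of lines is held in memory at any time, so the file size does not matter much; the same loop could in principle be adapted to tree files with one tree per line.
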
> 
> Best,
> 
> Emmanuel
> 
> ----- On 29 Jan 26, at 10:24, Vojtěch Zeisek [email protected] wrote:
> 
>> Hello
>>
>> On Tuesday, 27 January 2026, 17:14:22 Central European Standard Time, you
>> wrote:
>>> Dear all,
>>> here are a few explanations of the suggested changes and some
>>> subjective comments.
>>
>> Thank you for the comments.
>>
>>> > this is a perfect idea. :-) I'd love to see one or two things:
>>> > 1) Support for parallelization whenever possible (various distance,
>>> > work with multi.whatever objects, ...) to speed things up.
>>>
>>> This is generally a good idea. The problem is that parallelization depends
>>> on the hardware (cluster, multicore machine) of the user, the operating
>>> system (usually easy on Linux, tricky on OSX and Windows) and
>>> additionally whether you run R inside a GUI or from the console.
>>> Additionally some matrix algebra code might already be parallelized.
>>> This depends on the BLAS library you use, so if you use parallelization
>>> on top you might slow down your computer.
>>> This is why I started using the future and future.apply packages
>>> in phangorn instead of mclapply. This puts the user in control to
>>> choose the parallelization framework to use and I don't need to
>>> check the operating system, number of cores, GUI etc.
>>> Some low-level OpenMP stuff in C/C++ code might still be nice.
>>
>> Yeah, I know parallelization is a difficult topic, and I don't know much
>> about macOS and Windows. I have generally had a good experience with
>> future.apply. In any case I think we agree that we should add parallelization
>> whenever possible. :-)
>> IMHO all functions handling multiPhylo, or producing any sort of matrix,
>> indices etc. should have parallelization support.
>>
>>> > 2) Removal of all the *.multiPhylo functions, i.e. IMHO the
>>> > best would be if every relevant function supported
>>> > phylo as well as multiPhylo objects. Now it's a bit confusing
>>> > when to use which function...
>>>
>>> I think there might be a misconception here. We introduced a lot of
>>> generic functions to ape, e.g. root, is.rooted, is.ultrametric etc.
>>> Now while there exist is.rooted.phylo() and is.rooted.multiPhylo(), users
>>> should only need to use is.rooted(x) and don't need to care about the phylo
>>> or multiPhylo versions. Maybe it is more a problem with the documentation?
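
The generic/method pattern described above can be sketched in base R as follows. The classes "toyTree" and "multiToyTree" and the generic n_tips() are hypothetical stand-ins for illustration, not ape's actual classes or functions.

```r
# Toy objects standing in for single trees and lists of trees (hypothetical).
tr1 <- structure(list(tip.label = c("A", "B", "C")), class = "toyTree")
tr2 <- structure(list(tip.label = c("A", "B")), class = "toyTree")
trs <- structure(list(tr1, tr2), class = "multiToyTree")

# The generic is the only function users need to call.
n_tips <- function(x, ...) UseMethod("n_tips")

# One method per class; R picks the right one from class(x).
n_tips.toyTree <- function(x, ...) length(x$tip.label)
n_tips.multiToyTree <- function(x, ...) {
  vapply(unclass(x), n_tips, integer(1))
}

one <- n_tips(tr1)   # dispatches to n_tips.toyTree
many <- n_tips(trs)  # dispatches to n_tips.multiToyTree
```

The help pages follow the same split: methods("plot") lists the available plot methods, and the page for a given method is reached by its full name, e.g. ?plot.phylo.
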
>>
>> That's a good point. Yeah, it might be a rather confusing feature of the
>> documentation. From my experience, even more confusing is that you can
>> run just plot() to plot a phylo object, but to get the respective help you must
>> use ?plot.phylo...
>> Sincerely,
>> V.
>>
>>> > On Mon, Jan 26, 2026 at 9:31 AM Emmanuel Paradis wrote:
>>> > > Dear all,
>>> > > We are in the process of reviewing the "old" code in ape (some
>>> > > written in 2001). Here are a few things that came out recently:
>>> > > 1) During a recent discussion, we wondered if the option "..." of
>>> > > read.tree() is useful; it is passed internally to scan(). A review of
>>> > > the CRAN packages suggests this option is useless so it could
>>> > > be removed, at least without breaking those packages. There
>>> > > may be other bits of code that can be removed safely in other functions.
>>> > > 2) Printing of objects could be improved.
>>> > > 3) I've (re)introduced a function mafft() in ape. A function with the
>>> > > same name was formerly in ips which is now orphaned on CRAN.
>>> > > 4) A review of the man pages (help) would be useful. For instance,
>>> > > in ?read.tree one can read: "If there are two root edges (e.g.,
>>> > > "(((A:1,B:1):10):10);"), then the tree is not read and an error message
>>> > > is issued." [1] which is wrong since all types of Newick tree can be
>>> > > read. There are certainly similar outdated statements in the 300
>>> > > pages of the manual.
>>> > > 5) Klaus suggests having more functions return their "return
>>> > > value" invisibly to make the use of pipe operators (|> or %>%) easier.
>>> > > Any thoughts, ideas, or comments are welcome.
>>> > > Best,
>>> > > Emmanuel
>>> > > [1] In version 5.8-1 currently on CRAN; now fixed on GitHub.
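
A minimal sketch of what point 5 could look like in practice: a function called mainly for its side effect returns its argument invisibly, so it can sit in the middle of a pipe without cluttering the console. describe() is a hypothetical function, not part of ape, and the native pipe |> assumes R >= 4.1.

```r
# Hypothetical function: print a one-line summary, then hand the input back
# invisibly so that the next step of a pipe can use it.
describe <- function(x) {
  cat("object of length", length(x), "\n")
  invisible(x)
}

# At top level only the summary line is printed; the value is not echoed.
describe(c(3, 1, 2))

# In a pipe, the value flows on to the next function.
res <- c(3, 1, 2) |> describe() |> sort()
res
```

The same applies with %>% from magrittr; invisibility only affects auto-printing, not the value that is passed along.
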
>> --
>> Vojtěch Zeisek
>> https://trapa.cz/en/
>>
>> Department of Botany, Faculty of Science
>> Charles University, Prague, Czech Republic
>> https://botany.natur.cuni.cz/
>>
>> Institute of Botany, Czech Academy of Sciences
>> Průhonice, Czech Republic
>> https://www.ibot.cas.cz/en/
>> Computing cluster
>> https://sorbus.ibot.cas.cz/en/start
>>
>> _______________________________________________
>> R-sig-phylo mailing list - [email protected]
>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>> Searchable archive at http://www.mail-archive.com/[email protected]/
> 

