Re: [R-sig-phylo] cleaning code in ape

Klaus Schliep Mon, 02 Feb 2026 02:19:44 -0800

Dear all,

just a few more comments from my side. It might be worth thinking quite a
bit about the user experience.


1. With regard to compression of data objects, a year ago I probably would
have said "just buy more RAM". In a sense the VCF format does something
similar. Depending on the implementation, it could not only save some
memory but also speed up some algorithms (distances, base.freq, etc.).
And for the user one could hide most of the details, which is nice.
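
A minimal sketch of the compression idea, in base R only. It assumes the
alignment is available as a character matrix with one row per sequence (the
form that as.character() returns for a DNAbin matrix). The helper names
compress_to_ref() and decompress_from_ref() are hypothetical and not part of
ape; an actual implementation would probably work on the raw DNAbin encoding
instead.

```r
# Hypothetical helpers (not part of ape): keep the first sequence in full
# and, for every other sequence, only the sites that differ from it.
compress_to_ref <- function(x) {
  stopifnot(is.matrix(x), nrow(x) >= 1L)
  ref <- unname(x[1L, ])
  others <- seq_len(nrow(x))[-1L]
  diffs <- lapply(others, function(i) {
    pos <- which(x[i, ] != ref)            # sites differing from sequence 1
    list(pos = pos, base = unname(x[i, pos]))
  })
  list(ref = ref, labels = rownames(x), diffs = diffs)
}

# Rebuild the full character matrix from the compressed representation.
decompress_from_ref <- function(z) {
  n <- length(z$diffs) + 1L
  m <- matrix(z$ref, nrow = n, ncol = length(z$ref), byrow = TRUE)
  for (i in seq_along(z$diffs)) {
    d <- z$diffs[[i]]
    m[i + 1L, d$pos] <- d$base
  }
  rownames(m) <- z$labels
  m
}

# Toy alignment (made-up data): 3 sequences, 8 sites
aln <- rbind(s1 = strsplit("acgtacgt", "")[[1]],
             s2 = strsplit("acgtacga", "")[[1]],
             s3 = strsplit("ccgtacgt", "")[[1]])
z <- compress_to_ref(aln)
stopifnot(identical(decompress_from_ref(z), aln))  # lossless round trip
```

Summary statistics such as base frequencies could in principle be computed
from the reference plus the stored differences, without rebuilding the full
matrix.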

2. Sequential access to large files. From my side this would be more
relevant. As I understand it, the main purpose is filtering/subsetting
data or getting some basic summary statistics. I would like to have a
dplyr-inspired syntax that can work on DNAbin/AAbin objects, multiPhylo
objects, or on a file directly.
Several times I have written custom scripts that open connections and
subset alignments or trees. I think the syntax can be much more compact
and easier to understand using some kind of pipes.
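
To illustrate the kind of custom script meant here, a rough base R sketch
that streams a FASTA file through a connection and keeps only the records
passing a user-supplied predicate. The function filter_fasta() and its
arguments are hypothetical (nothing like this is claimed to exist in ape or
phangorn), and the chunk size is an arbitrary choice.

```r
# Hypothetical helper: read a FASTA file chunk by chunk and write the records
# for which keep(header, sequence) is TRUE to a second file.
filter_fasta <- function(infile, outfile, keep, chunk = 1000L) {
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  header <- NULL
  seq_parts <- character()
  n_kept <- 0L
  # Write the record collected so far, if there is one and it passes keep().
  flush_record <- function() {
    if (!is.null(header)) {
      s <- paste(seq_parts, collapse = "")
      if (keep(header, s)) {
        writeLines(c(paste0(">", header), s), con_out)
        n_kept <<- n_kept + 1L
      }
    }
  }
  repeat {
    lines <- readLines(con_in, n = chunk)   # only one chunk in memory
    if (length(lines) == 0L) break
    for (ln in lines) {
      if (startsWith(ln, ">")) {
        flush_record()
        header <- substring(ln, 2L)
        seq_parts <- character()
      } else if (nzchar(ln)) {
        seq_parts <- c(seq_parts, ln)
      }
    }
  }
  flush_record()
  invisible(n_kept)                         # number of records written
}

# Usage with a small made-up file: keep sequences of at least 6 bases.
tmp_in <- tempfile(fileext = ".fas")
tmp_out <- tempfile(fileext = ".fas")
writeLines(c(">a", "ACGT", "ACGT", ">b", "AC", ">c", "ACGTAC"), tmp_in)
n <- filter_fasta(tmp_in, tmp_out, function(h, s) nchar(s) >= 6)
kept <- readLines(tmp_out)
```

Because the file is the first argument and the result is returned invisibly,
a function of this shape could be chained with |> or %>%.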

3. To Brian:
To promote some of my recent changes to ape: we added better support for the
extended Newick format (especially the "rich Newick" format) to the
development version of ape. So ape should now understand most dialects of
"extended" Newick, and these changes should ease handling networks. This
complements improvements (with Claudia Solis-Lemus and others) to the
tanggle package from Bioconductor.

4. And last but not least, I'm going to release a new version (3.0) of
phangorn soonish. Please feel free to explore the package (
https://github.com/KlausVigo/phangorn) and give feedback. Improvements to
vignettes and man pages are especially welcome.
Should I switch the default import/export format from PHYLIP to FASTA, even
if this might break some code/packages?

Regards,
Klaus



On Fri, Jan 30, 2026 at 4:50 PM Brian O'Meara <[email protected]>
wrote:

> Thanks for your work on this (and DECADES of work maintaining and
> extending ape, plus all the work just keeping it on the CRAN treadmill).
>
> Some ideas:
>
> Ané and Sanderson (2005) could be a relevant paper for compression that
> incorporates the benefits that come from phylogenetic relatedness:
> https://doi.org/10.1080/10635150590905984 . Appendix A gets into the
> algorithm itself.
>
> A different approach for handling large data, especially with files on
> disk, is used by Arrow: https://arrow.apache.org/ . There's an associated
> R package if you're willing to add on dependencies:
> https://arrow.apache.org/docs/r/ .
>
> If I remember correctly, the EcoJulia team (https://github.com/EcoJulia)
> or the JuliaPhylo team (https://juliaphylo.github.io/JuliaPhyloWebsite/)
> is working on another way to store phylogenetic networks on disk. If you're
> working on tree storage and io already, maybe it could be worth
> coordinating with them on this in ape (and networks in general are cool).
> If I remember correctly, there are already a couple of competing "eNewick"
> formats and perhaps others, but I haven't dug into that in a bit. And, of
> course, the required link whenever someone talks about standards:
> https://xkcd.com/927/ .
>
> Thank you again,
> Brian
>
> From: R-sig-phylo <[email protected]> on behalf of
> Emmanuel Paradis <[email protected]>
> Date: Friday, January 30, 2026 at 06:54:02
> To: mailinglist R <[email protected]>
> Subject: Re: [R-sig-phylo] cleaning code in ape
>
> Dear all,
>
> Thanks for the suggestions so far! Here are two things I have had in mind
> for some time:
>
> 1) Compression of data objects (on the same model as the sparse matrices
> in package Matrix and others). For instance, if you do:
>
> library(ape)
> data(woodmouse)
> alview(woodmouse[, seg.sites(woodmouse, strict=TRUE)])
>
> you can see that it's possible to store only the sites which are different
> compared to the 1st sequence. That would compress the data by a factor of
> more than 3, and the object could be analysed without uncompressing it (base
> frequencies, distances, ...). There may be a way to do it in a "smart" way
> (compressing the sequences sequentially depending on their similarity).
>
> Something similar might be feasible for trees.
>
> 2) Sequential access to large files (with caching). In some situations, it
> might be interesting to screen the sequences (and possibly drop some of
> them) before alignment (I'm thinking about users working with viral
> sequences). Biostrings (in Bioconductor) has this kind of functionality but
> maybe it'd be nice to have this in ape too (and now that we've re-introduced
> the function mafft() in ape that makes sense since MAFFT performs well with
> big alignments).
>
> The same could be useful for tree files too (eg, if someone has run a very
> long MCMC run).
>
> Best,
>
> Emmanuel
>
> ----- On 29 Jan 26, at 10:24, Vojtěch Zeisek [email protected] wrote:
>
> > Hello
> >
> > On Tuesday, 27 January 2026 at 17:14:22 Central European Standard Time,
> > you wrote:
> >> Dear all,
> >> here are a few explanations of the suggested changes and some
> >> subjective comments.
> >
> > Thank You for the comments.
> >
> >> > this is a perfect idea. :-) I'd love to see one or two things:
> >> > 1) Support for parallelization whenever possible (various distance,
> >> > work with multi.whatever objects, ...) to speed things up.
> >>
> >> This is generally a good idea. The problem is that parallelization
> >> depends on the hardware (cluster, multicore machine) of the user, the
> >> operating
> >> system (usually easy on Linux, tricky on OSX and Windows) and
> >> additionally whether you run R inside a GUI or from the console.
> >> Additionally some matrix algebra code might already be parallelized.
> >> This depends on the BLAS library you use, so if you use parallelization
> >> on top you might slow down your computer.
> >> This is why I started using the future and future.apply packages
> >> in phangorn instead of mclapply. This puts the user in control to
> >> choose the parallelization framework to use and I don't need to
> >> check the operating system, number of cores, GUI etc.
> >> Some low-level OpenMP code in C/C++ might still be nice.
> >
> > Yeah, I know parallelization is a difficult topic, and I don't know much
> > about macOS and Windows. I have generally had a good experience with
> > future.apply. In any case I think we agree that we should add
> > parallelization whenever possible. :-)
> > IMHO all functions handling multiPhylo, or producing any sort of matrix,
> > indices etc. should have parallelization support.
> >
> >> > 2) Removal of all the *.multiPhylo functions, i.e. IMHO the
> >> > best would be if every relevant function would support
> >> > phylo as well as multiPhylo objects. Now it's a bit confusing
> >> > when to use which function...
> >>
> >> I think there might be a misconception here. We introduced a lot of
> >> generic functions to ape, e.g. root, is.rooted, is.ultrametric etc.
> >> Now while there exist is.rooted.phylo() and is.rooted.multiPhylo(),
> >> users should only need to use is.rooted(x) and don't need to care about
> >> the phylo or multiPhylo versions. Maybe it is more a problem with the
> >> documentation?
> >
> > That's a good point. Yeah, it might be a rather confusing feature of the
> > documentation. From my experience, it is even more confusing that you can
> > run just plot() to plot a phylo object, but to get the respective help you
> > must use ?plot.phylo...
> > Sincerely,
> > V.
> >
> >> > On Mon, Jan 26, 2026 at 9:31 AM Emmanuel Paradis wrote:
> >> > > Dear all,
> >> > > We are in the process of reviewing the "old" code in ape (some
> >> > > written in 2001). Here are a few things that came out recently:
> >> > > 1) During a recent discussion, we wondered if the option "..." of
> >> > > read.tree() is useful; it is passed internally to scan(). A review
> >> > > of the CRAN packages suggests this option is useless so it could
> >> > > be removed, at least without breaking those packages. There
> >> > > may be other bits of code that can be removed safely in other
> >> > > functions.
> >> > > 2) Printing of objects could be improved.
> >> > > 3) I've (re)introduced a function mafft() in ape. A function with
> >> > > the same name was formerly in ips which is now orphaned on CRAN.
> >> > > 4) A review of the man pages (help) would be useful. For instance,
> >> > > in ?read.tree one can read: "If there are two root edges (e.g.,
> >> > > "(((A:1,B:1):10):10);"), then the tree is not read and an error
> >> > > message is issued." [1] which is wrong since all types of Newick tree
> >> > > can be read. There are certainly similar outdated statements in the
> >> > > 300 pages of the manual.
> >> > > 5) Klaus suggests having more functions return their "return
> >> > > value" invisibly to make the use of pipe operators (|> or %>%)
> >> > > easier.
> >> > > Any thoughts, ideas, or comments are welcome.
> >> > > Best,
> >> > > Emmanuel
> >> > > [1] In version 5.8-1 currently on CRAN; now fixed on GitHub.
> > --
> > Vojtěch Zeisek
> > https://trapa.cz/en/
> >
> > Department of Botany, Faculty of Science
> > Charles University, Prague, Czech Republic
> > https://botany.natur.cuni.cz/
> >
> > Institute of Botany, Czech Academy of Sciences
> > Průhonice, Czech Republic
> > https://www.ibot.cas.cz/en/
> > Computing cluster
> > https://sorbus.ibot.cas.cz/en/start
> >
> > _______________________________________________
> > R-sig-phylo mailing list - [email protected]
> > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> > Searchable archive at http://www.mail-archive.com/[email protected]/
>


-- 
Klaus Schliep

Senior Scientist
Institute of Molecular Biotechnology
TU Graz
https://www.imbt.tugraz.at
