Re: [R-sig-phylo] model averaging using brownie.lite

Cecile Ane Thu, 05 Sep 2019 09:08:29 -0700

Thanks Brian, great review, as always!

To add one bit: this paper looks at the effective sample size that should be 
used for BIC, in the standard BM model (univariate).
https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908053


It gives a formula that depends on the tree shape and branch lengths. As 
Brian said, a pectinate tree would generally have a smaller effective sample 
size than a symmetric tree, for the same number of taxa. The general formula 
uses matrices, but the result should be something less than the number of 
taxa, and greater than: (# branches stemming from the root) * (total tree 
height / length of shortest branch stemming from the root). The effective 
sample size should also be at least (total tree length / total tree height) 
for an ultrametric tree. See the end of section 2 for an example of BIC 
penalties using effective sample sizes.

The bottom line is the same as what Brian said:
- it’s generally unknown what “sample size” should be used
- in cases when we know, the answer is complicated (it depends on the tree and 
on the model).

With multivariate data (multiple sites), the effective sample size for 
univariate data (the number of taxa, or something smaller) should be multiplied 
by the number of sites, if the model assumes that sites are independent and 
share the same evolutionary parameters. This is consistent with what Brian said.
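
As a rough illustration of the multiplication described above, here is a minimal base-R sketch of a BIC penalty that uses (effective sample size per trait) x (number of sites) as n. All numbers, and the names `bic`, `n_eff_taxa`, `n_taxa` and `n_sites`, are hypothetical and chosen only for illustration; the effective sample size itself would have to come from the tree, for example via the formula in the paper linked above.

```r
## BIC = -2 logL + K * log(n), where n is the sample size used in the penalty
bic <- function(logL, K, n) -2 * logL + K * log(n)

n_taxa     <- 30    # hypothetical number of tips
n_eff_taxa <- 12    # hypothetical effective sample size for one trait
n_sites    <- 5     # hypothetical number of independent sites / traits
logL       <- -150  # hypothetical maximized log-likelihood
K          <- 3     # hypothetical number of free parameters

## Penalty based on the effective sample size, scaled by the number of sites
bic_eff   <- bic(logL, K, n_eff_taxa * n_sites)
## Penalty based on the raw number of taxa, scaled by the number of sites
bic_naive <- bic(logL, K, n_taxa * n_sites)

c(effective = bic_eff, naive = bic_naive)
```

Because the effective sample size is no larger than the number of taxa, the penalty per parameter is smaller than with the naive count, so richer models are penalized slightly less.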

Cécile

On Sep 5, 2019, at 9:55 AM, Brian O'Meara <omeara.br...@gmail.com> wrote:

Sample size is a weird thing in this area for AICc. For comparing DNA
models in something like ModelTest, number of sites is used, but for OU/BM
models, we typically use number of taxa. It's not resolved what's best.

Posada and Buckley (2004, https://doi.org/10.1080/10635150490522304) have a
discussion on this:

Both in the AICc and the BIC descriptions above, the total number of
characters was used as an estimate of sample size. However, effective
sample sizes in phylogenetic studies are poorly understood, and depend on
the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et
al., 2000). Characters in an alignment will often not be independent, so
using the total number of characters as a surrogate for sample size (Minin
et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using
only the number of variable sites as an estimate of sample size is a more
conservative approach, but could be an underestimate (note that all sites
are used when estimating base frequencies or the proportion of invariable
sites). Indeed, sample size also depends on the number of taxa.
Importantly, sample size can have an effect on the outcome of model
selection with the AICc. In our example above, if we were to use the number
of variable characters (301 sites) as the sample size, instead of the total
number of characters (1927 sites), the best AICc model would not change,
but the second and third AICc models would exchange their rankings.
Furthermore, because the LRT, the AIC, and the BIC strategies rely on large
sample asymptotics, it is also important to decide when a sample should be
considered small. Although the AICc was derived under Gaussian assumptions,
Burnham et al. (1994) found that this second order expression performed
well in product multinomial models for open population capture-recapture.
Burnham and Anderson (2003, p. 66) suggest using this correction when the
sample size is small compared to the number of adjustable parameters, n/K <
40. Alternatively, and because AICc converges to the AIC with increasing
n/K ratios, one could always use the AICc (D. Anderson, personal
communications). Phylogenetic characters are mostly discrete, and the
unconstrained model in phylogenetics is multinomial (Goldman, 1993). One
may think of an alignment of nucleotide characters as a large and sparse
contingency table with 4^T bins, where T is the number of taxa. For large
sample asymptotics to hold in a contingency table every cell should
contain, in general, more than 5 observations (see Agresti, 1990, p. 49,
244–250), which gives a rule of thumb of n/4^T > 5. Clearly, more research
is needed on sample size in phylogenetics.
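
To make the effect of the sample-size choice concrete, here is a small base-R sketch of the second-order term that separates AICc from AIC, evaluated at the two sample sizes mentioned in the quoted passage (1927 total sites and 301 variable sites). The parameter count `K = 10` and the function name `aicc_correction` are hypothetical; only the two site counts come from the quote.

```r
## Second-order (small-sample) term added to AIC to obtain AICc:
##   AICc = AIC + 2K(K+1) / (n - K - 1)
aicc_correction <- function(K, n) 2 * K * (K + 1) / (n - K - 1)

K     <- 10    # hypothetical number of free parameters
n_all <- 1927  # all sites (from the quoted example)
n_var <- 301   # variable sites only (from the quoted example)

## Size of the correction under each choice of sample size
corr_all <- aicc_correction(K, n_all)
corr_var <- aicc_correction(K, n_var)
c(all_sites = corr_all, variable_sites = corr_var)

## Rule of thumb from the quote: use AICc when n/K < 40
c(all_sites = n_all / K < 40, variable_sites = n_var / K < 40)
```

The correction grows as n shrinks, which is why rankings among closely matched models can change with the choice of n.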

Beaulieu et al. (2018, https://doi.org/10.1093/molbev/msy222; note my COI
as I'm an author on this) did some simulations on a codon model testing
different ways of counting sample size (number of sites, number of taxa,
number of sites * number of taxa, etc.) and found that number of cells in
the matrix (number of sites * number of taxa) seemed to work best to
approximate Kullback-Leibler distance. For univariate models like that used
in brownie.lite, number of cells is equal to number of taxa (since there's
only one column):

We note our use of AICc, as calculated in Burnham and Anderson (2002, p.
66) and as opposed to the standard AIC, in the above model comparisons. At
the outset of our study it was unclear what the appropriate sample size n
is when comparing models of sequence evolution. Building upon the work of
Jhwueng et al. (2014), our simulations suggest that using the number of
taxa times the number of sites as the sample size correction performed best
as a small sample size correction for estimating Kullback–Liebler (KL)
distance in phylogenetic models (Supporting Materials). This also has an
intuitive appeal. In models that have at least some parameters shared
across sites and some parameters shared across taxa, increasing the number
of sites and/or taxa should be adding more samples for the parameters to
estimate. This is consistent considering how likelihood is calculated for
phylogenetic models: the likelihood for a given site is the sum of the
probabilities of each observed state at each tip, which is then multiplied
across sites. It is arguable that the conventional approach in comparative
methods is calculating AICc in the same way. That is, if only one column of
data (or “site”) is examined, as remains remarkably common in comparative
methods, when we refer to sample size, it is technically the number of taxa
multiplied by number of sites, even though it is referred to simply as the
number of taxa.

I suspect this is still not a great approximation. Compare a balanced tree
(every internal node having two descendants) with every internal branch
length the same versus a pectinate (caterpillar) tree where the two edges
connecting to the root node are very long and the other edges are all near
zero. For the same number of taxa and same number of sites, I bet the first
tree has more meaningful data: the pectinate tree with those branch lengths
will likely have all but one of the taxa having nearly identical states. So
I think tree shape and branch lengths should matter for this. I've done
some preliminary analyses on this, building on Beaulieu et al. (2018) and
Jhwueng et al. (2014, https://doi.org/10.1515/sagmb-2013-0048, also note
COI), but nothing definitive yet.
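
The thought experiment above can be sketched in base R by writing out the Brownian-motion covariance matrix for two small four-taxon trees, using the standard fact that the covariance between two tips is proportional to the length of the path they share from the root. The tip labels, the branch lengths, and the value of `eps` are all hypothetical choices for illustration.

```r
## Under BM, cov(tip i, tip j) is proportional to their shared path from the root.
tips <- c("A", "B", "C", "D")

## Balanced tree ((A,B),(C,D)) with every branch of length 1
V_bal <- matrix(c(2, 1, 0, 0,
                  1, 2, 0, 0,
                  0, 0, 2, 1,
                  0, 0, 1, 2),
                nrow = 4, byrow = TRUE, dimnames = list(tips, tips))

## Pectinate tree (A,(B,(C,D))): the two edges at the root have length 1,
## every other edge has length eps (near zero)
eps <- 0.001
V_pec <- matrix(c(1, 0,       0,           0,
                  0, 1 + eps, 1,           1,
                  0, 1,       1 + 2 * eps, 1 + eps,
                  0, 1,       1 + eps,     1 + 2 * eps),
                nrow = 4, byrow = TRUE, dimnames = list(tips, tips))

## Expected correlations among tip values under each tree
cor_bal <- cov2cor(V_bal)
cor_pec <- cov2cor(V_pec)
round(cor_bal, 3)
round(cor_pec, 3)
```

In the pectinate case B, C and D are almost perfectly correlated, so they behave much like a single observation, whereas the balanced tree has no pairwise correlation above 0.5.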

It's also worth looking at Ho and Ané (2014,
https://doi.org/10.1111/2041-210X.12285) who talk about AIC in the context
of OU shifts, but who get into sample size with shifts in a modified BIC
that uses taxa in different regimes as sample size (but again, univariate,
so maybe it's actually matrix size).

I am also probably missing important work by others -- my apologies if so.
If you know of any, please let me know (and probably Karla, too!).

So, in summary... yeah, what Liam said: number of taxa, but it might be
more complex.

Best,
Brian

_______________________________________________________________________
Brian O'Meara, http://brianomeara.info, especially Calendar
<http://brianomeara.info/calendar.html>, CV
<http://brianomeara.info/cv.html>, and Feedback
<http://brianomeara.info/feedback.html>

Professor, Dept. of Ecology & Evolutionary Biology, UT Knoxville
Associate Head, Dept. of Ecology & Evolutionary Biology, UT Knoxville
He/Him/His



On Thu, Sep 5, 2019 at 10:00 AM Liam Revell <liam.rev...@umb.edu> wrote:

Dear Karla.

In my opinion, it is probably correct to use the number of tips on the
tree as the sample size for AICc when estimating the Brownian rate, since
the number of independent pieces of information is n-1, just like with
an ordinary variance. For other parameters in phylogenetic comparative
analyses, however, the effective sample size may be different.

All the best, Liam

Liam J. Revell
Associate Professor, University of Massachusetts Boston
Profesor Asistente, Universidad Católica de la Ssma Concepción
web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org

Academic Director UMass Boston Chile Abroad (starting 2019):
https://www.umb.edu/academics/caps/international/biology_chile

On 9/5/2019 9:49 AM, Karla Shikev wrote:

Thanks so much, Liam! Just one quick follow-up question: what do you
suggest should be the sample size for transforming AIC into AICc? The
number of tips on the tree?

Karla

On Thu, Sep 5, 2019 at 10:27 AM Liam Revell <liam.rev...@umb.edu> wrote:

Dear Karla.

You could try & create your own logLik method for the object class
"brownie.lite" as follows:

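## load phytools, which provides brownie.lite()
library(phytools)
## logLik method for objects of class "brownie.lite": returns both
## log-likelihoods, with the parameter counts stored as attribute "df"
logLik.brownie.lite<-function(object,...){
        lik<-setNames(
                c(object$logL1,object$logL.multiple),
                c("single-rate","multi-rate"))
        attr(lik,"df")<-c(object$k1,object$k2)
        lik
}
## fit model (tree: tree with mapped regimes; x: named trait vector)
fit<-brownie.lite(tree,x)
## use it: log-likelihoods, then AIC for each of the two models
logLik(fit)
AIC(fit)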
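
Building on the method above, here is a minimal base-R sketch of how AICc, Akaike weights, and model-averaged rates might be computed from such a fit. It does not call phytools: `fit` is a mock list standing in for a "brownie.lite" object, with made-up values. The elements `logL1`, `k1`, `logL.multiple` and `k2` are those used in the method above; `sig2.single` and `sig2.multiple` are assumed here to hold the fitted rates and should be checked against the phytools documentation. The number of tips is used as the sample size `n`, and `aicc` is a helper defined only for this example.

```r
## Mock stand-in for a fitted "brownie.lite" object (all values are made up)
fit <- list(
  logL1 = -12.4, k1 = 2, sig2.single = 0.80,
  logL.multiple = -10.9, k2 = 3,
  sig2.multiple = c(slow = 0.35, fast = 1.60)
)
n <- 40  # hypothetical number of tips, used as the sample size

## AICc = -2 logL + 2k + 2k(k+1) / (n - k - 1)
aicc <- function(logL, k, n) -2 * logL + 2 * k + 2 * k * (k + 1) / (n - k - 1)

scores <- c(single = aicc(fit$logL1, fit$k1, n),
            multi  = aicc(fit$logL.multiple, fit$k2, n))

## Akaike weights: relative support for each model
delta <- scores - min(scores)
w <- exp(-delta / 2) / sum(exp(-delta / 2))

## Model-averaged rate for each regime: the single-rate model contributes
## the same rate to every regime, the multi-rate model its regime-specific rate
avg_rate <- w[["single"]] * fit$sig2.single + w[["multi"]] * fit$sig2.multiple

scores
w
avg_rate
```

With a real fit, the mock list would be replaced by the object returned by brownie.lite and `n` by the number of tips in the tree.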

All the best, Liam

Liam J. Revell
Associate Professor, University of Massachusetts Boston
Profesor Asistente, Universidad Católica de la Ssma Concepción
web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org

Academic Director UMass Boston Chile Abroad (starting 2019):
https://www.umb.edu/academics/caps/international/biology_chile

On 9/5/2019 9:13 AM, Karla Shikev wrote:

Dear all,

I've been trying to use brownie.lite to implement the tutorial available
here (http://treethinkers.org/tutorials/morphological-evolution-in-r/) to
calculate model-averaged rates of evolution and for model selection (1
versus 2 rates). However, the current version of phytools (0.6-99) won't
produce AICc estimates. Does anyone know a way around this? Any help would
be greatly appreciated.

thanks a bunch,

Karla

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
