Thanks Brian, great review, as always! To add one bit: this paper looks at the effective sample size that should be used for BIC, in the standard BM model (univariate). https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908053
It gives a formula that depends on the tree shape and branch lengths. As Brian said, a pectinate tree would generally have a smaller effective sample size than a symmetric tree, for the same number of taxa. The general formula uses matrices, but the result should be something less than the number of taxa, and greater than: # branches stemming from the root * ratio (total tree height / length of shortest branch stemming from the root). The effective sample size should also be at least (total tree length / total tree height) for an ultrametric tree. See the end of section 2 for an example of BIC penalties using effective sample sizes.

The bottom line is the same as what Brian said:
- it's generally unknown what "sample size" should be used
- in cases when we know, the answer is complicated (it depends on the tree and on the model).

With multivariate data (multiple sites), the effective sample size for univariate data (like number of taxa or something smaller) should be multiplied by the number of sites, if the model assumes that sites are independent and share the same evolutionary parameters (consistent with what Brian said).

Cécile

On Sep 5, 2019, at 9:55 AM, Brian O'Meara <omeara.br...@gmail.com> wrote:

Sample size is a weird thing in this area for AICc. For comparing DNA models in something like ModelTest, number of sites is used, but for OU/BM models, we typically use number of taxa. It's not resolved what's best. Posada and Buckley (2004, https://doi.org/10.1080/10635150490522304) have a discussion on this:

Both in the AICc and the BIC descriptions above, the total number of characters was used as an estimate of sample size. However, effective sample sizes in phylogenetic studies are poorly understood, and depend on the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et al., 2000).
Characters in an alignment will often not be independent, so using the total number of characters as a surrogate for sample size (Minin et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using only the number of variable sites as an estimate of sample size is a more conservative approach, but could be an underestimate (note that all sites are used when estimating base frequencies or the proportion of invariable sites). Indeed, sample size also depends on the number of taxa. Importantly, sample size can have an effect on the outcome of model selection with the AICc. In our example above, if we were to use the number of variable characters (301 sites) as the sample size, instead of the total number of characters (1927 sites), the best AICc model would not change, but the second and third AICc models would exchange their rankings. Furthermore, because the LRT, the AIC, and the BIC strategies rely on large sample asymptotics, it is also important to decide when a sample should be considered small. Although the AICc was derived under Gaussian assumptions, Burnham et al. (1994) found that this second order expression performed well in product multinomial models for open population capture-recapture. Burnham and Anderson (2003, p. 66) suggest using this correction when the sample size is small compared to the number of adjustable parameters, n/K < 40. Alternatively, and because AICc converges to the AIC with increasing n/K ratios, one could always use the AICc (D. Anderson, personal communications). Phylogenetic characters are mostly discrete, and the unconstrained model in phylogenetics is multinomial (Goldman, 1993). One may think of an alignment of nucleotide characters as a large and sparse contingency table with 4^T bins, where T is the number of taxa. For large sample asymptotics to hold in a contingency table every cell should contain, in general, more than 5 observations (see Agresti, 1990, p. 
49, 244–250), which gives a rule of thumb of n/4^T > 5. Clearly, more research is needed on sample size in phylogenetics.

Beaulieu et al. (2018, https://doi.org/10.1093/molbev/msy222; note my COI as I'm an author on this) did some simulations on a codon model testing different ways of counting sample size (number of sites, number of taxa, number of sites * number of taxa, etc.) and found that the number of cells in the matrix (number of sites * number of taxa) seemed to work best to approximate Kullback-Leibler distance. For univariate models like that used in brownie.lite, number of cells is equal to number of taxa (since there's only one column):

We note our use of AICc, as calculated in Burnham and Anderson (2002, p. 66) and as opposed to the standard AIC, in the above model comparisons. At the outset of our study it was unclear what the appropriate sample size n is when comparing models of sequence evolution. Building upon the work of Jhwueng et al. (2014), our simulations suggest that using the number of taxa times the number of sites as the sample size correction performed best as a small sample size correction for estimating Kullback–Leibler (KL) distance in phylogenetic models (Supporting Materials). This also has an intuitive appeal. In models that have at least some parameters shared across sites and some parameters shared across taxa, increasing the number of sites and/or taxa should be adding more samples for the parameters to estimate. This is consistent considering how likelihood is calculated for phylogenetic models: the likelihood for a given site is the sum of the probabilities of each observed state at each tip, which is then multiplied across sites. It is arguable that the conventional approach in comparative methods is calculating AICc in the same way.
That is, if only one column of data (or "site") is examined, as remains remarkably common in comparative methods, when we refer to sample size, it is technically the number of taxa multiplied by number of sites, even though it is referred to simply as the number of taxa.

I suspect this is still not a great approximation. Compare a balanced tree (every internal node having two descendants) with every internal branch length the same versus a pectinate (caterpillar) tree where the two edges connecting to the root node are very long and the other edges are all near zero. For the same number of taxa and same number of sites, I bet the first tree has more meaningful data: the pectinate tree with those branch lengths will likely have all but one of the taxa having nearly identical states. So I think tree shape and branch lengths should matter for this. I've done some preliminary analyses on this, building on Beaulieu et al. (2018) and Jhwueng et al. (2014, https://doi.org/10.1515/sagmb-2013-0048, also note COI), but nothing definitive yet.

It's also worth looking at Ho and Ané (2014, https://doi.org/10.1111/2041-210X.12285) who talk about AIC in the context of OU shifts, but who get into sample size with shifts in a modified BIC that uses taxa in different regimes as sample size (but again, univariate, so maybe it's actually matrix size).

I also probably am missing important work by others -- my apologies if so. If you know of any, please let me know (and probably Karla, too!).

So, in summary.... yeah, what Liam said: number of taxa, but it might be more complex.

Best,
Brian

_______________________________________________________________________
Brian O'Meara, http://brianomeara.info, especially Calendar <http://brianomeara.info/calendar.html>, CV <http://brianomeara.info/cv.html>, and Feedback <http://brianomeara.info/feedback.html>
Professor, Dept. of Ecology & Evolutionary Biology, UT Knoxville
Associate Head, Dept.
of Ecology & Evolutionary Biology, UT Knoxville
He/Him/His

On Thu, Sep 5, 2019 at 10:00 AM Liam Revell <liam.rev...@umb.edu> wrote:

Dear Karla.

In my opinion, it is probably correct to use the number of tips on the tree as the sample size for AICc when estimating the Brownian rate, as the number of independent pieces of information is n-1, just like with an ordinary variance. For other parameters in phylogenetic comparative analyses, the effective sample size may be different, however.

All the best, Liam

Liam J. Revell
Associate Professor, University of Massachusetts Boston
Profesor Asistente, Universidad Católica de la Ssma Concepción
web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org
Academic Director UMass Boston Chile Abroad (starting 2019): https://www.umb.edu/academics/caps/international/biology_chile

On 9/5/2019 9:49 AM, Karla Shikev wrote:

Thanks so much, Liam! Just one quick follow-up question: what do you suggest should be the sample size for transforming AIC into AICc? The number of tips on the tree?

Karla

On Thu, Sep 5, 2019 at 10:27 AM Liam Revell <liam.rev...@umb.edu> wrote:

Dear Karla.

You could try & create your own logLik method for the object class "brownie.lite" as follows:

## method
logLik.brownie.lite<-function(object,...){
    lik<-setNames(
        c(object$logL1,object$logL.multiple),
        c("single-rate","multi-rate"))
    attr(lik,"df")<-c(object$k1,object$k2)
    lik
}
## fit model
fit<-brownie.lite(tree,x)
## use it
logLik(fit)
AIC(fit)

All the best, Liam

Liam J.
Revell

On 9/5/2019 9:13 AM, Karla Shikev wrote:

Dear all,

I've been trying to use brownie.lite to implement the tutorial available here (http://treethinkers.org/tutorials/morphological-evolution-in-r/) to calculate model-averaged rates of evolution and for model selection (1 versus 2 rates). However, the current version of phytools (0.6-99) won't produce AICc estimates. Does anyone know a way around this?

Any help would be greatly appreciated.
Thanks a bunch,
Karla

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
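
For readers following the thread, the AICc correction Karla asks about can be sketched in base R. This is only a minimal sketch, not part of phytools: the helper names `aicc` and `akaike_weights` and the numeric log-likelihoods below are hypothetical placeholders, and the sample size n is taken to be the number of tips, as Liam suggests (Brian and Cécile point out that the effective sample size may well be smaller). With a real analysis the inputs would presumably come from the `logL1`, `logL.multiple`, `k1`, and `k2` elements of a `brownie.lite` fit and from the number of tips in the tree.

```r
## AICc = -2 logL + 2k + 2k(k+1)/(n-k-1)
## (hypothetical helper, not a phytools function)
aicc <- function(logL, k, n) {
  stopifnot(all(n - k - 1 > 0))  # the correction term is undefined otherwise
  -2 * logL + 2 * k + (2 * k * (k + 1)) / (n - k - 1)
}

## Akaike weights from a vector of information-criterion scores
## (hypothetical helper)
akaike_weights <- function(ic) {
  delta <- ic - min(ic)   # differences from the best (lowest) score
  w <- exp(-0.5 * delta)  # relative likelihood of each model
  w / sum(w)              # normalize so the weights sum to 1
}

## Made-up values standing in for object$logL1, object$logL.multiple,
## object$k1, object$k2 and the number of tips in the tree
logL <- c("single-rate" = -50.0, "multi-rate" = -47.5)
k    <- c(2, 3)
n    <- 30  # number of tips, used here as the sample size (an assumption)

scores  <- aicc(logL, k, n)
weights <- akaike_weights(scores)
print(scores)
print(weights)
```

Changing n (for example to a smaller effective sample size) only changes the correction term 2k(k+1)/(n-k-1), so it matters most when n is small relative to k.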