Dear all, I also want to comment on the recent bgPCA postings.
Andrea et al. and Fred are right that bgPCA produces ordination plots in which two or more groups are discriminated more (i.e., the groups overlap less) than they should be, whenever p (the number of variables) is large relative to n (the sample size). Thanks to Andrea for noticing that, or to whoever figured it out first; it was not me, admittedly. Even samples from the same distribution (i.e., with no "real" group differences) can appear to be distinct if p is larger than n. This phenomenon is much more severe in CVA than in bgPCA (as we showed in the 2011 paper), but we were not aware back then that it is also present in bgPCA. Please note that this does NOT mean that ALL results inferred from bgPCA are wrong; only those about group separation are biased. The relationship between group means in bgPCA is necessarily the same as in an ordinary PCA (but see below).

I have two main comments and pieces of advice.

1) The simulations of identical independent noise for an increasing number of variables, as in Fred's current manuscript and in our 2011 paper, are not quite realistic because morphometric variables are highly correlated; the "real" degrees of freedom are thus far fewer than the number of variables. Put another way, if you set more and more landmarks on a sample of specimens, not every landmark introduces a new degree of freedom, because its location may be predictable from the adjacent landmarks. Theoretically, there is a maximal number of degrees of freedom in a given sample that reflects the actual spatial scale of the shape differences studied. If the given shape differences are captured well by the current landmark set, adding more landmarks will not add any further information and will not increase the relevant degrees of freedom. For example, if shape variation comprises only affine shape variation (linear scaling and shearing), the relevant shape space has only two degrees of freedom (two dimensions), regardless of how many landmarks were measured.
As a result of this, most morphometric data, even those consisting of many landmarks, can be described well by a small number of principal components, as we all know. Ideally, these few PCs capture the "real" dimensionality of shape space (i.e., they are some rotation of the underlying factor structure), which is much lower than the number of landmarks. In practice, the problem is that every landmark entails some small independent measurement error, and hence the "cut-off" for the number of dimensions is not necessarily obvious. In the above example with only affine shape variation, for more than three landmarks there will still be more than two PCs with non-zero variance, but hopefully the first PCs are a good estimate of the affine components. Methods other than ordinary PCA may do a better job at this task, e.g., methods that take into account spatial scale, such as the spatially weighted relative warps in Bookstein's orange book or the relative intrinsic warps in Bookstein (2015). Blame Fred for these names ;-)

Many multivariate statistical analyses - including bgPCA, CVA, relative PCA, and also the computation of shape distances or angles between shape trajectories, etc. - should be performed within this subspace (i.e., based on the first few PCs rather than on the original shape coordinates). bgPCA and CVA may be considered kinds of factor rotation within this subspace rather than methods of variable reduction. Hence, many of the problems described by Andrea et al. and Fred can be avoided by variable reduction (ordinary PCA) prior to bgPCA and related techniques. This requires a careful inspection of the scree plot and the corresponding PCs. The actual sample size must be large relative to the number of PCs retained (not necessarily relative to the number of landmarks).
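As a rough illustration of why the number of dimensions entering bgPCA matters, the following sketch draws two groups from the SAME distribution and measures how far apart their means appear along the between-group axis. It rests on loud simplifications: independent standard normal values stand in for retained PC scores (real morphometric data are correlated), there are only two groups (so the single bgPC is simply the direction of the mean difference), and the function names and default sample sizes are hypothetical.

```python
import random


def bgpc_separation(n_dims, n_per_group, rng):
    """Squared standardized distance between two group means along the
    between-group axis, for two groups drawn from the same distribution."""

    def draw_group():
        # Assumption: i.i.d. standard normal scores stand in for retained PCs.
        return [[rng.gauss(0.0, 1.0) for _ in range(n_dims)]
                for _ in range(n_per_group)]

    def mean_vector(group):
        return [sum(col) / len(group) for col in zip(*group)]

    a, b = draw_group(), draw_group()
    mean_a, mean_b = mean_vector(a), mean_vector(b)

    # With two groups, the single bgPC is the direction of the mean difference.
    diff = [x - y for x, y in zip(mean_a, mean_b)]
    norm = sum(d * d for d in diff) ** 0.5
    axis = [d / norm for d in diff]

    def scores(group):
        return [sum(x * u for x, u in zip(row, axis)) for row in group]

    score_a, score_b = scores(a), scores(b)
    centre_a = sum(score_a) / n_per_group
    centre_b = sum(score_b) / n_per_group

    # Pooled within-group variance of the scores along the axis.
    ss = (sum((s - centre_a) ** 2 for s in score_a)
          + sum((s - centre_b) ** 2 for s in score_b))
    pooled_var = ss / (2 * n_per_group - 2)
    return (centre_a - centre_b) ** 2 / pooled_var


def average_separation(n_dims, n_per_group=10, replicates=50, seed=1):
    """Average the separation over several simulated data sets."""
    rng = random.Random(seed)
    total = sum(bgpc_separation(n_dims, n_per_group, rng)
                for _ in range(replicates))
    return total / replicates


for n_dims in (2, 20, 200):
    print(n_dims, round(average_separation(n_dims), 2))
```

With the group size held fixed, the printed separation should grow with the number of dimensions even though no true group difference exists, which is the reason for keeping the number of retained PCs small relative to the sample size.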
2) Many applications of PCA or CVA aim to combine multiple analytical steps that are not necessarily commensurate:
- Exploratory study of group mean differences
- Relating multivariate mean differences across multiple groups by an ordination analysis
- Discrimination analysis (studying if and to what degree groups overlap in their distribution of individual variation)
- Perhaps even the estimation of a discrimination function, i.e., a combination of variables that maximally discriminates the groups.

The value or burden of having many landmarks is different for each of these tasks.

When "exploring" differences in average shape between groups without strong prior expectations (i.e., without knowing where the signal is), it is clearly useful to measure as many landmarks as possible, as this increases spatial resolution. In contrast to Andrea, I think that "beautiful pictures" can be of value because morphology is a visual discipline, after all. For computing group means or shape regressions, p>n is no problem. The challenge in this step is to judge whether the observed differences are scientifically relevant, which may (but often does not) include the assessment of statistical significance. An excess of variables over cases can challenge statistical significance testing (multivariate parametric tests require full-rank data and n>>p; for shape coordinates this ALWAYS requires dimension reduction, even for three landmarks).

Only if group means really differ does it make sense to relate multivariate group mean differences by an ordination analysis. This requires an interpretable metric (a "distance" function such as Procrustes distance), which is itself challenging and can constrain the geometric structures that are interpretable (e.g., Mitteroecker & Huttegger 2009, Huttegger & Mitteroecker 2011).
Technically, this step sets no limits on the number of variables, but for normally distributed variables the expected value of a Euclidean distance increases approximately in proportion to the square root of the number of variables (chi distribution). This is no problem per se, but for small signals and many variables, the summed noise in the many landmarks can dominate the small signal. Also, this leads to a somewhat paradoxical situation: even if for each variable the estimated sample average is close to the population mean, the Euclidean distance between the multivariate sample average and the multivariate population mean increases with p. This relationship is also the reason why bgPC scores show too much group separation if p is large: the more variables, the larger the distance between group means, even though the within-group variances stay the same (for two groups, the squared Mahalanobis distance for the bgPC is approx. 2p/n).

Perhaps more important than the _number_ of variables is the spatial distribution of landmarks on the organism. E.g., structures covered by many landmarks affect the multivariate distance more strongly than structures covered by fewer landmarks. Semilandmarks may or may not be helpful in this regard for a comprehensive coverage of organisms.

Discrimination analysis (DA) aims at assessing the success of classification. Classification and discrimination require the estimation of variance for every linear combination of the variables and thus n>>p (in most multivariate settings, this implies prior variable reduction). Reliable DA also requires a cross-validation approach, and classification success cannot be inferred without bias from a standard ordination analysis: PCA tends to underestimate classification success (group separation), whereas bgPCA, and even more so CVA, tends to overestimate it. DA can be considered an exploratory approach, but it only makes sense if group means are known to differ.
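The need for cross-validation can be illustrated with a small simulation; this is a sketch under stated assumptions, not a recommended analysis pipeline. Two groups are drawn from the same distribution (independent standard normal variables, p much larger than n), and specimens are classified by a nearest-group-mean rule. For two groups this rule is equivalent to splitting the scores at the midpoint of the axis between the two group means. The function names, group sizes, and the choice of leave-one-out cross-validation are all assumptions made for the example.

```python
import random


def nearest_mean_accuracy(data, labels, leave_one_out):
    """Share of specimens assigned to their own group by a nearest-group-mean rule.

    If leave_one_out is True, each specimen is excluded from the computation
    of the group means before it is classified.
    """
    correct = 0
    for i, (x, label) in enumerate(zip(data, labels)):
        dist = {}
        for g in sorted(set(labels)):
            members = [row for j, (row, lab) in enumerate(zip(data, labels))
                       if lab == g and not (leave_one_out and j == i)]
            mean = [sum(col) / len(members) for col in zip(*members)]
            dist[g] = sum((a - b) ** 2 for a, b in zip(x, mean))
        if min(dist, key=dist.get) == label:
            correct += 1
    return correct / len(data)


def simulate(p=200, n_per_group=10, replicates=10, seed=1):
    """Average within-sample and leave-one-out accuracy over pure-noise data sets."""
    rng = random.Random(seed)
    labels = ["A"] * n_per_group + ["B"] * n_per_group
    resub = cv = 0.0
    for _ in range(replicates):
        # Both groups come from the same distribution: no real difference.
        data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in labels]
        resub += nearest_mean_accuracy(data, labels, leave_one_out=False)
        cv += nearest_mean_accuracy(data, labels, leave_one_out=True)
    return resub / replicates, cv / replicates


resub, cv = simulate()
print("within-sample accuracy:", round(resub, 2))
print("leave-one-out accuracy:", round(cv, 2))
```

The within-sample accuracy should come out close to perfect although the groups do not differ, whereas the leave-one-out accuracy should stay near chance level.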
Discriminant function analysis (DFA) and its extension to multiple groups (CVA) estimate linear combinations of the measured variables that _maximize_ group separation and, hence, classification success. This goes beyond the exploratory analysis of group differences and does not necessarily need to be combined with an ordination analysis. Classically, it is used to derive a simple linear combination of the variables for efficient classification. It is well known for these methods that the within-sample classification success is a highly biased estimate of the out-of-sample classification success; hence the need for cross-validation.

No single method can do all these steps well. The choice of method and also the choice of landmarks really depend on the biological question and on the prior knowledge or hypotheses. If discrimination or classification is the primary aim, cross-validation is indispensable; an ordination analysis is not sufficient, perhaps not even necessary. If the signal (morphological difference) is known prior to the analysis, not many landmarks are necessary. Without any prior expectation, a dense landmark set may be necessary to explore shape variation. But this sets fundamental limits to studies of discrimination and classification; there is a kind of "uncertainty principle": for a given sample size, you cannot observe arbitrarily high spatial resolution (number of variables) and the exact discrimination of groups (classification success) at the same time.

Best,
Philipp M.

--
MORPHMET may be accessed via its webpage at http://www.morphometrics.org