[MORPHMET] Comment and advice on bgPCA

[email protected] Tue, 28 May 2019 11:38:39 -0700

Dear all,

I also want to comment on the recent bgPCA postings.


Andrea et al. and Fred are right that bgPCA produces ordination plots in 
which two or more groups are discriminated more (i.e., the groups overlap 
less) than they should be, whenever p (number of variables) is large relative 
to n (sample size). Thanks Andrea for noticing that, or whoever figured it 
out first; it was not me, admittedly. In the case of samples from the same 
distribution (i.e., no "real" group differences), the samples can even 
appear to be distinct if p is larger than n. This phenomenon is much more 
severe in CVA than in bgPCA (as we showed in the 2011 paper), but we were 
not aware back then that it is also present in bgPCA. Please note that this 
does NOT mean that ALL results inferred from bgPCA are wrong; only those 
about group separation are biased. The relationship between group means in 
bgPCA is necessarily the same as in an ordinary PCA (but see below).

I have two main comments and pieces of advice.

1) The simulations of identical independent noise for an increasing number 
of variables, as in Fred's current manuscript and in our 2011 paper, are 
not quite realistic because morphometric variables are highly correlated; 
the "real" degrees of freedom thus are much less than the number of 
variables. Put another way, if you set more and more landmarks on a sample 
of specimens, not every landmark introduces a new degree of freedom because 
its location may be predictable by the adjacent landmarks. Theoretically, 
there is a maximal number of degrees of freedom in a given sample that 
reflects the actual spatial scale of the shape differences studied. If the 
given shape differences are captured well by the current landmark set, 
adding more landmarks will not add any further information and not increase 
the relevant degrees of freedom. For example, if shape variation comprises 
only affine shape variation (linear scaling and shearing), the relevant 
shape space has only two degrees of freedom (two dimensions), regardless 
of how many landmarks were measured.

As a result of this, most morphometric data, even those consisting of many 
landmarks, can be described well by a small number of principal components, 
as we all know. Ideally, these few PCs capture the "real" dimensionality of 
shape space (i.e., they are some rotation of the underlying factor 
structure), which is much less than the number of landmarks. In practice, 
the problem is that every landmark entails some small independent 
measurement error, and hence the "cut-off" for the number of dimensions is 
not necessarily obvious. In the above example with only affine shape 
variation, for more than three landmarks there will still be more than two 
PCs with non-zero variance, but hopefully the first two PCs are a good 
estimate of these affine components. Methods other than ordinary PCA may do a 
better job for this task, e.g. methods that take into account spatial 
scale, such as the spatially weighted relative warps in Bookstein's orange 
book or the relative intrinsic warps in Bookstein (2015). Blame Fred for 
these names ;-) 
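
The affine example can be sketched numerically. The following is only an 
illustration under stated assumptions, not an analysis of real data: the 
function names, the random reference configuration, and the two standard 
deviations are made up, and no Procrustes superimposition is performed 
because the simulated specimens differ only by stretching, shearing, and 
landmark-wise noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def affine_sample(n_specimens, n_landmarks, affine_sd=0.1, noise_sd=0.01):
    """Simulate 2D landmark sets that differ only by stretching and shearing
    (two parameters), plus small independent error at every landmark."""
    ref = rng.uniform(-1.0, 1.0, size=(n_landmarks, 2))  # hypothetical reference
    ref -= ref.mean(axis=0)
    data = np.empty((n_specimens, 2 * n_landmarks))
    for i in range(n_specimens):
        a, b = rng.normal(0.0, affine_sd, size=2)        # the two affine parameters
        A = np.array([[1.0 + a, b], [b, 1.0 - a]])       # symmetric: no rotation term
        shape = ref @ A.T + rng.normal(0.0, noise_sd, size=ref.shape)
        data[i] = shape.ravel()
    return data

def pca_variances(X):
    """Variances of all principal components, largest first."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s ** 2 / (X.shape[0] - 1)

for k in (10, 40, 160):
    ev = pca_variances(affine_sample(200, k))
    share = ev[:2].sum() / ev.sum()
    print(f"{k:4d} landmarks: first two PCs hold {share:.3f} of total variance, "
          f"PC3 variance = {ev[2]:.2e}")
```

Under these assumptions the first two PCs should carry nearly all of the 
variance however many landmarks are used, while the third and later PCs 
have small but non-zero variance that comes from the measurement error 
alone.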

Many multivariate statistical analyses - including bgPCA, CVA, relative 
PCA, and also the computation of shape distances or angles between shape 
trajectories, etc. - should be performed within this subspace (i.e., based 
on the first few PCs rather than on the original shape coordinates). bgPCA 
and CVA may be considered kinds of factor rotation within this subspace 
rather than methods of variable reduction.

Hence, many of the problems described by Andrea et al. and Fred can be 
avoided by variable reduction (ordinary PCA) prior to bgPCA and related 
techniques. This requires a careful inspection of the scree plot and the 
corresponding PCs. The actual sample size must be large relative to the 
number of PCs retained (not necessarily relative to the number of 
landmarks).
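
A rough simulation of this advice, as a sketch only: data are generated 
with a few strong latent factors plus independent noise and NO true group 
difference, and the spurious separation along the bgPC is compared with 
and without prior reduction to the first PCs. The factor structure, the 
sample sizes, the choice of k = 3, and all function names are assumptions 
made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_no_group_difference(n=40, p=300, n_factors=3, factor_sd=0.6):
    """n specimens, p correlated variables, no true group difference."""
    loadings = rng.normal(0.0, factor_sd, size=(n_factors, p))
    factors = rng.normal(size=(n, n_factors))
    return factors @ loadings + rng.normal(size=(n, p))

def first_pcs(X, k):
    """Scores on the first k ordinary principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def bgpca_separation(X, groups):
    """Squared standardized distance between two group means along the bgPC
    (for two groups, the direction of the mean difference)."""
    g0, g1 = X[groups == 0], X[groups == 1]
    d = g1.mean(axis=0) - g0.mean(axis=0)
    axis = d / np.linalg.norm(d)
    s0, s1 = g0 @ axis, g1 @ axis
    ss_within = ((s0 - s0.mean()) ** 2).sum() + ((s1 - s1.mean()) ** 2).sum()
    pooled_var = ss_within / (len(s0) + len(s1) - 2)
    return (s1.mean() - s0.mean()) ** 2 / pooled_var

def replicate(k=3):
    X = simulate_no_group_difference()
    groups = np.repeat([0, 1], X.shape[0] // 2)   # arbitrary group labels
    return bgpca_separation(X, groups), bgpca_separation(first_pcs(X, k), groups)

results = np.array([replicate() for _ in range(200)])
full_mean, reduced_mean = results.mean(axis=0)
print(f"mean spurious separation, all 300 variables: {full_mean:.2f}")
print(f"mean spurious separation, first 3 PCs:       {reduced_mean:.2f}")
```

In this setting the separation computed on all variables should come out 
clearly larger than the one computed on the first three PCs, although 
neither is exactly zero. With real data the number of PCs to retain is 
not known in advance, which is why the scree plot needs inspection.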


2) Many applications of PCA or CVA aim to combine multiple analytical steps 
that are not necessarily commensurate: 

- Exploratory study of group mean differences
- Relating multivariate mean differences across multiple groups by an 
ordination analysis
- Discrimination analysis (studying if and to what degree groups overlap in 
their distribution of individual variation)
- Perhaps even the estimation of a discrimination function, i.e., a 
combination of variables that maximally discriminates the groups.

The value or burden of having many landmarks is different for each of these 
tasks.

When "exploring" differences in average shape between groups, without 
strong prior expectations (i.e., without knowing where the signal is), it 
is clearly useful to measure as many landmarks as possible, as this 
increases spatial resolution. In contrast to Andrea, I think that 
"beautiful pictures" can be of value because morphology is a visual 
discipline, after all. For computing group means or shape regressions, p>n 
is no problem. The challenge in this step is to judge whether the observed 
differences are scientifically relevant, which may (but often does not) 
include the assessment of statistical significance. An excess of variables 
over cases can challenge statistical significance testing (multivariate 
parametric tests require full rank data and n>>p; for shape coordinates 
this ALWAYS requires dimension reduction, even for three landmarks). 

Only if group means really differ does it make sense to relate multivariate 
group mean differences by an ordination analysis. This requires an 
interpretable metric (a "distance" function such as Procrustes distance), 
which is itself challenging and can constrain the geometric structures that 
are interpretable (e.g., Mitteroecker & Huttegger 2009, Huttegger & 
Mitteroecker 2011). Technically, this step sets no limits to the number of 
variables, but for normally distributed variables the expected value of a 
Euclidean distance increases linearly with the square root of the number of 
variables (chi distribution). This is no problem per se, but for small 
signals and many variables, the summed noise in the many landmarks can 
dominate the small signal. Also, this leads to a somewhat paradoxical 
situation: even if for each variable the estimated sample average is close 
to the population mean, the Euclidean distance between the multivariate 
sample average and the multivariate population mean increases with p. This 
relationship is also the reason why bgPC scores show too much group 
separation if p is large: the more variables, the larger the distance 
between group means, even though the within-group variances stay the same 
(for two groups, the squared Mahalanobis distance for the bgPC is approx. 
2p/n). 
Perhaps more important than the _number_ of variables is the spatial 
distribution of landmarks on the organism. E.g., structures covered by many 
landmarks more strongly affect the multivariate distance than structures 
covered by fewer landmarks. Semilandmarks may or may not be helpful in this 
regard for a comprehensive coverage of organisms. 
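
The growth of the distance with p and the 2p/n approximation can be 
checked with a small simulation of pure noise. This is a sketch under 
assumptions: independent standard normal variables, two groups drawn from 
the same distribution, and n read as the size of each of the two equally 
sized groups (the text above does not spell this out). The function name 
and the settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def noise_separation(p, n_per_group=20, reps=200):
    """Average distance between two group means of pure noise, and average
    squared standardized distance along the bgPC (mean-difference axis)."""
    dist = np.empty(reps)
    d2 = np.empty(reps)
    for r in range(reps):
        a = rng.normal(size=(n_per_group, p))
        b = rng.normal(size=(n_per_group, p))
        d = a.mean(axis=0) - b.mean(axis=0)
        dist[r] = np.linalg.norm(d)                  # Euclidean distance of means
        axis = d / dist[r]                           # bgPC for two groups
        sa, sb = a @ axis, b @ axis
        pooled_var = (sa.var(ddof=1) + sb.var(ddof=1)) / 2   # equal group sizes
        d2[r] = (sa.mean() - sb.mean()) ** 2 / pooled_var
    return dist.mean(), d2.mean()

n = 20
for p in (10, 100, 1000):
    mean_dist, mean_d2 = noise_separation(p, n_per_group=n)
    print(f"p={p:5d}  distance={mean_dist:6.2f}  sqrt(2p/n)={np.sqrt(2 * p / n):6.2f}  "
          f"bgPC D^2={mean_d2:7.2f}  2p/n={2 * p / n:7.2f}")
```

Under these assumptions the distance between the two sample means grows 
roughly with the square root of p, the within-group variance along the 
bgPC stays near 1, and the squared standardized distance tracks 2p/n even 
though the groups come from the same distribution.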

Discrimination analysis (DA) aims at assessing the success of 
classification. Classification and discrimination require the estimation of 
variance for every linear combination of the variables and thus n>>p (in 
most multivariate settings, this implies prior variable reduction). 
Reliable DA also requires a cross-validation approach and cannot be 
inferred without bias from a standard ordination analysis: PCA tends to 
underestimate classification success (group separation), whereas bgPCA, and 
even more so CVA, tends to overestimate it. DA can be considered an 
exploratory approach, but it only makes sense if group means are known to 
differ.

Discriminant function analysis (DFA), and its extension to multiple groups 
(CVA), estimate linear combinations of the measured variables that 
_maximize_ group separation and, hence, classification success. This goes 
beyond the exploratory analysis of group differences and need not be 
combined with an ordination analysis. Classically, it is used 
to derive a simple linear combination of the variables for efficient 
classification. It is well known for these methods that the within-sample 
classification success is a highly biased estimate of the out-of-sample 
classification success; hence the need for cross-validation.
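
The gap between within-sample and out-of-sample classification success 
can be illustrated with a two-group linear discriminant function fitted to 
pure noise. This is a minimal sketch, not a recommended workflow: the 
helper names, the sample size (40 specimens), and the number of variables 
(20, standing in for retained PCs) are assumptions, and leave-one-out is 
only one of several possible cross-validation schemes.

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_lda(X, y):
    """Two-group linear discriminant function from the pooled covariance."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    resid = np.vstack([X[y == 0] - m0, X[y == 1] - m1])
    S = resid.T @ resid / (len(X) - 2)          # pooled within-group covariance
    w = np.linalg.solve(S, m1 - m0)             # discriminant coefficients
    return w, w @ (m0 + m1) / 2                 # coefficients and cut-off

def predict(model, X):
    w, threshold = model
    return (X @ w > threshold).astype(int)

def resubstitution_accuracy(X, y):
    """Classify the same specimens that were used to fit the function."""
    return np.mean(predict(fit_lda(X, y), X) == y)

def loo_accuracy(X, y):
    """Leave-one-out: refit without specimen i, then classify specimen i."""
    hits = 0
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        model = fit_lda(X[keep], y[keep])
        hits += int(predict(model, X[i:i + 1])[0] == y[i])
    return hits / len(X)

y = np.repeat([0, 1], 20)
resub, loo = [], []
for _ in range(50):
    X = rng.normal(size=(40, 20))               # no true group difference
    resub.append(resubstitution_accuracy(X, y))
    loo.append(loo_accuracy(X, y))
resub_mean, loo_mean = np.mean(resub), np.mean(loo)
print(f"within-sample accuracy:    {resub_mean:.2f}")
print(f"leave-one-out accuracy:    {loo_mean:.2f}")
```

With no real group difference the cross-validated success should stay 
near chance, whereas the within-sample success should be far higher, and 
the more so the closer the number of variables gets to the sample size.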

No single method can do all these steps well. The choice of method and also 
the choice of landmarks really depend on the biological question and the 
prior knowledge or hypotheses. If discrimination or classification is the 
primary aim, cross-validation is indispensable; an ordination analysis is 
not sufficient, perhaps not even necessary. If the signal (morphological 
difference) is known prior to the analysis, not many landmarks are 
necessary. Without any prior expectation, a dense landmark set may be 
necessary to explore shape variation. But this sets fundamental limits to 
studies of discrimination and classification; there is a kind of 
"uncertainty principle": for a given sample size, you cannot observe 
arbitrarily high spatial resolution (number of variables) and the exact 
discrimination of groups (classification success) at the same time. 

Best,

Philipp M.

-- 
MORPHMET may be accessed via its webpage at http://www.morphometrics.org
