From: Fred L. Bookstein
Seattle, August 6, 2017
Dear MorphMetters,
This note calls your attention to two new papers of mine that attempt to
rebuild the advanced part of GMM -- the part of the toolbox that
transforms linear multivariate statistical analysis of shape coordinates
from mere arithmetic into real understanding of an evolutionary or
developmental process.
One of the papers argues the importance for our work of a deep theorem
from 1967 that isn't in our textbooks yet. Suppose your data set
consists of a long list of shape coordinates over a sample whose count
of specimens is a small multiple of that shape coordinate count, say,
fewer than ten times the count of landmarks. Usually, in circumstances
like these, most of the dimensions of the shape coordinate space can be
modeled the way Dryden and Mardia indicated decades ago: as uncorrelated
Gaussians with the same small variance. And when _that_ is the case,
which is most of the time in the current GMM literature, the
multivariate statistical tests that we invoke most often will typically
lead to invalid evolutionary or developmental inferences and
interpretations. One common type of study that this warning covers is
the design that involves inverting a covariance matrix (in the course of
tasks like multiple regression, relative eigenanalysis, or MANOVA)
without examining all its principal components and their eigenvalues,
not just the largest few. Another large suspect group comprises the
studies that accept such a matrix as truth for some sort of
maximum-likelihood analysis or permutation analysis rather than formally
modeling it by a specific combination of biologically sensible factors
together with noise. My paper goes on to demonstrate alternative
approaches to biological understanding, including one that applies to
partial least squares analyses, and closes with six "imperatives" (terse
advisory slogans).
The other paper of this pair shows why principal components analysis of
shape coordinate data, an approach that has often been regarded as a
useful classification tool since the earliest days of GMM, should not
ever be trusted to arrive at a valid understanding of the organismal
process that interests you, however excellent your study design might
otherwise be. When people claim the opposite, they are expressing a
heartfelt wish for which there is no actual biological justification:
that maximizing the variance of a linear combination of Procrustes shape
coordinates validly conveys a meaning for the organism regardless of how
those landmarks and the attached shape-coordinate displacement vectors
might be situated over its idealized image. I go on to introduce and
demonstrate a novel alternative, varimax factor analysis of
bending-energy-adjusted partial warp scores, that I think our community
should explore as a candidate for the missing praxis.
Together the papers disqualify just about every statistic, except for
shape regression, that we ever taught you to compute once you had
produced your Procrustes shape coordinates. Both of the papers reassert
and extend my argument of recent years that Procrustes distance per se
is not a biologically meaningful quantity. To get to a valid biological
explanation from Procrustes shape coordinates, you need a pattern
language more powerful than what textbook multivariate statistics offers
us -- a pattern language that pays attention to the shape coordinate
averages (that is, to the mean landmark configuration) along with their
covariance matrix. An excerpt from the first of these papers might serve
as a good summary of both: "Linear multivariate analysis of shape
coordinate data is difficult not only computationally but also
conceptually. You should not permit your computer to make it seem easy
by oversimplifying either your questions or your answers."
Both papers are available via the preprint servers of the corresponding
journal websites. The first of the two, "A newly noticed formula
enforces fundamental limits on geometric morphometric analysis," has
DOI 10.1007/s11692-017-9424-9 or can be reached via the website for
Evolutionary Biology (note: this is a different journal from Journal of
Evolutionary Biology, so google it carefully). The second, "A method of
factor analysis for shape coordinates," has DOI 10.1002/ajpa.23277 and
is also accessible via the website of the American Journal of Physical
Anthropology.
Here are their abstracts.
For the first one:
The textbook literature of principal components analysis (PCA) dates
from a period when statistical computing was much less powerful than it
is today and the dimensionality of data sets typically processed by PCA
correspondingly much lower. When the formulas in those textbooks involve
limiting properties of PCA descriptors, the limit involved is usually
the indefinite increase of sample size for a fixed roster of variables.
But contemporary applications of PCA in organismal systems biology,
particularly in geometric morphometrics (GMM), generally involve much
greater counts of variables. The way one might expect pure noise to
degrade the biometric signal in this more contemporary context is
described by a different mathematical literature concerned with the
situation where the count of variables itself increases while remaining
proportional to the count of specimens. The founders of this literature
established a result of startling simplicity.
Consider steadily larger and larger data sets consisting of completely
uncorrelated standardized Gaussians (mean zero, variance 1) such that
the ratio of variables to cases (the so-called p/n ratio) is fixed at a
value y. Then the largest eigenvalue of their covariance matrix tends
to (1+\sqrt{y})^2, the smallest tends to (1-\sqrt{y})^2, and their ratio
tends to the limiting value ((1+\sqrt{y})/(1-\sqrt{y}))^2, whereas in
the uncorrelated model both of these eigenvalues and also their ratio
should be just 1.0. For y=1/4, not an atypical value for GMM data sets,
this ratio is 9; for y=1/2, which is still not atypical, it is 34.
These extrema and ratios, easily confirmed in simulations of realistic
size and consistent with real GMM findings in typical applied settings,
bear severe negative implications for any technique that involves
inverting a covariance structure on shape coordinates, including
multiple regression on shape, discriminant analysis by shape, canonical
variates analysis of shape, covariance distance analysis from shape, and
maximum-likelihood estimation of shape distributions that are not
constrained by strong prior models. The theorem also suggests that we
should use extreme caution whenever considering a biological
interpretation of any Partial Least Squares analysis involving large
numbers of landmarks or semilandmarks. I illuminate these concerns with
the aid of one simulation, two explicit reanalyses of previously
published data, and several little sermons.
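The limiting formulas quoted in this abstract are easy to check numerically. The Python sketch below is not taken from the paper. It assumes NumPy is available, and the function names (mp_edges, limiting_ratio, simulate_extremes), the sample sizes, and the random seed are illustrative choices only. It draws completely uncorrelated standard Gaussians at a fixed p/n ratio y and compares the extreme eigenvalues of the sample covariance matrix with the limits (1-\sqrt{y})^2 and (1+\sqrt{y})^2.

```python
import numpy as np


def mp_edges(y):
    """Limiting smallest and largest eigenvalues for a p/n ratio y, 0 < y < 1."""
    root = np.sqrt(y)
    return (1.0 - root) ** 2, (1.0 + root) ** 2


def limiting_ratio(y):
    """Limiting ratio of the largest eigenvalue to the smallest."""
    smallest, largest = mp_edges(y)
    return largest / smallest


def simulate_extremes(n, p, seed=0):
    """Extreme eigenvalues of the sample covariance of n cases on p noise variables."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))        # uncorrelated, mean 0, variance 1
    cov = np.cov(x, rowvar=False)          # p-by-p sample covariance matrix
    eigenvalues = np.linalg.eigvalsh(cov)  # returned in ascending order
    return eigenvalues[0], eigenvalues[-1]


# Illustrative sizes only; n and the seed are assumptions, not values from the paper.
n = 1000
for y in (0.25, 0.5):
    p = int(y * n)
    lim_small, lim_large = mp_edges(y)
    sim_small, sim_large = simulate_extremes(n, p)
    print(f"y = {y}: limits {lim_small:.3f} and {lim_large:.3f}, "
          f"limiting ratio {limiting_ratio(y):.1f}")
    print(f"   simulated {sim_small:.3f} and {sim_large:.3f}, "
          f"ratio {sim_large / sim_small:.1f}")
```

For finite n the simulated extremes fall slightly inside the limiting interval, so the simulated ratio sits a little below its limit. The point to notice is that the eigenvalues spread this widely even though every population eigenvalue is exactly 1.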
For the second one:
Currently the most common reporting style for a geometric morphometric
(GMM) analysis of anthropological data begins with the principal
components of the shape coordinates to which the original landmark data
have been converted. But this focus often frustrates the organismal
biologist, mainly because principal component analysis (PCA) is not
aimed at scientific interpretability of the loading patterns actually
uncovered. The difficulty of making biological sense of a PCA is
heightened by aspects of the shape coordinate setting that further
diverge from our intuitive expectations of how morphometric measurements
ought to combine. More than fifty years ago one of our sister
disciplines, psychometrics, managed to build an algorithmic route from
principal component analysis to scientific understanding via the toolkit
generally known as factor analysis. This article introduces a
modification of one standard factor-analysis approach, Henry Kaiser's
varimax rotation of 1958, that accommodates two of the major differences
between the GMM context and the psychometric context for these
approaches: the coexistence of "general" and "special" factors of form
as adumbrated by Sewall Wright, and the typical loglinearity of partial
warp variance as a function of bending energy. I briefly explain the
history of principal components in biometrics and the contrast with
factor analysis, introduce the modified varimax algorithm I am
recommending, and work three examples that are reanalyses of previously
published cranial data sets. A closing discussion emphasizes the
desirability of superseding PCA by algorithms aimed at anthropological
understanding rather than classification or ordination.
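For readers who have not met varimax before, here is a minimal Python sketch of the plain ("raw") varimax rotation, the starting point that this abstract says the article modifies. It is emphatically not the bending-energy-adjusted algorithm of the paper. It is the widely used SVD-based iteration for Kaiser's criterion, without Kaiser normalization. The loading matrix, the 30-degree starting rotation, and the function names (varimax, varimax_criterion) are hypothetical illustrations, and NumPy is assumed.

```python
import numpy as np


def varimax_criterion(loadings):
    """Raw varimax criterion: summed variance of squared loadings, per column."""
    return np.var(loadings ** 2, axis=0).sum()


def varimax(loadings, max_iter=500, tol=1e-8):
    """Plain varimax rotation; returns (rotated loadings, orthogonal rotation)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target matrix for the varimax criterion.
        column_ss = np.sum(rotated ** 2, axis=0)
        target = rotated ** 3 - (rotated * column_ss) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                  # nearest orthogonal matrix
        current = s.sum()
        if current < previous * (1.0 + tol):
            break                          # no meaningful improvement
        previous = current
    return loadings @ rotation, rotation


# Hypothetical loadings: two clean factors, then blurred by a 30-degree rotation.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
theta = np.deg2rad(30.0)
blur = np.array([[np.cos(theta), np.sin(theta)],
                 [-np.sin(theta), np.cos(theta)]])
blurred = simple @ blur

recovered, rotation = varimax(blurred)
print("criterion before rotation:", varimax_criterion(blurred))
print("criterion after rotation: ", varimax_criterion(recovered))
print(np.round(recovered, 3))
```

The rotation is orthogonal, so each variable's communality (its row sum of squared loadings) is unchanged. Only the way that variance is shared out among the factors changes, which is what makes the rotated loadings easier to read.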