Re: [MORPHMET] Re: number of landmarks and sample size
Hi Will,

I think you meant to say that you are writing a study-design paper presenting results of simulations and power analyses to determine appropriate sample sizes for multivariate analyses in geometric morphometrics. But I would think that would have already been settled by now, and possibly would be more relevant for certain clustering methods.

The only parameterized PCA variant I am aware of is kernel PCA, a nonlinear PCA method used for pattern analysis (e.g., in image analysis), but it is not often employed in biological geometric morphometrics papers (at least, those that I frequently come across). When kernels are used, they are usually meant to estimate densities of reduced-dimensionality data such as CS, or PCs as shape variables.

Best,
Justin

Justin C. Bagley, Ph.D.
Postdoctoral Scholar, Plant Evolutionary Genomics Laboratory, Department of Biology, Virginia Commonwealth University, Richmond, VA 23284-2012
jcbag...@vcu.edu
Senior/Postdoctoral Research Associate, Departamento de Zoologia, Universidade de Brasília, Campus Universitário Darcy Ribeiro, 70910-900 Brasília, DF, Brasil
Website: http://www.justinbagley.org
Lattes CV: http://lattes.cnpq.br/0028570120872581
Re: [MORPHMET] Re: number of landmarks and sample size
In discussions like these it would be helpful if writers could clarify whether they are referring to biological homology, topological homology or "semantic homology". These aren't the same things, and the whole issue of "homology" in geometric morphometrics has always seemed, at least to me, to be very confused. For example, refer to the definitions of "homology" and "landmark" in the Glossary on the SB Morphometrics web site.

Because it means different things to different specialists, homology isn't a term to be thrown around as lightly as morphometricians seem prone to do. Imprecise or ambiguous usage makes sentences difficult or impossible for me to understand, and I suspect it confuses others as well.

Norm MacLeod
Re: [MORPHMET] Re: number of landmarks and sample size
Hi Philipp,

I am not worried about the number of variables (although I am not sure one needs thousands of highly correlated points on a relatively simple structure, and I seem to remember that Gunz and you suggest starting with many and then reducing as appropriate).

Regardless of whether point homology makes sense, I am worried that many users believe that semilandmarks (maybe after sliding according to purely mathematical principles) are the same as "traditional landmarks" with a clear one-to-one correspondence. Even saying that what's "homologous" is the curve or surface is tricky, because at the end of the day that curve/surface is discretized using points, shape distances are based on those points, and there are many ways of placing points with no clear "homology" (figure 7 of Oxnard & O'Higgins, 2009). Indeed, in an ontogenetic study of the cranial vault, for instance, where sutures may become invisible in adults and therefore cannot be used as a "boundary", semilandmarks close to the sutures may end up on different bones in different stages/individuals.

Semilandmarks are a fantastic tool, which I am happy to use when needed, but they have their own limitations, which one should be aware of.

Cheers
Andrea

--
Dr. Andrea Cardini
Researcher, Dipartimento di Scienze Chimiche e Geologiche, Università di Modena e Reggio Emilia, Via Campi, 103 - 41125 Modena - Italy
tel. 0039 059 2058472
Adjunct Associate Professor, School of Anatomy, Physiology and Human Biology, The University of Western Australia, 35 Stirling Highway, Crawley WA 6009, Australia
E-mail address: alcard...@gmail.com, andrea.card...@unimore.it
WEBPAGE: https://sites.google.com/site/alcardini/home/main
FREE Yellow BOOK on Geometric Morphometrics: http://www.italian-journal-of-mammalogy.it/public/journals/3/issue_241_complete_100.pdf
ESTIMATE YOUR GLOBAL FOOTPRINT: http://www.footprintnetwork.org/en/index.php/GFN/page/calculators/
Re: [MORPHMET] Re: number of landmarks and sample size
Hello,

I'm an archaeologist who works on artifacts in North America. There are not many of us who use LGM, but even we can't seem to agree on how many LMs are appropriate. Because I use discriminant function analysis (DFA) as the workhorse for discriminating groups of artifacts, I worry about the misuse of that technique.

One thing I've read (e.g., Qiao et al. 2009) regarding DFA is that too many variables (LMs) can affect its discriminatory power through data piling or the related phenomenon of overfitting. I have seen this in my practice but have not tested it rigorously. By reducing the number of LMs, I can sometimes get better discrimination between groups. The number of artifacts (specimens) is not a problem: I'm about to embark on a regional analysis using thousands.

Does anyone who understands this phenomenon better than I do care to comment?

Thanks,
Dave Thulman
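Dave's question connects to Philipp's point elsewhere in this thread that DFA needs an invertible covariance matrix and that Procrustes shape coordinates are never of full rank. The sketch below only works out the arithmetic ceiling on how many variables (e.g., leading PCs) could be passed to a DFA before the pooled within-group covariance matrix becomes singular. The function names and the example counts (30 landmarks, 3 groups, 40 or 2000 artifacts) are hypothetical, and the bound ignores the extra degrees of freedom lost when semilandmarks are slid. It is a ceiling, not a recommendation: as Philipp notes, reliable results need many more specimens than variables.

```python
def shape_space_dims(n_landmarks: int, dim: int) -> int:
    """Dimensions of Procrustes shape space for fixed landmarks in 2D or 3D."""
    if dim == 2:
        lost = 4  # 2 translation + 1 rotation + 1 scale
    elif dim == 3:
        lost = 7  # 3 translation + 3 rotation + 1 scale
    else:
        raise ValueError("dim must be 2 or 3")
    return n_landmarks * dim - lost


def max_invertible_variables(n_specimens: int, n_groups: int,
                             n_landmarks: int, dim: int) -> int:
    """Upper bound on the rank of the pooled within-group covariance matrix.

    The pooled matrix has at most n_specimens - n_groups degrees of freedom,
    and shape data cannot span more than the shape-space dimensions.
    """
    return min(n_specimens - n_groups, shape_space_dims(n_landmarks, dim))


# Hypothetical example: 30 two-dimensional landmarks, 3 artifact groups.
small_sample = max_invertible_variables(40, 3, 30, 2)    # limited by specimens
large_sample = max_invertible_variables(2000, 3, 30, 2)  # limited by shape space
print(small_sample, large_sample)
```

With thousands of specimens the binding limit is the shape-space dimensionality rather than sample size, which is why the raw coordinates still need variable reduction (or a regularized covariance matrix) before DFA.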
Re: [MORPHMET] Re: number of landmarks and sample size
I think a few topics get mixed up here.

Of course, a sample can be too small to be representative (as in Andrea's example), and one should think carefully about the measures to take. It is also clear that an increase in sample size reduces standard errors of statistical estimates, including that of a covariance matrix and its eigenvalues. But, as mentioned by Dean, the standard errors of the eigenvalues are of secondary interest in PCA.

If one has a clear expectation about the signal in the data - and if one does not aim at new discoveries - a few specific measurements may suffice, perhaps even a few distance measurements. But effective exploratory analyses have always been a major strength of geometric morphometrics, enabled by the powerful visualization methods together with the large number of measured variables.

Andrea, I am actually curious what worries you if one "collects between 2700 and 10 400 homologous landmarks from each rib" (whatever the term "homologous" is supposed to mean here)?

Compared to many other disciplines in contemporary biology and biomedicine, a few thousand variables are not particularly many. Consider, for instance, 2D and 3D image analysis, FEA, and all the "omics", with millions and billions of variables. In my opinion, the challenge with these "big data" is not statistical power in testing a signal, but finding the signal - the low-dimensional subspace of interest - in the first place. But this applies to 50 or 100 variables as well, not only to thousands or millions. If no prior expectation about this signal existed (which the mere presence of so many variables usually implies), no hypothesis test should be performed at all. Ignoring this rule is one of the main reasons why so many GWAS and voxel-based morphometry studies fail to be replicable.
Best wishes, Philipp -- MORPHMET may be accessed via its webpage at http://www.morphometrics.org --- You received this message because you are subscribed to the Google Groups "MORPHMET" group. To unsubscribe from this group and stop receiving emails from it, send an email to morphmet+unsubscr...@morphometrics.org.
RE: [MORPHMET] Re: number of landmarks and sample size
Just to comment. While it is worthwhile to investigate these issues, in my experience sample sizes are limited not because investigators are unwilling to measure more specimens, but because there are no additional specimens to include in the analysis, especially for studies based on natural populations or historical collections.

M
RE: [MORPHMET] Re: number of landmarks and sample size
Will,

I'm not quite sure what over-parameterizing means in the case of PCA, as it is simply a rigid rotation of the dataspace and does not provide parameters for statistical inference. As for the distribution of eigenvalues, this of course is based on the underlying covariance matrix for the traits, which in turn will be affected by sample size.

However, when traits become even mildly correlated (as is certainly the case for landmark coordinates), the distribution of eigenvalues of the covariance matrix becomes much better behaved. Specifically, the eigenvalues associated with low and high PC axes are less extreme than is observed with uncorrelated traits. That implies greater stability in their estimation, as the covariance matrix is further from singular (see the large statistical literature on the condition of a covariance matrix and subsequent estimation issues for ill-behaved covariance matrices).

Best,
Dean

Dr. Dean C. Adams
Professor
Department of Ecology, Evolution, and Organismal Biology
Department of Statistics
Iowa State University
www.public.iastate.edu/~dcadams/
phone: 515-294-3834
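Dean's description of PCA as a rigid rotation can be checked directly in the smallest possible case. The sketch below, using made-up two-variable data and hypothetical function names, rotates centered points onto their principal axes with the closed-form angle for a 2 x 2 covariance matrix. It then lets one confirm that distances between specimens and total variance are unchanged while the covariance between the two score axes vanishes. It illustrates the rotation property only and says nothing about how well the eigenvalues are estimated.

```python
import math


def covariance_terms(points):
    """Return (var_x, var_y, cov_xy) using the n - 1 denominator."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, syy, sxy


def pca_rotate_2d(points):
    """Center the data and rotate it onto its two principal axes."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx, syy, sxy = covariance_terms(points)
    # Angle of the first principal axis: tan(2 * theta) = 2 * sxy / (sxx - syy)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * (x - mx) + s * (y - my), -s * (x - mx) + c * (y - my))
            for x, y in points]


def pairwise_distances(points):
    """Euclidean distances between all pairs of points, in a fixed order."""
    return [math.hypot(p[0] - q[0], p[1] - q[1])
            for i, p in enumerate(points) for q in points[i + 1:]]


# Made-up data: five specimens measured on two correlated variables.
points = [(1.0, 2.0), (2.0, 2.9), (3.0, 4.2), (4.0, 4.8), (5.0, 6.1)]
scores = pca_rotate_2d(points)

var_x, var_y, _ = covariance_terms(points)
var_pc1, var_pc2, cov_pcs = covariance_terms(scores)
print(var_x + var_y, var_pc1 + var_pc2, cov_pcs)
```

Because the scores are only a rotated copy of the centered data, nothing is fitted in the sense of a regression model; sample size enters through the covariance terms that determine the rotation angle.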
Re: [MORPHMET] Re: number of landmarks and sample size
I'm currently working on a paper that deals with the problem of over-parameterizing PCA in morphometrics. The recommendations that I'm making in the paper are that you should try to have at least 3 times as many samples as variables. That means that if you have 10 2D landmarks, you should have at least 60 specimens that you measure. Based on simulations, if you have fewer than 3 specimens per variable, you quickly start getting eigenvalues for a PCA that are very different from known true eigenvalues. I did a literature survey and about a quarter of morphometrics studies in the last decade haven't met that standard. A good way to test if you have enough samples is to do a jackknife analysis. If you cut out about 10% of your observations and still get the same eigenvalues, then your results are probably stable. I hope this helps. - Will On Wed, May 31, 2017 at 1:31 PM, mitte...@univie.ac.at < mitte...@univie.ac.at> wrote: > Adding more (semi)landmarks inevitably increases the spatial resolution > and thus allows one to capture finer anatomical details - whether relevant > to the biological question or not. This can be advantageous for the > reconstruction of shapes, especially when producing 3D morphs by warping > dense surface representations. Basic developmental or evolutionary trends, > group structures, etc., often are visible in an ordination analysis with a > smaller set of relevant landmarks; finer anatomical resolution not > necessarily affects these patterns. However, adding more landmarks cannot > reduce or even remove any signals that were found with less landmarks, but > it can make ordination analyses and the interpretation distances and angles > in shape space more challenging. 
>
> An excess of variables (landmarks) over specimens does NOT pose problems to statistical methods such as the computation of mean shapes and Procrustes distances, PCA, PLS, and the multivariate regression of shape coordinates on some independent variable (shape regression). These methods are based on averages or regressions computed for each variable separately, or on the decomposition of a covariance matrix.
>
> Other techniques, including Mahalanobis distance, DFA, CVA, CCA, and relative eigenanalysis, require the inversion of a full-rank covariance matrix, which implies an excess of specimens over variables. The same applies to many multivariate parametric test statistics, such as Hotelling's T², Wilks' Lambda, etc. But shape coordinates are NEVER of full rank and thus can never be subjected to any of these methods without prior variable reduction. In fact, reliable results can only be obtained if there are many times more specimens than variables, which usually requires variable reduction by PCA, PLS or other techniques, or the regularization of covariance matrices (which is more common in the bioinformatics community).
>
> For these reasons, I do not see any disadvantage of measuring a large number of landmarks, except for a waste of time perhaps. If life time is an issue, one can optimize landmark schemes as suggested by Jim or Aki.
>
> Best,
>
> Philipp
>
> --
> MORPHMET may be accessed via its webpage at http://www.morphometrics.org
> ---
> You received this message because you are subscribed to the Google Groups "MORPHMET" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to morphmet+unsubscr...@morphometrics.org.
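As a rough illustration of the two suggestions in Will's message above (the 3:1 specimens-to-variables rule of thumb and the jackknife check on eigenvalue stability), here is a minimal Python sketch. It is not Will's code: the function names, the default of 200 replicates, and the use of the coefficient of variation as a stability summary are all assumptions made for illustration. It also assumes `X` is an already-aligned specimens-by-variables matrix (e.g. flattened Procrustes coordinates); no superimposition is performed here.

```python
import numpy as np

def min_specimens(n_landmarks, n_dims, ratio=3):
    """Rule of thumb from the thread: `ratio` specimens per coordinate variable."""
    return ratio * n_landmarks * n_dims

def pca_eigenvalues(X):
    """Eigenvalues of the sample covariance of X (rows = specimens), descending."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centred data, scaled by n - 1,
    # equal the non-zero eigenvalues of the covariance matrix.
    s = np.linalg.svd(Xc, compute_uv=False)
    return s**2 / (X.shape[0] - 1)

def jackknife_eigenvalues(X, drop_frac=0.10, n_reps=200, n_keep=3, seed=0):
    """Drop ~drop_frac of the specimens at random, recompute the leading
    eigenvalues, and repeat. Returns the full-sample eigenvalues and the
    coefficient of variation of each eigenvalue across replicates
    (a hypothetical stability summary, not a published criterion)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_drop = max(1, int(round(drop_frac * n)))
    full = pca_eigenvalues(X)[:n_keep]
    reps = np.empty((n_reps, n_keep))
    for r in range(n_reps):
        keep = rng.choice(n, size=n - n_drop, replace=False)
        reps[r] = pca_eigenvalues(X[keep])[:n_keep]
    cv = reps.std(axis=0) / reps.mean(axis=0)
    return full, cv

if __name__ == "__main__":
    print(min_specimens(10, 2))  # 10 2D landmarks -> 20 variables
    toy = np.random.default_rng(1).normal(size=(60, 20))  # simulated toy data only
    full, cv = jackknife_eigenvalues(toy)
    print(full, cv)
```

Small coefficients of variation for the leading eigenvalues would suggest, in the spirit of Will's comment, that the results are probably stable; what counts as "small" is left to the analyst.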
Re: [MORPHMET] Re: number of landmarks and sample size
Dear All,

I'd like to add a few comments on sampling (of landmarks, but also of specimens). I hope that some of the other subscribers, who know much more than I do about morphometrics, will refine and correct my points.

First, a very short note on my two papers. They make a very simple point: if one is landmarking just one side of a structure with object symmetry simply to speed up data collection, then mirror-reconstructing the missing side will make a nicer visualization and will probably yield shape data that are closer to those obtained by landmarking both sides. The difference may be tiny, and I said "probably" because I am reporting results of empirical studies: out of 11-12 datasets, all but one had shape distances closer to those of the full bilateral landmark data after mirror-reconstructing the missing side. This did not work in one dataset, which happened to have a very large amount of fluctuating asymmetry. To what extent these results are generalizable I can't say, but everyone can plan a small preliminary analysis to check it in her/his own data. I fully agree with Aki that, if time, money etc. are not a constraint, it is better to measure both sides even when one is not interested in asymmetry. That is in fact also true for structures with matching symmetry.

In terms of the choice of landmarks, I wish to stress (once more!) that quality may be more important than quantity: first one should think carefully about what she/he wants to measure, which will relate to the specific question being asked, and only then decide where and how many landmarks to use. There are at least two wonderful papers that I have suggested several times on this issue:

Oxnard & O'Higgins, 2009, Biological Theory 4(1), 84–97.
Klingenberg, 2008, Evol Biol 35:186–190.

Then, especially for semilandmarks, I guess that, as Aki (and others before) suggested, one can see what a good compromise is between information and the number of points (maybe considering also, but not principally, the visualization).
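To make the mirror-reconstruction idea mentioned earlier in this message concrete, here is a minimal 2D sketch in Python. It is only an illustration under stated assumptions, not the procedure used in the papers: the midline is estimated as the least-squares line through the midline landmarks, the one-sided landmarks are reflected across it, and all function names are hypothetical. A 3D version would fit a midplane instead, and in practice the mirrored points still need to be relabelled as the corresponding landmarks of the other side.

```python
import numpy as np

def fit_midline(mid_pts):
    """Least-squares line through the midline landmarks.
    Returns a point on the line (the centroid) and a unit direction vector."""
    centroid = mid_pts.mean(axis=0)
    # The first right singular vector is the direction of greatest spread.
    _, _, vt = np.linalg.svd(mid_pts - centroid)
    return centroid, vt[0]

def reflect_across_line(pts, point, direction):
    """Reflect 2D points across the line through `point` along `direction`."""
    d = direction / np.linalg.norm(direction)
    rel = pts - point
    along = np.outer(rel @ d, d)      # component parallel to the midline
    return point + 2 * along - rel    # flips the perpendicular component

def mirror_reconstruct(mid_pts, side_pts):
    """Stack midline, measured-side and mirrored landmarks into one configuration."""
    point, direction = fit_midline(mid_pts)
    mirrored = reflect_across_line(side_pts, point, direction)
    return np.vstack([mid_pts, side_pts, mirrored])

if __name__ == "__main__":
    # Toy configuration: midline on the y-axis, two landmarks on the right side.
    mid = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])
    right = np.array([[1.0, 0.5], [2.0, 1.5]])
    print(mirror_reconstruct(mid, right))
```

As a preliminary check of the kind suggested above, one could compare Procrustes distances from such reconstructed configurations with those from fully bilateral data on a small subset of specimens.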
For sample size, one should consider whether differences are presumably big (and a small sample might be OK...ish) or small (as in most microevolutionary studies, which generally require large N). I believe that Rohlf, already in the early days of geometric morphometrics, had written software for exploring statistical power in shape data (TPSPower), but I am not sure whether he kept developing it. In any case, power and sensitivity (to sampling) analyses are certainly available in R. With small differences, although resampling methods may allow one to perform tests even with tiny samples, power will be low and estimates (say, mean size and shape, variance and covariance etc.) will likely be inaccurate. Unfortunately, the most interesting taxa are often rare populations (or fossils) for which specimens are difficult to find.

A couple of people told me that there's an important paper coming out soon on sampling error in geometric morphometrics and that it might suggest that one really needs huge samples. I would not be surprised, and I suspect that the few empirical studies we did (a couple of papers in Zoomorphology) were overoptimistic, despite already suggesting (more or less) that one might need several dozen specimens even when differences are relatively large and the number of landmarks is not particularly large. Again, they were empirical studies and one cannot say how generalizable they are. Anyway, I look forward to this new paper and hope it will be announced in MORPHMET, just as I look forward to Aki's paper.

Cheers

Andrea

On 29/05/17 18:35, Aki Watanabe wrote:

Dear Lea,

Unfortunately, there isn't (yet) a magic mathematical formula to determine whether you've sampled enough landmarks, but there are some exploratory approaches you can take to see if your landmark sampling is converging to the "true" shape variation. One simple thing you can do is sample as many landmarks as you can on a representative sampling of specimens, then create a PC morphospace.
Then, subsample the landmarks (e.g., 75%, 50%, 25% of the landmarks) and see if the PC morphospaces from these subsampled datasets mirror the distribution of shapes of the full dataset. If the morphospaces begin deviating from the PC morphospace of the full dataset, then you have a visual cue that the subsampling is not adequately characterizing the shape variation of your specimens. In terms of a formal statistical test for landmark sampling, I suppose one can test for correlation between the subsampled and full datasets, but because the subsampled and full datasets will be auto-correlated to some extent, the null would have to reflect this. Alternatively, I have a script that automatically subsamples the landmarks of a given dataset and creates a plot to see how well the subsampled datasets converge to the point distribution of the full dataset. If you are interested, I would be happy to describe the technique in more detail a