Daniel, I think this is an instance of the well-known problem of "restriction of range", which can reduce correlation coefficients (a popular correction is known as Thorndike Case 2 and is implemented in rangeCorrection() in the "psych" package).
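
For what it's worth, here is a minimal base-R sketch of that Case 2 correction. The helper name thorndike_case2, its argument names, the seed and the cut-off of 0.5 are my own assumptions for illustration; it is hand-written from the textbook formula and is not the code of rangeCorrection() in "psych". It simulates data with the same population correlation (0.8) as Jan's simulation quoted below, keeps only the middle of the x range, and then corrects the restricted correlation using the ratio of the full to the restricted SD of x.

```r
# Hypothetical helper (not the psych implementation): Thorndike Case 2
# correction for direct range restriction on x.
#   r_restricted : correlation observed in the restricted sample
#   sd_full      : SD of x in the unrestricted sample/population
#   sd_restricted: SD of x in the restricted sample
thorndike_case2 <- function(r_restricted, sd_full, sd_restricted) {
  u <- sd_full / sd_restricted
  r_restricted * u / sqrt(1 + r_restricted^2 * (u^2 - 1))
}

set.seed(1)                         # arbitrary seed, for reproducibility only
n   <- 1e5
rho <- 0.8                          # population correlation (assumed)
x   <- rnorm(n)
y   <- rho * x + sqrt(1 - rho^2) * rnorm(n)

middle <- abs(x) < 0.5              # keep only the middle of the x distribution

r_full       <- cor(x, y)
r_restricted <- cor(x[middle], y[middle])
r_corrected  <- thorndike_case2(r_restricted, sd(x), sd(x[middle]))

round(c(full = r_full, restricted = r_restricted, corrected = r_corrected), 2)
```

Under these assumptions (linear relation, homoscedastic errors, selection directly on x) the restricted correlation should come out clearly lower than the full-range one, and the corrected value should land close to the full-range one.
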
But isn't this different from the effect that Vanhove pointed out? As an aside, it appears that restricted ranges can also *increase* correlations in some circumstances (Wiseman, 1967; Zimmerman and Williams, 2000).

João

Wiseman, S. (1967). The effect of restriction of range upon correlation coefficients. British Journal of Educational Psychology, 37, 248–252. http://dx.doi.org/10.1111/j.2044-8279.1967.tb01933.x

Zimmerman, D. W., & Williams, R. H. (2000). Restriction of range and correlation in outlier-prone distributions. Applied Psychological Measurement, 24, 267–280. http://dx.doi.org/10.1177/01466210022031741

On Tue, 2016-03-22 at 13:39 -0700, Daniel Ezra Johnson wrote:
> I noticed in the earlier discussion an implication that by sampling
> towards the extremes of a distribution, the correlation is somehow
> inflated. I wouldn't say that -- it's larger, but it's probably
> usually more accurate to say that by sampling only in the middle of a
> distribution, the correlation is deflated. At least that's what I
> learned from File:Correlation_range_dependence.svg ?
>
> On Tue, Mar 22, 2016 at 12:46 PM, Ambridge, Ben
> <ben.ambri...@liverpool.ac.uk> wrote:
> Hi everyone
>
> Thanks for all your help! I probably should have given a
> bit more information at the start: These particular data are
> made up but represent the proportion of past vs present-tense
> forms in (X axis) an input corpus and (Y axis) an elicited
> production study. We deliberately chose highly past-biased or
> present-biased verbs (though I’m now wondering if this was a
> mistake).
>
> The impression I’m getting from the replies so far is that
> this is probably OK - and that the effect we get is meaningful
> - though its size will have been inflated by our choice of
> highly biased verbs.
>
> We are now planning a follow-up study and would be interested
> to hear people’s views on whether we should again choose
> highly biased verbs (to make it more likely that we will
> detect an effect if there is one to be found) or sample evenly
> across the distribution (to get a more realistic estimate of
> the size of any effect - not that this is particularly
> important for our purposes).
>
> Thanks
> Ben
>
>
> On 21 Mar 2016, at 22:09, Scott Jackson <scott...@gmail.com>
> wrote:
>
> Hi Ben,
>
> Just another perspective on what everyone else is saying: just
> think carefully about why your distributions look like they
> are. That is, each variable is extremely bimodal. Why? If
> it's an artifact of how you sampled, then Jan's point is
> especially relevant. If there is some underlying categorical
> structure, like classes of "high" and "low" values for both
> variables, then you probably just want something like a
> contingency table, and the variation around each class "cloud"
> could just be a type of measurement error. So maybe a simple
> correlation isn't necessarily going to give you the "wrong"
> inference, but it's missing something potentially very
> important about how your data is structured.
>
> -scott
>
>
> On Mon, Mar 21, 2016 at 3:58 PM, VANHOVE Jan
> <jan.vanh...@unifr.ch> wrote:
> Hello Ben,
>
> An additional thing you may want to consider is that
> extreme-group sampling can lead to systematically larger
> effect sizes than ordinary sampling:
>
> ---
> library(MASS)
>
> # One simulated sample of size n from a bivariate normal with rho = 0.8
> sim.cors <- function(n = 100) {
>   dat <- mvrnorm(n, Sigma = matrix(c(1, 0.8, 0.8, 1), 2), mu = c(0, 0))
>   # Correlation using all n cases
>   cor.tot <- cor(dat)[1, 2]
>   # Correlation using only the 10 lowest and 10 highest x-values
>   # (the rank cut-offs 10 and 91 assume n = 100)
>   cor.reduced <- cor(dat[rank(dat[,1]) <= 10 | rank(dat[,1]) >= 91, ])[1, 2]
>   return(list(cor.tot, cor.reduced))
> }
>
> # 1000 simulated samples; row 1 = full range, row 2 = extreme cases
> cors <- replicate(1e3, sim.cors())
>
> plot(unlist(cors[1, ]), unlist(cors[2, ]),
>      xlim = c(0.6, 1), ylim = c(0.6, 1),
>      xlab = "correlation full range",
>      ylab = "correlation extreme cases")
> abline(a = 0, b = 1, lwd = 2, col = "blue")
>
> ---
>
> I.e., for the same bivariate relationship, you'll end up with
> larger correlation coefficients if you've sampled at the
> extremes (largest and lowest values) than if you were
> indiscriminate in your choice of x-values. This may be
> important if you want to use the correlation coefficient to
> communicate the strength of the bivariate relationship.
>
> Regression coefficients don't show this effect (from an
> earlier simulation I ran):
>
>
> Cheers,
>
> Jan
>
> From: ling-r-lang-l-boun...@mailman.ucsd.edu
> <ling-r-lang-l-boun...@mailman.ucsd.edu> on behalf of
> Ambridge, Ben <ben.ambri...@liverpool.ac.uk>
> Sent: 21 March 2016 15:13
> To: ling-r-lang-l@mailman.ucsd.edu
> Cc: Pine, Julian; Tatsumi, Tomoko
> Subject: [R-lang] Correlation from extreme values
>
> Hi
>
> Is anyone able to advise on the following - probably very naive
> - question? Is it problematic to run/interpret a correlation
> that is driven by extreme values like this?
>
> Xaxis=c(1,3,3,5,6,8,85,87,90,92,97,98)
> Yaxis=c(2,10,8,4,12,2,85,80,94,82,80,87)
> Data = data.frame(Xaxis,Yaxis)
> plot(Data$Xaxis, Data$Yaxis)
>
> And if it is problematic, would lme models (where these
> datapoints represent - for example - the by-item means) also
> inherit the same problems?
>
> Thanks
> Ben
>
> <Rplot.jpeg>