
I think this is an instance of the well-known problem of "restriction of
range", which can reduce correlation coefficients (a popular correction
is known as Thorndike Case 2 and is implemented in rangeCorrection() in
the "psych" package).
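
For reference, here is a minimal base-R sketch of the Case 2 formula as
I understand it. It is not the psych implementation: the function name
thorndike2() and its argument names are made up for illustration, and it
assumes direct selection on x with a known unrestricted SD.

```r
# Thorndike Case 2 correction (hypothetical helper, not from "psych"):
#   r_c = r * u / sqrt(1 - r^2 + r^2 * u^2),  with u = SD_unrestricted / SD_restricted
thorndike2 <- function(r, sd.unrestricted, sd.restricted) {
  u <- sd.unrestricted / sd.restricted
  r * u / sqrt(1 - r^2 + r^2 * u^2)
}

# Observed r = 0.5 in a sample whose SD on x is half the population SD
thorndike2(0.5, sd.unrestricted = 1, sd.restricted = 0.5)  # about 0.76
```

With equal SDs (u = 1) the function returns r unchanged.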

But isn't this different from the effect that Vanhove pointed out?
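
One way to look at the two situations side by side is a small
simulation. This is only a sketch under assumptions of my own: a
bivariate normal relationship with r = 0.8 (as in Jan's example below),
"middle" meaning the central 20% of x, and "extreme" meaning the lowest
and highest 10% of x. The seed and the helper name summarise.subset()
are arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 1e4
x <- rnorm(n)
y <- 0.8 * x + sqrt(1 - 0.8^2) * rnorm(n)  # population correlation 0.8

q <- quantile(x, c(0.1, 0.4, 0.6, 0.9))
middle  <- x >= q[2] & x <= q[3]  # central 20% of x (restricted range)
extreme <- x <= q[1] | x >= q[4]  # lowest and highest 10% of x

# Correlation and regression slope of y on x within a subset of cases
summarise.subset <- function(keep) {
  c(r = cor(x[keep], y[keep]),
    slope = unname(coef(lm(y[keep] ~ x[keep]))[2]))
}

res <- rbind(full    = summarise.subset(rep(TRUE, n)),
             middle  = summarise.subset(middle),
             extreme = summarise.subset(extreme))
round(res, 2)
```

Under these assumptions the correlation should come out smallest in the
middle-only subset and largest in the extreme-only subset, with the
full sample in between, while the slope stays near 0.8 in all three
(more noisily in the middle-only subset).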

As an aside, it appears that restricted ranges can also *increase*
correlations in some circumstances (Wiseman, 1967; Zimmerman and
Williams, 2000).


Wiseman, S. (1967). The effect of restriction of range upon correlation
coefficients. British Journal of Educational Psychology, 37, 248–252.

Zimmerman, D. W., & Williams, R. H. (2000). Restriction of range and
correlation in outlier-prone distributions. Applied Psychological
Measurement, 24, 267–280. http://dx.doi.org/10.1177/01466210022031741

> I noticed in the earlier discussion an implication that by sampling
> towards the extremes of a distribution, the correlation is somehow
> inflated. I wouldn't say that -- it's larger, but it's probably
> usually more accurate to say that by sampling only in the middle of a
> distribution, the correlation is deflated. At least that's what I
> learned from File:Correlation_range_dependence.svg ?
>         Hi everyone
>         Thanks for all your help! I probably should have given a
>         bit more information at the start: These particular data are
>         made up but represent the proportion of past vs present-tense
>         forms in an input corpus (X axis) and in an elicited
>         production study (Y axis). We deliberately chose highly
>         past-biased or present-biased verbs (though I’m now wondering
>         if this was a mistake).
>         The impression I’m getting from the replies so far is that
>         this is probably OK - and that the effect we get is meaningful
>         - though its size will have been inflated by our choice of
>         highly biased verbs.
>         We are now planning a follow-up study and would be interested
>         to hear people’s views on whether we should again choose
>         highly biased verbs (to make it more likely that we will
>         detect an effect if there is one to be found) or sample evenly
>         across the distribution (to get a more realistic estimate of
>         the size of any effect - not that this is particularly
>         important for our purposes).
>         Thanks
>         Ben
>         Hi Ben,
>         Just another perspective on what everyone else is saying: just
>         think carefully about why your distributions look the way
>         they do.  That is, each variable is extremely bimodal.  Why?  If
>         it's an artifact of how you sampled, then Jan's point is
>         especially relevant.  If there is some underlying categorical
>         structure, like classes of "high" and "low" values for both
>         variables, then you probably just want something like a
>         contingency table, and the variation around each class "cloud"
>         could just be a type of measurement error.  So maybe a simple
>         correlation isn't necessarily going to give you the "wrong"
>         inference, but it's missing something potentially very
>         important about how your data is structured.
>         -scott
>         Hello Ben,
>         An additional thing you may want to consider is that
>         extreme-group sampling can lead to systematically larger
>         effect sizes than ordinary sampling:
>         ---
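>         library(MASS)  # provides mvrnorm()
>         # Draw n points from a bivariate normal (true r = 0.8) and
>         # return the correlation in the full sample and in the
>         # extreme cases only (10 lowest and 10 highest x-values;
>         # the rank cut-offs assume n = 100).
>         sim.cors <- function(n = 100) {
>           dat <- mvrnorm(n, mu = c(0, 0),
>                          Sigma = matrix(c(1, 0.8, 0.8, 1), 2))
>           cor.tot <- cor(dat)[1, 2]
>           extreme <- rank(dat[, 1]) <= 10 | rank(dat[, 1]) >= 91
>           cor.reduced <- cor(dat[extreme, ])[1, 2]
>           return(list(cor.tot, cor.reduced))
>         }
>         # Repeat the simulation 1000 times and compare the estimates
>         cors <- replicate(1e3, sim.cors())
>         plot(unlist(cors[1, ]), unlist(cors[2, ]),
>              xlim = c(0.6, 1), ylim = c(0.6, 1),
>              xlab = "correlation full range",
>              ylab = "correlation extreme cases")
>         # Points above the blue identity line: extreme-case r is larger
>         abline(a = 0, b = 1, lwd = 2, col = "blue")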
>         ---
>         I.e., for the same bivariate relationship, you'll end up with
>         larger correlation coefficients if you've sampled at the
>         extremes (largest and lowest values) than if you were
>         indiscriminate in your choice of x-values. This may be
>         important if you want to use the correlation coefficient to
>         communicate the strength of the bivariate relationship.
>         Regression coefficients don't show this effect (from an
>         earlier simulation I ran):
>         Cheers,
>         Jan
>         Hi
>         Is anyone able to advise on the following - probably very naive
>         - question? Is it problematic to run/interpret a correlation
>         that is driven by extreme values like this?
>         Xaxis=c(1,3,3,5,6,8,85,87,90,92,97,98)
>         Yaxis=c(2,10,8,4,12,2,85,80,94,82,80,87)
>         Data = data.frame(Xaxis,Yaxis)
>         plot(Data$Xaxis, Data$Yaxis)
>         And if it is problematic, would lme models (where these
>         datapoints represent - for example - the by-item means) also
>         inherit the same problems?
>         Thanks
>         Ben
>         <Rplot.jpeg>

