[R-lang] Re: Correlation from extreme values

João Veríssimo Tue, 22 Mar 2016 14:55:49 -0700

Daniel,

I think this is an instance of the well-known problem of "restriction of
range", which can reduce correlation coefficients (a popular correction
is known as Thorndike Case 2 and is implemented in rangeCorrection() in
the "psych" package).
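
To make the correction concrete, here is a minimal sketch of the
Thorndike Case 2 formula in base R, checked against simulated data. The
helper name thorndike_case2(), the cutoff x > 0, and rho = 0.8 are
assumptions chosen for illustration; the sketch does not call the
"psych" function, whose arguments should be checked in its own
documentation.

```r
# Thorndike Case 2: correct a correlation observed in a range-restricted
# sample, given the SD of the selection variable in the unrestricted and
# the restricted group. (Hypothetical helper, written for illustration.)
thorndike_case2 <- function(r, sd_unrestricted, sd_restricted) {
  k <- sd_unrestricted / sd_restricted
  (r * k) / sqrt(1 - r^2 + r^2 * k^2)
}

set.seed(1)
n   <- 1e5
rho <- 0.8                                  # assumed population correlation
x   <- rnorm(n)
y   <- rho * x + sqrt(1 - rho^2) * rnorm(n)

keep <- x > 0                               # direct selection on x (upper half)
r_full       <- cor(x, y)
r_restricted <- cor(x[keep], y[keep])
r_corrected  <- thorndike_case2(r_restricted, sd(x), sd(x[keep]))

round(c(full = r_full, restricted = r_restricted, corrected = r_corrected), 3)
```

The correction needs the unrestricted SD of x, which in a real study
has to come from somewhere outside the restricted sample (for example,
corpus-wide norms).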


But isn't this different from the effect that Vanhove pointed out?
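
The two situations can be put side by side in a small simulation. This
is only a sketch under assumed settings (rho = 0.8, outer 10% tails
versus the central 20% of x; all object names are made up here). It
also reports the regression slope in each subsample.

```r
set.seed(42)
n   <- 1e4
rho <- 0.8                                  # assumed population correlation
x   <- rnorm(n)
y   <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Illustrative cutoffs: outer 10% tails vs. central 20% of x
q        <- quantile(x, c(0.10, 0.40, 0.60, 0.90))
extremes <- x <= q[1] | x >= q[4]
middle   <- x >= q[2] & x <= q[3]

# Correlation and regression slope of y on x within a subsample
summarise <- function(keep) {
  c(r     = cor(x[keep], y[keep]),
    slope = unname(coef(lm(y[keep] ~ x[keep]))[2]))
}

res <- rbind(full     = summarise(rep(TRUE, n)),
             extremes = summarise(extremes),
             middle   = summarise(middle))
round(res, 2)
```

Under these assumptions the correlation is lowest in the middle-only
subsample and highest in the extremes-only subsample, while the slope
stays close to its full-sample value (it is estimated much less
precisely in the middle-only subsample).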

As an aside, it appears that restricted ranges can also *increase*
correlations in some circumstances (Wiseman, 1967; Zimmerman and
Williams, 2000).

João


Wiseman, S. (1967). The effect of restriction of range upon correlation
coefficients. British Journal of Educational Psychology, 37, 248–252.
http://dx.doi.org/10.1111/j.2044-8279.1967.tb01933.x

Zimmerman, D. W., & Williams, R. H. (2000). Restriction of range and
correlation in outlier-prone distributions. Applied Psychological
Measurement, 24, 267–280. http://dx.doi.org/10.1177/01466210022031741


On Tue, 2016-03-22 at 13:39 -0700, Daniel Ezra Johnson wrote:
> I noticed in the earlier discussion an implication that by sampling
> towards the extremes of a distribution, the correlation is somehow
> inflated. I wouldn't say that -- it's larger, but it's probably
> usually more accurate to say that by sampling only in the middle of a
> distribution, the correlation is deflated. At least that's what I
> learned from File:Correlation_range_dependence.svg ?
> 
> On Tue, Mar 22, 2016 at 12:46 PM, Ambridge, Ben
> <ben.ambri...@liverpool.ac.uk> wrote:
>         Hi everyone
>         
>         Thanks for all your help! I probably should have given a
>         bit more information at the start: These particular data are
>         made up but represent the proportion of past vs present-tense
>         forms in (X axis) an input corpus and (Y axis) an elicited
>         production study. We deliberately chose highly past-biased or
>         present-biased verbs (though I’m now wondering if this was a
>         mistake).
>         
>         The impression I’m getting from the replies so far is that
>         this is probably OK - and that the effect we get is meaningful
>         - though its size will have been inflated by our choice of
>         highly biased verbs.
>         
>         We are now planning a follow-up study and would be interested
>         to hear people’s views on whether we should again choose
>         highly biased verbs (to make it more likely that we will
>         detect an effect if there is one to be found) or sample evenly
>         across the distribution (to get a more realistic estimate of
>         the size of any effect - not that this is particularly
>         important for our purposes).
>         
>         Thanks
>         Ben
>         
>         
>         On 21 Mar 2016, at 22:09, Scott Jackson <scott...@gmail.com>
>         wrote:
>         
>         Hi Ben,
>         
>         Just another perspective on what everyone else is saying: just
>         think carefully about why your distributions look the way they
>         do.  That is, each variable is extremely bimodal.  Why?  If
>         it's an artifact of how you sampled, then Jan's point is
>         especially relevant.  If there is some underlying categorical
>         structure, like classes of "high" and "low" values for both
>         variables, then you probably just want something like a
>         contingency table, and the variation around each class "cloud"
>         could just be a type of measurement error.  So maybe a simple
>         correlation isn't necessarily going to give you the "wrong"
>         inference, but it's missing something potentially very
>         important about how your data is structured.
>         
>         -scott
>         
>         
>         On Mon, Mar 21, 2016 at 3:58 PM, VANHOVE Jan
>         <jan.vanh...@unifr.ch> wrote:
>         Hello Ben,
>         
>         
>         An additional thing you may want to consider is that
>         extreme-group sampling can lead to systematically larger
>         effect sizes than ordinary sampling:
>         
>         
>         ---
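>         library(MASS)  # for mvrnorm()
>         
>         # Simulate one bivariate-normal sample (true rho = 0.8) and return
>         # the correlation in the full sample and in the k lowest plus k
>         # highest x-values only.
>         sim.cors <- function(n = 100, k = 10) {
>           dat <- mvrnorm(n, mu = c(0, 0),
>                          Sigma = matrix(c(1, 0.8, 0.8, 1), 2))
>           r <- rank(dat[, 1])
>           extreme <- r <= k | r > n - k
>           c(cor.tot = cor(dat)[1, 2],
>             cor.reduced = cor(dat[extreme, ])[1, 2])
>         }
>         
>         # 1000 replications: row 1 = full range, row 2 = extreme cases
>         cors <- replicate(1e3, sim.cors())
>         
>         plot(cors[1, ], cors[2, ],
>              xlim = c(0.6, 1), ylim = c(0.6, 1),
>              xlab = "correlation full range",
>              ylab = "correlation extreme cases")
>         # Points above the identity line: extreme-case r exceeds full-range r
>         abline(a = 0, b = 1, lwd = 2, col = "blue")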
>         
>         ---
>         
>         
>         I.e., for the same bivariate relationship, you'll end up with
>         larger correlation coefficients if you've sampled at the
>         extremes (largest and lowest values) than if you were
>         indiscriminate in your choice of x-values. This may be
>         important if you want to use the correlation coefficient to
>         communicate the strength of the bivariate relationship.
>         
>         
>         Regression coefficients don't show this effect (from an
>         earlier simulation I ran):
>         
>         
>         
>         
>         Cheers,
>         
>         Jan
>         From: ling-r-lang-l-boun...@mailman.ucsd.edu
>         <ling-r-lang-l-boun...@mailman.ucsd.edu> on behalf of
>         Ambridge, Ben <ben.ambri...@liverpool.ac.uk>
>         Sent: 21 March 2016 15:13
>         To: ling-r-lang-l@mailman.ucsd.edu
>         Cc: Pine, Julian; Tatsumi, Tomoko
>         Subject: [R-lang] Correlation from extreme values
>         
>         Hi
>         
>         Is anyone able to advise on the following - probably very naive
>         - question? Is it problematic to run/interpret a correlation
>         that is driven by extreme values like this?
>         
>         Xaxis=c(1,3,3,5,6,8,85,87,90,92,97,98)
>         Yaxis=c(2,10,8,4,12,2,85,80,94,82,80,87)
>         Data = data.frame(Xaxis,Yaxis)
>         plot(Data$Xaxis, Data$Yaxis)
>         
>         And if it is problematic, would lme models (where these
>         datapoints represent - for example - the by-item means) also
>         inherit the same problems?
>         
>         Thanks
>         Ben
>         
>         <Rplot.jpeg>
>         
>         
>         
> 
>
