Hi everyone,

Thanks for all your help! I probably should have given a bit more information at the start: these particular data are made up, but they represent the proportion of past- vs present-tense forms in an input corpus (X axis) and in an elicited production study (Y axis). We deliberately chose highly past-biased or present-biased verbs (though I’m now wondering if this was a mistake).

The impression I’m getting from the replies so far is that this is probably OK, and that the effect we get is meaningful, though its size will have been inflated by our choice of highly biased verbs. We are now planning a follow-up study and would be interested to hear people’s views on whether we should again choose highly biased verbs (to make it more likely that we will detect an effect if there is one to be found) or sample evenly across the distribution (to get a more realistic estimate of the size of any effect, not that this is particularly important for our purposes).

Thanks
Ben

On 21 Mar 2016, at 22:09, Scott Jackson <scott...@gmail.com> wrote:

Hi Ben,

Just another perspective on what everyone else is saying: think carefully about why your distributions look the way they do. That is, each variable is extremely bimodal. Why? If it's an artifact of how you sampled, then Jan's point is especially relevant. If there is some underlying categorical structure, like classes of "high" and "low" values for both variables, then you probably just want something like a contingency table, and the variation around each class "cloud" could just be a type of measurement error.

So a simple correlation isn't necessarily going to give you the "wrong" inference, but it may be missing something potentially very important about how your data is structured.

-scott

On Mon, Mar 21, 2016 at 3:58 PM, VANHOVE Jan <jan.vanh...@unifr.ch> wrote:

Hello Ben,

An additional thing you may want to consider is that extreme-group sampling can lead to systematically larger effect sizes than ordinary sampling:

---
library(MASS)

# Draw n bivariate-normal cases (true correlation 0.8) and compare the
# correlation in the full sample with the one in the extreme cases only.
sim.cors <- function(n = 100) {
  dat <- mvrnorm(n, Sigma = matrix(c(1, 0.8, 0.8, 1), 2), mu = c(0, 0))
  cor.tot <- cor(dat)[1, 2]
  # Keep the 10 smallest and 10 largest x-values (cut-offs assume n = 100)
  cor.reduced <- cor(dat[rank(dat[,1]) <= 10 | rank(dat[,1]) >= 91, ])[1, 2]
  return(list(cor.tot, cor.reduced))
}

# 1000 simulated samples; row 1 = full-range r, row 2 = extreme-case r
cors <- replicate(1e3, sim.cors())
plot(unlist(cors[1, ]), unlist(cors[2, ]),
     xlim = c(0.6, 1), ylim = c(0.6, 1),
     xlab = "correlation full range", ylab = "correlation extreme cases")
abline(a = 0, b = 1, lwd = 2, col = "blue")
---

I.e., for the same bivariate relationship, you'll end up with larger correlation coefficients if you've sampled at the extremes (largest and smallest values) than if you were indiscriminate in your choice of x-values. This may be important if you want to use the correlation coefficient to communicate the strength of the bivariate relationship. Regression coefficients don't show this effect (from an earlier simulation I ran):

Cheers,
Jan

From: ling-r-lang-l-boun...@mailman.ucsd.edu <ling-r-lang-l-boun...@mailman.ucsd.edu> on behalf of Ambridge, Ben <ben.ambri...@liverpool.ac.uk>
Sent: 21 March 2016 15:13
To: ling-r-lang-l@mailman.ucsd.edu
Cc: Pine, Julian; Tatsumi, Tomoko
Subject: [R-lang] Correlation from extreme values

Hi

Is anyone able to advise on the following (probably very naive) question? Is it problematic to run/interpret a correlation that is driven by extreme values like this?

Xaxis=c(1,3,3,5,6,8,85,87,90,92,97,98)
Yaxis=c(2,10,8,4,12,2,85,80,94,82,80,87)
Data = data.frame(Xaxis,Yaxis)
plot(Data$Xaxis, Data$Yaxis)

And if it is problematic, would lme models (where these datapoints represent, for example, the by-item means) also inherit the same problems?

Thanks
Ben

<Rplot.jpeg>
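
On the question of whether a regression-type model inherits the inflation, and on Jan's remark that regression coefficients don't show it: the following is a minimal sketch, not Jan's original simulation (which is not reproduced in the thread). It assumes a bivariate-normal relationship with correlation 0.8 and unit variances, so the true slope of y on x is also 0.8, and it uses base R rather than MASS. The function name `sim.once`, its arguments `rho` and `k`, and the seed are illustrative choices only.

```r
# Sketch: correlation vs. regression slope under full-range and
# extreme-group sampling. Base R only; all names here are illustrative.
set.seed(2016)  # arbitrary seed, for reproducibility

sim.once <- function(n = 100, rho = 0.8, k = 10) {
  # Bivariate normal with unit variances and correlation rho,
  # so the true slope of y on x equals rho
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

  # Extreme-group sample: the k smallest and k largest x-values
  r <- rank(x)
  extreme <- r <= k | r > n - k

  c(cor.full      = cor(x, y),
    cor.extreme   = cor(x[extreme], y[extreme]),
    slope.full    = unname(coef(lm(y ~ x))[2]),
    slope.extreme = unname(coef(lm(y[extreme] ~ x[extreme]))[2]))
}

sims <- replicate(1000, sim.once())  # 4 x 1000 matrix, one column per run
avg  <- rowMeans(sims)               # average of each statistic over runs
print(round(avg, 2))
```

Under these assumptions, the average correlation in the extreme cases should come out clearly above the full-range one, while both average slopes should stay close to 0.8. The reason is that selecting on x stretches the variance of x, which the correlation depends on, without changing the conditional mean of y given x, which is what the slope estimates. This only speaks to the fixed-effect slope in a simple linear model; whether standardized effect sizes or random-effect estimates in a mixed model behave the same way would need checking separately.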