Hi everyone

Thanks for all your help! I probably should have given a bit more 
information at the start: these particular data are made up, but represent the 
proportion of past- vs. present-tense forms in (X axis) an input corpus and (Y 
axis) an elicited production study. We deliberately chose highly past-biased or 
present-biased verbs (though I'm now wondering if this was a mistake).

The impression I’m getting from the replies so far is that this is probably OK 
- and that the effect we get is meaningful - though its size will have been 
inflated by our choice of highly biased verbs.

We are now planning a follow-up study and would be interested to hear people’s 
views on whether we should again choose highly biased verbs (to make it more 
likely that we will detect an effect if there is one to be found) or sample 
evenly across the distribution (to get a more realistic estimate of the size of 
any effect - not that this is particularly important for our purposes).
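On that design question, one way to compare the two strategies is to simulate both: sample verbs either from the extremes or at random from the same population, and count how often each sample yields a significant correlation. This is only a sketch - the population size, sample size, and true correlation of 0.4 below are placeholders, not estimates from the study:

```r
library(MASS)  # for mvrnorm

set.seed(1)

# One simulated study: draw a verb population, then sample n verbs either
# from the extremes of x or uniformly at random, and test the correlation.
detect <- function(pop_n = 200, n = 24, rho = 0.4) {
  pop <- mvrnorm(pop_n, mu = c(0, 0),
                 Sigma = matrix(c(1, rho, rho, 1), 2))
  rk <- rank(pop[, 1])
  extreme <- pop[rk <= n / 2 | rk > pop_n - n / 2, ]  # most biased verbs
  even    <- pop[sample(pop_n, n), ]                  # random sample
  c(extreme = cor.test(extreme[, 1], extreme[, 2])$p.value < 0.05,
    even    = cor.test(even[, 1],    even[, 2])$p.value    < 0.05)
}

res <- replicate(2000, detect())
rowMeans(res)  # proportion of 'significant' results under each strategy
```

With these placeholder numbers, the extreme-group design detects the effect more often, at the cost of the inflated correlation estimate Jan describes below.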

Thanks
Ben


On 21 Mar 2016, at 22:09, Scott Jackson <scott...@gmail.com> wrote:

Hi Ben,

Just another perspective on what everyone else is saying: think carefully 
about why your distributions look the way they do.  That is, each variable is 
extremely bimodal.  Why?  If it's an artifact of how you sampled, then Jan's 
point is especially relevant.  If there is some underlying categorical 
structure, like classes of "high" and "low" values for both variables, then you 
probably just want something like a contingency table, and the variation around 
each class "cloud" could just be a type of measurement error.  So a 
simple correlation isn't necessarily going to give you the "wrong" inference, 
but it's missing something potentially very important about how your data are 
structured.
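A minimal sketch of that contingency-table idea, using the example values Ben posted further down the thread (the 50-point cutoff is arbitrary, chosen only because the two clusters sit well on either side of it):

```r
Xaxis <- c(1, 3, 3, 5, 6, 8, 85, 87, 90, 92, 97, 98)
Yaxis <- c(2, 10, 8, 4, 12, 2, 85, 80, 94, 82, 80, 87)

# Dichotomize each variable into its apparent "low"/"high" class
xclass <- ifelse(Xaxis > 50, "high", "low")
yclass <- ifelse(Yaxis > 50, "high", "low")

tab <- table(xclass, yclass)
tab
fisher.test(tab)  # exact test suits the small cell counts
```

Here every "low"-x item is also "low"-y and vice versa, so the table captures the class structure directly, and the scatter within each cloud is set aside as within-class noise.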

-scott


On Mon, Mar 21, 2016 at 3:58 PM, VANHOVE Jan <jan.vanh...@unifr.ch> wrote:
Hello Ben,


An additional thing you may want to consider is that extreme-group sampling can 
lead to systematically larger effect sizes than ordinary sampling:


---
library(MASS)

# Correlation over the full sample vs. over the 10 lowest and 10 highest
# x-values only; the true correlation is 0.8.
sim.cors <- function(n = 100) {
  dat <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.8, 0.8, 1), 2))
  cor.tot <- cor(dat)[1, 2]
  cor.reduced <- cor(dat[rank(dat[, 1]) <= 10 | rank(dat[, 1]) >= 91, ])[1, 2]
  list(cor.tot, cor.reduced)
}

cors <- replicate(1e3, sim.cors())

plot(unlist(cors[1, ]), unlist(cors[2, ]),
     xlim = c(0.6, 1), ylim = c(0.6, 1),
     xlab = "correlation full range",
     ylab = "correlation extreme cases")
abline(a = 0, b = 1, lwd = 2, col = "blue")

---


That is, for the same underlying bivariate relationship, you'll end up with larger 
correlation coefficients if you've sampled at the extremes (the largest and 
smallest x-values) than if you were indiscriminate in your choice of x-values. 
This may be important if you want to use the correlation coefficient to 
communicate the strength of the bivariate relationship.


Regression coefficients don't show this effect (from an earlier simulation I 
ran):
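(Jan's original plot isn't reproduced here; the following is only a reconstruction under the same assumptions as sim.cors() above, where the true slope is 0.8. It compares OLS slopes from the full sample and from the extreme cases only:)

```r
library(MASS)  # for mvrnorm

set.seed(1)

# Fit a regression on the full sample and on the 10 lowest + 10 highest
# x-values only; same generating model as sim.cors() (true slope = 0.8).
sim.slopes <- function(n = 100) {
  dat <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.8, 0.8, 1), 2))
  extreme <- dat[rank(dat[, 1]) <= 10 | rank(dat[, 1]) >= 91, ]
  c(full    = coef(lm(dat[, 2] ~ dat[, 1]))[2],
    reduced = coef(lm(extreme[, 2] ~ extreme[, 1]))[2])
}

slopes <- replicate(1e3, sim.slopes())
rowMeans(slopes)  # both averages sit close to the true slope of 0.8
```

Unlike the correlation coefficient, the slope estimate is not inflated by selecting on x, because selecting on the predictor changes the variance of x but not the conditional relationship of y given x.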




Cheers,

Jan
From: ling-r-lang-l-boun...@mailman.ucsd.edu 
<ling-r-lang-l-boun...@mailman.ucsd.edu> on behalf of Ambridge, Ben 
<ben.ambri...@liverpool.ac.uk>
Sent: 21 March 2016 15:13
To: ling-r-lang-l@mailman.ucsd.edu
Cc: Pine, Julian; Tatsumi, Tomoko
Subject: [R-lang] Correlation from extreme values
 
Hi

Is anyone able to advise on the following - probably very naive - question? Is 
it problematic to run/interpret a correlation that is driven by extreme values 
like this?

Xaxis <- c(1, 3, 3, 5, 6, 8, 85, 87, 90, 92, 97, 98)
Yaxis <- c(2, 10, 8, 4, 12, 2, 85, 80, 94, 82, 80, 87)
Data <- data.frame(Xaxis, Yaxis)
plot(Data$Xaxis, Data$Yaxis)

And if it is problematic, would lme models (where these datapoints represent - 
for example - the by-item means) also inherit the same problems?

Thanks
Ben

<Rplot.jpeg>


