bkamen <[EMAIL PROTECTED]> wrote:
 
> This practical question arose between myself and a colleague at work.
> It concerns whether we can use correlation analysis if one of the
> variables is non-continuous or "categorical."
> ...
> I would appreciate clarification of any such constraints on the
> practical use of correlation analysis.
 
I'm assuming that by "discrete" you mean that X is constrained to take
certain discrete values (e.g., 0, 1, 2, 3), but that the values
themselves are either (a) valid interval-level data (e.g., a value of
2 is truly 1 unit more than a value of 1) or (b) ordinal (e.g., a
value of 2 necessarily means a greater level of the trait than a value
of 1).
 
My understanding of how this works is as follows:
 
If X is "discrete" in this way and Y is an ordinary continuous
measure, then the correlation r(X,Y) will tend to be constrained in
magnitude.  In particular, it cannot reach exactly 1 or -1, because
that would require Y to be an exact linear function of X and
therefore discrete as well.  In that sense, a test of r(X, Y) would
seem to be conservative in principle, which I believe your message
alluded to.
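
A small simulation can illustrate this attenuation.  It is only a
sketch under assumptions that are not in the question: a latent
bivariate normal pair with correlation 0.6, cut points placed at
equal-probability quantiles, and integer scores 0, 1, 2, ... for the
categories.  The helper name discretize is made up for the example.

```python
import numpy as np

def discretize(x, n_levels):
    """Cut x into n_levels ordered categories (scored 0..n_levels-1),
    using equal-probability quantiles as cut points (an assumption)."""
    cuts = np.quantile(x, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(x, cuts)

rng = np.random.default_rng(0)
rho = 0.6          # illustrative latent correlation, not from the post
n = 200_000        # large sample so sampling noise is small

# Latent continuous X and observed continuous Y, bivariate normal.
cov = [[1.0, rho], [rho, 1.0]]
x_latent, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Pearson r with the undiscretized X, for reference.
r_latent = np.corrcoef(x_latent, y)[0, 1]

# Pearson r after X has been reduced to 2, 4, or 10 categories.
r_by_levels = {
    k: np.corrcoef(discretize(x_latent, k), y)[0, 1] for k in (2, 4, 10)
}

print(f"latent X:    r = {r_latent:.3f}")
for k, r in r_by_levels.items():
    print(f"{k:2d} categories: r = {r:.3f}")
```

Under these assumptions the correlation shrinks as the number of
categories drops.  For a median split of a normal variable the
theoretical shrinkage factor is sqrt(2/pi), about 0.80, and with 10
categories the loss is already small.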
 
The question appears to be how this situation affects formal
significance testing.  I do not know for sure, but it would not
surprise me if the usual significance test assumes that both measures
are truly continuous, and I don't know how the test behaves when one
of them is "discrete" (in the sense above).
 
However, there is an alternative.  You could consider the biserial
correlation (if X has only two values) or the polyserial correlation
(if X can have more than two values).  As a practical matter, the
polyserial correlation is mainly worthwhile when the number of
distinct X values is relatively low, say fewer than 8 or 10; if there
are many X values, the impact of the "discreteness" is likely to be
small.
 
The biserial/polyserial correlation estimates the correlation you
would have obtained if both X and Y were truly continuous.  The main
assumption is that, fundamentally, the traits associated with X and Y
are normally distributed (and jointly distributed as bivariate
normal).  However, the biserial/polyserial correlation allows that one
of the variables has been "discretized."
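
For the two-category case, the biserial estimate can be sketched with
the classical formula r_b = r_pb * sqrt(p*q) / h, where r_pb is the
ordinary (point-biserial) Pearson correlation with X coded 0/1, p is
the proportion of 1s, q = 1 - p, and h is the standard normal density
at the threshold that cuts off an upper tail of size p.  The function
name biserial and the simulated data (latent correlation 0.6,
threshold 0.5) are assumptions made for illustration only.

```python
import numpy as np
from statistics import NormalDist

def biserial(x_binary, y):
    """Biserial correlation from a 0/1 variable and a continuous one.

    Assumes the 0/1 variable is a dichotomized normal trait and that
    both categories actually occur (0 < p < 1).
    """
    x = np.asarray(x_binary, dtype=float)
    y = np.asarray(y, dtype=float)
    p = x.mean()                      # proportion coded 1
    q = 1.0 - p
    r_pb = np.corrcoef(x, y)[0, 1]    # point-biserial correlation
    z = NormalDist().inv_cdf(q)       # latent threshold: P(Z > z) = p
    h = NormalDist().pdf(z)           # normal density at the threshold
    return r_pb * np.sqrt(p * q) / h

# Illustrative data: dichotomize a latent normal X at 0.5.
rng = np.random.default_rng(1)
rho = 0.6
cov = [[1.0, rho], [rho, 1.0]]
x_latent, y = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T
x_binary = (x_latent > 0.5).astype(int)

r_pb = np.corrcoef(x_binary, y)[0, 1]
r_b = biserial(x_binary, y)
print(f"point-biserial r = {r_pb:.3f}, biserial r = {r_b:.3f}")
```

When the bivariate normal assumption holds, as it does by construction
here, the biserial value should land close to the latent correlation,
while the point-biserial value is noticeably smaller.  The polyserial
case has no comparably simple closed form and is usually fitted by
maximum likelihood, which is not shown.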
 
You might want to consider this option.  For more information, you
could check Kendall & Stuart, "The Advanced Theory of Statistics."
 
Hope this helps.
 
John Uebersax
[EMAIL PROTECTED]
 
