bkamen <[EMAIL PROTECTED]> wrote:
> This practical question arose between myself and a colleague at work.
> It concerns whether we can use correlation analysis if one of the
> variables is non-continuous or "categorical."
> ...
> I would appreciate clarification of any such constraints on the
> practical use of correlation analysis.
I'm assuming that by "discrete" you mean that X is constrained to take
certain discrete values (e.g., 0, 1, 2, 3), but that the values
themselves are either (a) valid interval-level data (e.g., a value of
2 is truly 1 unit more than a value of 1) or (b) ordinal (e.g., a
value of 2 necessarily means a greater level of the trait than a value
of 1).
My understanding of how this works is as follows:
If X is "discrete" in this way and Y is an ordinary continuous measure,
then the correlation r(X, Y) will tend to be constrained in magnitude.
For example, it might be difficult or impossible to obtain a
correlation of 1 or -1 in this situation. In that sense, a test of
r(X, Y) would seem to be conservative in principle, which I believe
your message alluded to.
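To make the attenuation concrete, here is a rough simulation sketch in
Python (NumPy assumed). The function name, the latent correlation of
0.8, and the cutpoints are all made up for illustration; nothing here
comes from the original question. It draws bivariate normal data, chops
X into three ordered categories, and compares the two Pearson
correlations.

```python
import numpy as np

def simulate_attenuation(rho=0.8, n=100_000, cutpoints=(-0.5, 0.5), seed=0):
    """Compare r(latent X, Y) with r(discretized X, Y).

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    # Draw n pairs from a standard bivariate normal with correlation rho.
    latent_x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    # Discretize X into ordered categories 0, 1, 2 using the cutpoints.
    x_discrete = np.digitize(latent_x, cutpoints)
    r_continuous = np.corrcoef(latent_x, y)[0, 1]
    r_discrete = np.corrcoef(x_discrete, y)[0, 1]
    return r_continuous, r_discrete

if __name__ == "__main__":
    r_cont, r_disc = simulate_attenuation()
    print(f"r with continuous X:  {r_cont:.3f}")
    print(f"r with discretized X: {r_disc:.3f}")
```

Under these assumptions the correlation computed from the discretized
X should come out smaller in magnitude than the one computed from the
underlying continuous X, and the gap should grow as the number of
categories shrinks.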
The question appears to be how this situation affects formal
significance testing. I do not know for sure, but it would not
surprise me if the usual significance test assumes that both measures
are truly continuous. If one is "discrete" (in the sense above), I
cannot say how much the test is affected.
However, there is an alternative. You could consider the biserial
correlation (if X has only two values) or the polyserial correlation
(if X has more than two values). As a practical matter, the polyserial
correlation is mainly useful when the number of distinct X values is
fairly low, say fewer than 8 or 10; with many X values, the impact of
the "discreteness" is likely to be small anyway.
The biserial/polyserial correlation estimates the correlation you
would have obtained if both X and Y were truly continuous. The main
assumption is that, fundamentally, the traits associated with X and Y
are normally distributed (and jointly distributed as bivariate
normal). However, the biserial/polyserial correlation allows that one
of the variables has been "discretized."
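For the two-value case, a minimal sketch of the standard textbook
biserial estimate is given below in Python. It rescales the ordinary
(point-biserial) Pearson correlation by sqrt(p*q)/h, where p is the
proportion of cases with X = 1, q = 1 - p, and h is the standard normal
density at the cutpoint implied by p. The function name is
hypothetical, X is assumed to be coded 0/1, and the result is only
meaningful if the bivariate normal assumption above holds. The
polyserial case needs an iterative (e.g., maximum likelihood) fit and
is not attempted here.

```python
import numpy as np
from statistics import NormalDist

def biserial_correlation(x_binary, y):
    """Biserial correlation between a 0/1 variable and a continuous one.

    Assumes x_binary is a dichotomized version of a latent normal trait.
    """
    x = np.asarray(x_binary)
    y = np.asarray(y, dtype=float)
    p = float(np.mean(x == 1))  # proportion of cases in the upper group
    q = 1.0 - p
    if p <= 0.0 or p >= 1.0:
        raise ValueError("x_binary must contain both 0s and 1s")
    # Ordinary Pearson r with a 0/1 variable is the point-biserial r.
    r_pb = np.corrcoef(x, y)[0, 1]
    # Height of the standard normal density at the cut that splits the
    # distribution into proportions p and q.
    std_normal = NormalDist()
    h = std_normal.pdf(std_normal.inv_cdf(p))
    return r_pb * np.sqrt(p * q) / h
```

Because sqrt(p*q)/h is always greater than 1, the biserial estimate is
larger in magnitude than the point-biserial value. With small samples
or extreme splits it can even exceed 1, which is one reason to treat it
as an estimate rather than an ordinary correlation.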
You might want to consider this option. For more information, you
could check Kendall & Stuart, "The Advanced Theory of Statistics."
Hope this helps.
John Uebersax
[EMAIL PROTECTED]