(This post is formatted for viewing with a fixed-width font, such
as Courier.)
This post is dedicated to the memory of Daniel DeLury (1907 -
1993) of the Department of Statistics at the University of
Toronto. Dr. DeLury's influence on me is reflected throughout
the post, but most directly in the last appendix.
This post evaluates seven definitions of the concept of 'rela-
tionship between variables', including important definitions pro-
posed in earlier posts by Jan de Leeuw, Herman Rubin, and Robert
Frick. It also discusses whether a teacher needs to discuss uni-
variate distributions or mathematics near the beginning of an in-
troductory statistics course for students who are not majoring in
statistics.
For simplicity, I assume throughout this post that all variables
are numeric -- that is, their values are numbers. However, the
discussions and conclusions below easily generalize to situations
with non-numeric variables if the values of the variables are
(suitably) recoded to be numeric, and then one thinks in terms of
the recoded values.
A DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" BASED ON
EXPECTED VALUE
Responding to two informal definitions proposed by Herman Rubin
(in sci.stat.edu on 98/8/3), I proposed (on 99/5/16):
>> DEFINITION: There is a *relationship* between the vari-
>> ables x and y if for at least one value x' of x
>>
>> E(y|x') ~= E(y) [1]
>> where
>>
>> E(*) is the expected-value operator
>>
>> E(y|x') is the expected value of y given that x has
>> the value x' and
>>
>> ~= stands for "is not equal to".
>>
>> Defining the concept of 'relationships between variables' in
>> terms of conditional expected value leads to a simpler defini-
>> tion than the definitions Herman proposes ... because the
>> expected-value approach replaces the complicated concept of
>> 'distribution' with the simpler concept of 'expected value'
>> [1999a].
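(To make [1] concrete, here is a minimal simulated sketch in
Python; the data and variable names are hypothetical, chosen
only for illustration:

   import numpy as np

   rng = np.random.default_rng(0)

   # Simulated data in which y depends on a two-valued predictor:
   x = rng.integers(0, 2, size=100_000)
   y = 5.0 + 2.0 * x + rng.normal(size=x.size)

   print("E(y)     ~", y.mean())          # about 6.0
   print("E(y|x=0) ~", y[x == 0].mean())  # about 5.0
   print("E(y|x=1) ~", y[x == 1].mean())  # about 7.0

The conditional mean at x' = 0 differs from the overall mean, so
[1] declares that a relationship exists between x and y.)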
JAN DE LEEUW'S REMARKS ABOUT [1]
Quoting [1], Jan de Leeuw writes (on 99/5/16)
> It seems to me this is too narrow. Suppose, for example, that
> E(y|x) = E(y) for all x, but V(y|x) ~= V(y) for some x, where
> V is variance (for instance V(y|x) = \sigma^2 x^2). Seems like
> a relationship to me.
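(Jan's example is easy to simulate. The following minimal
Python sketch -- hypothetical, for illustration only -- generates
data with E(y|x) = E(y) = 0 for all x but V(y|x) = x^2:

   import numpy as np

   rng = np.random.default_rng(0)

   x = rng.uniform(1.0, 3.0, size=200_000)
   y = x * rng.normal(size=x.size)    # E(y|x) = 0, V(y|x) = x**2

   lo, hi = x < 1.5, x > 2.5          # two ranges of x-values
   print(y[lo].mean(), y[hi].mean())  # both near 0: no mean change
   print(y[lo].var(),  y[hi].var())   # clearly different variances

Definition [1] sees nothing here, yet the variance of y plainly
varies with x.)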
Two points of view are available to handle Jan's example:
1. We can adopt Jan's point of view and say that the example IS
an example of a relationship between the variables x and y.
2. We can adopt the point of view suggested by [1] and say that
the example IS NOT an example of a relationship between the
variables x and y. (Of course, [1] clearly implies that the
example is an example of a relationship between the variables
x and V(y).)
We can adopt either of these points of view because they both ap-
pear to work satisfactorily. I discuss which point of view is
preferred below, but first it is helpful to consider some pre-
liminary material.
(In an introductory statistics course for less advanced students
a teacher might reasonably decide not to present ANY formal defi-
nition of the concept of 'relationship between variables'. In
this case the teacher would not present either of the above
points of view. Instead, the teacher might choose to character-
ize the concept of 'relationship between variables' informally in
terms of one variable "depending" on the other, or in terms of
the values of one variable "varying somewhat in step" with the
values of the other. Although these characterizations are not
mathematically explicit, I believe they are reasonable approaches
for less advanced students IF the characterizations are developed
in terms of sufficient practical examples.)
>
> There is also a problem with symmetry. Can we reverse the role
> of x and y in these definitions ? It seems so.
Appendix C discusses the symmetry of definitions of the concept
of 'relationship between variables'.
JAN DE LEEUW'S DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES"
>
> This leads to a somewhat more straightforward definition: there
> is a relationship between random variables x and y if and only
> if they are not independent
Jan defines the concept of 'relationship between variables' in
terms of the concept of 'independence' of variables. This leads
one to seek a definition of that concept, which Jan provides (in
terms of a definition of "dependence" or "relationship") as fol-
lows:
> (or, if you like, p(y|x) ~= p(y) for some x).
For clarity, let me make Jan's definition more explicit:
DEFINITION: There is a *relationship* between the random
variables x and y if and only if
p(y'|x') ~= p(y') [2]
for some x' and some y' where
p(y') = the unconditional probability that the vari-
able y has the value y' (or equals the value
of the probability density of y at y') and
p(y'|x') = the probability that the variable y has the
value y' given that the variable x has the
value x' (or equals the value of the probabil-
ity density of y at y' given that x is at x').
I hope that [2] properly characterizes the spirit of Jan's defi-
nition. However, [2] differs from Jan's definition in two sig-
nificant ways:
- Primes appear on x and y in [2] to reinforce the idea that the
definition is referring to (any) SPECIFIC values of the vari-
ables x and y. That is, the variables are being used in the
existential sense, as opposed to the universal sense. This is
also implied by the phrase "for some x' and some y'" in the
definition.
- The phrase "and some y'" is added to [2]. This gives y the
same existential freedom in the definition as x, which helps to
show the broadness of the definition.
Note that [2] has the same structure as [1], with the expected-
value operator replaced by the probability (density) operator.
Definition [2] refers to the concept of a "random" variable, but
definition [1] does not refer to this concept. Appendix A dis-
cusses the concept of 'random variable'.
Definition [2] is broader than [1] because [2] is satisfied by a
difference at any point across the two distributions (conditional
and unconditional) of the values of y, while [1] is satisfied
only if the means of the two distributions are different.
I further compare [1] and [2] below.
HERMAN RUBIN'S REMARKS ABOUT INTRODUCTORY STATISTICS
For brevity, I use the phrase "non-statistics-majors" in the fol-
lowing discussion to refer to students who are not majoring in
statistics or mathematics.
Herman Rubin begins his post by addressing the issue that started
the debate about the definition of "relationship between vari-
ables". He writes (on 99/5/17)
> Donald Macnaughton ... wrote:
>
M> Quoting a 98/7/23 post of mine, Herman Rubin writes (on
M> 98/8/3)
>>
R>> Donald Macnaughton ... wrote:
>>>
M>>> In a July 17 post I recommend that teachers emphasize the
M>>> concept of a relationship between variables and I recommend
M>>> a de-emphasis of less important topics such as univariate
M>>> distributions ...
>>>
R>> As such, I agree about the point on univariate distributions.
R>> One does not need a catalog of the standard ones, nor [does
R>> one need to] be adept at calculating them.
R>>
R>> HOWEVER, on consideration of the actual problems, they are an
R>> essential tool.
>>
M> I fully agree that univariate distributions are an essential
M> tool in actual statistical problems -- most statistical analy-
M> ses depend directly on concepts of univariate distributions.
M>
M> However, as Herman may agree, the ubiquity of univariate dis-
M> tributions in statistical analyses does NOT speak to whether a
M> teacher should discuss univariate distributions near the be-
M> ginning of an introductory statistics course when the course
M> is aimed at students who are NOT majoring in statistics.
>
> On the contrary, it is these who need to understand, not the
> formulas for the standard univariate distributions, but what
> distributions, including univariate, are in general, and also
> some of their basic properties.
Herman and I disagree here: He recommends that a teacher discuss
univariate distributions near the beginning of an introductory
course for non-statistics-majors. In contrast, I recommend that
a teacher begin such a course by discussing relationships between
variables, with no discussion (near the beginning) of univariate
distributions.
>
> If the person in the other field cannot move the problem from
> "biological space" to "statistics space", the problem is not
> ready for the use of statistics
Again, I respectfully disagree. By forcing our students (and
clients) to clamber from their own "space" into our "statistics
space" (that is, by forcing them to speak our complicated mathe-
matical language), I believe we confuse them and frighten many of
them away.
I believe that the mathematical language is unnecessary for non-
statistics-majors. Instead of struggling to explain the mathe-
matics, we can focus on the practical use of statistics in em-
pirical research. We can reasonably tell students that the main
practical use of statistics is to assist researchers to study re-
lationships between variables.
As noted, a relationship exists between two variables if, when
one variable "goes up and down" in entities (or in the entities'
environment), the other variable "goes up and down somewhat in
step". We can illustrate this phenomenon on a scatterplot with
no discussion of any underlying mathematics. We can then gener-
alize these ideas in various useful directions, again without the
(direct) need of mathematics.
In particular, we can show students that most empirical research
projects (or logical units of research projects) can be usefully
viewed as studying the relationship between a single response
variable and one or more predictor variables. The response vari-
able is the variable that we wish to learn how to predict or con-
trol. The predictor variable(s) is (are) the other variable(s)
that we observe or manipulate in a research project to help us
learn how to predict or control the values of the response vari-
able.
Many readers will agree that the statistical procedures that are
commonly used in empirical research include
- the t-test
- analysis of variance
- regression analysis
- response surface analysis
- categorical analysis
- time series analysis
- survey analysis
- survival analysis
- Bayesian analysis
- neural networks
- discriminant analysis
- nonparametric analysis
- logistic regression analysis
- probit analysis
- data mining methods
- univariate methods
- and others.
Examination of these procedures suggests that they can all be
reasonably and usefully viewed (for the most part) as optimal
methods for studying the relationship between a single response
variable and zero or more predictor variables under various cir-
cumstances.
The preceding four paragraphs suggest that the easy-to-understand
concept of 'relationship between variables' is a central unifying
concept of both the field of statistics and empirical research.
Thus it is reasonable to emphasize this concept in an introduc-
tory statistics course.
The main ideas are surprisingly simple: In a typical research
project using statistical methods the researcher (e.g., a medical
researcher) would like the field of statistics to answer three
key questions, which are
1. How can we discover and demonstrate reliable evidence that a
relationship exists (if one does) between the response vari-
able and predictor variable(s) of interest?
2. If we find good evidence that a relationship exists, how can
we best use our knowledge of the relationship to predict or
possibly control the values of the response variable in new
entities from the population on the basis of the values of the
predictor variable(s)?
3. If we make such predictions or attempt such control, how accu-
rate will the prediction or control be?
These questions make no reference (at least on the surface) to
mathematics. We can show non-statistics-majors that much of the
field of statistics is about answering these questions in empiri-
cal research under various circumstances. Discussing these ideas
(using sufficient practical examples) gives students a broad
overview of the vital role of statistics in empirical research.
This is more likely to impress non-statistics-majors than if we
discuss the mathematics.
Under this approach I do not suggest that we hide from students
the fact that statistical procedures are based on mathematical
principles. Instead, I recommend that teachers make students
well aware of the EXISTENCE of the important underlying mathemat-
ics. But we can defer the details until a later course.
Similarly, it is important to inform students about the underly-
ing assumptions of statistical analysis -- we cannot have confi-
dence in the conclusions of a statistical analysis unless we know
that the underlying assumptions of the analysis are adequately
satisfied by the situation and data under study. I recommend
that introductory statistics teachers impress students with this
important point. But, as with the mathematics, we can defer the
details of the assumptions until later.
I further discuss the above points and the teaching approach I
recommend in two essays (1998a, 1999b) and in appendix H of this
post. I discuss empirical research projects that do not study
relationships between variables in two essays (1997a; 1999b, app.
C). Moore (1997a, sec. 4) and the American Statistical
Association (2002) also recommend de-emphasizing mathematics in
statistics education.
STATISTICAL PROCEDURES AS RELIGIOUS MANTRAS
Herman continues ...
> [If the person in the other field cannot move the problem from
> "biological space" to "statistics space", the problem is not
> ready for the use of statistics] except as "religious" mantras.
I like the metaphor of a religious mantra to characterize certain
traditional practices in statistics. One area of statistics in
which I think statisticians and empirical researchers sometimes
use a mantra is in the important area of hypothesis testing. I
have written about hypothesis testing earlier (1997b, sec. 9;
1998b, sec. 5) and the ideas appear from time to time below. I
plan to present some further ideas in a later post.
HERMAN RUBIN'S FIRST DEFINITION OF "RELATIONSHIP BETWEEN
VARIABLES"
Herman next changes his focus to the main topic of the present
post -- the definition of the concept of 'relationship between
variables'. Quoting [1] above, he writes
>
> I agree with de Leeuw that this definition is far too narrow.
>
> The appropriate version of this [is]
>
> DEFINITION: There is a *stochastic relationship*
> between the random variables X and Y if for at
> least one value x' the conditional distribution [3]
> of Y given X=x' is different from the uncondi-
> tional distribution of Y.
Definition [3] is equivalent to Jan de Leeuw's definition [2] in
the sense that [3] will declare that a relationship exists be-
tween two "compatible" variables if and only if [2] also declares
that a relationship exists. Appendix B discusses the equivalence
of [2] and [3].
(Two variables are "compatible" if they both reflect properties
of the same type of entity [or one may reflect a property of the
entities' environment], and if the available values of the vari-
ables are reasonably linked within entities and within time.
Clearly, we can reasonably study a relationship between variables
only if the variables are compatible.)
Despite the equivalence of [2] and [3], definition [3] differs
from definition [2] in the sense that [3] is effectively refer-
ring to the entire probability (density) function of the y-values
for a given x-value while [2] is effectively referring to a point
on the probability (density) function of the y-values for the
given x-value. Definition [2] is thus more specific, and perhaps
slightly clearer, because it reduces the necessary and sufficient
condition for a relationship to a reasonable minimum.
>
> An alternative version is that X and Y are dependent random
> variables. But the operational meaning of this is the above
> formulation; objects are independent if knowing one provides no
> information about the distribution of the other. I would sug-
> gest that this be used as the definition of independence, and
> it goes over immediately to many objects.
For discussions about empirical research, I agree with Herman's
approach of defining the concept of 'independence' in terms of
the concept of 'relationship between objects', rather than the
other way around. On the other hand, in theoretical discussions
it is often useful to begin with and focus on the concept of 'in-
dependence', as discussed in appendix I.
Herman speaks about "objects" because he wishes to apply the con-
cept of 'independence' to two different types of object, as he
indicates in his next sentence:
> A random variable here is an object, as is an event.
Consider Herman's concept of an event, and consider his notion
(implied in the second most recent quotation above) of the "dis-
tribution" of events -- how are events distributed? One answer
is that they are distributed over time (or over some other appro-
priate dimension). Thus consider the variable "time of an
event". We can view the idea of independence of events simply in
terms of independence of (i.e., lack of relationship between) two
variables reflecting the (distribution over) time of the two
(types of) events.
Thus rather than needing two notions of independence (one for
variables and the other for events), we can subsume both types of
independence under the idea of a lack of a relationship between
variables.
>
> How can this be understood without knowing what it means for
> something to be the distribution of a random variable?
The referent of Herman's "this" is unclear, although convention
suggests that the referent is the point he makes in his sentence
that precedes the above sentence. However, I suspect that Herman
is not referring to the (somewhat peripheral) point in that sen-
tence. Instead, I suspect that he is asking how his definition
[3] of the concept of a relationship between variables can be un-
derstood if one does not understand the concept of the distribu-
tion of the values of a (random) variable.
If that is Herman's point, I fully agree with it. If we are to
successfully use [3] (or [2]) to define the concept of 'relation-
ship between variables' in a statistics course, students must
first understand the concept of the distribution of the values of
a variable.
HERMAN RUBIN'S REMARKS ABOUT [1]
Herman continues ...
>
> Expectation should not be taught using the formulas usually
> given,
By the "formulas usually given" I think Herman means the sum (or
integral) across all the possible values of the variable of the
product of the variable and its probability (density) function
[e.g., for the variable x, the sum (integral) across x of the
product of x and p(x)].
If we wish to teach the concept of 'expected value' to non-
statistics-majors, I agree with Herman that the formulas usually
given should not be used. We can teach the concept to non-
statistics-majors in terms of the concept of 'arithmetic mean' or
'average'. That is, the expected value is the value we will get
if we compute the average of the values of the variable for all
the entities in the population.
Non-statistics-majors readily understand that we can estimate
with reasonable precision the expected value of any variable by
computing the average of the values of the variable in a suitable
sample. Here students need an INFORMAL awareness of the concept
of 'distribution'. That is, they need to understand the idea
that the values of variables generally vary. But they need no
mathematical awareness of distributions beyond adding together
the values and dividing by N. (Most students already know from
statistical reports in the media that the average lies at the
"center" of the values.)
> but those formulas involve the concept of distribution as well.
I think Herman is here making the following argument:
- Definition [1] defines the concept of 'relationship between
variables' in terms of the concept of 'expectation' or 'ex-
pected value'.
- But the formulas usually given for the concept of expected
value' involve the concept of 'distribution'.
- Therefore, [1] depends on the concept of 'distribution'.
I agree that [1] appeals to the concept of 'expected value' and
that the formulas usually given for expected value involve the
concept of 'distribution' [which is implicit in the function
p(x)]. However, if we bypass the formulas usually given and
characterize the concept of 'expected value' in terms of the con-
cept of 'arithmetic mean' or 'average', we bypass the need to re-
fer to the mathematical concept of 'distribution'. This makes
the ideas substantially easier to understand. I further discuss
this approach to expected value in a paper for students (1997b,
sec. 7.10).
HERMAN RUBIN'S SECOND DEFINITION OF "RELATIONSHIP BETWEEN
VARIABLES"
>
> The above definition could also be formulated as
>
> E(f(y)|x') ~= E(f(y)) [4]
>
> for all functions f for which the expectations exist,
I suspect that Herman here means not ALL functions f but, in-
stead, SOME function f from among the set of all functions for
which the expectations exist. That is, under [4] a relationship
exists between X and Y if and only if [4] is satisfied for some
x' and some (any) specific value of Y, and some (one, any) func-
tion f.
In [4] Herman has neatly changed from using the probability (den-
sity) operator as the main operator in the definition (as in [2]
and [3]) to using the expected-value operator (as in [1]).
Definition [4] is equivalent to [2] and [3] in the sense that [4]
will declare that a relationship exists between two compatible
variables if and only if [2] and [3] also declare that a rela-
tionship exists. Appendix B discusses the equivalence of [2],
[3], and [4].
If the function f in [4] is the identity function (which it usu-
ally can be), [4] becomes [1]. In other words, [1] identifies a
subset of the cases that satisfy [4].
(For the broadest generality, the function f in [4] is allowed to
take multiple y-values [i.e., a vector of y-values] as its argu-
ment. This enables us to include the variance function [as used
in Jan de Leeuw's example] and similar functions in the set of
permissible functions that may appear as f in the definition.)
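(The following minimal Python sketch -- hypothetical, for illus-
tration only -- applies [4] with f(y) = y^2 to data like those in
Jan de Leeuw's variance example. Because E(y|x) = 0 there,
E(f(y)|x) tracks the conditional variance, and [4] detects the
relationship that [1] misses:

   import numpy as np

   rng = np.random.default_rng(0)

   x = rng.uniform(1.0, 3.0, size=200_000)
   y = x * rng.normal(size=x.size)  # E(y|x) = 0, V(y|x) = x**2

   f = lambda v: v**2               # one permissible f in [4]
   x_prime = x > 2.5                # one range of values x' of x

   print("E(f(y))    ~", f(y).mean())
   print("E(f(y)|x') ~", f(y[x_prime]).mean())

The two printed values differ, so [4] declares that a relation-
ship exists.)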
I further discuss Herman's two definitions below, but first it is
helpful to consider three other definitions of the concept of
'relationship between variables'.
A STANDARD DEFINITION OF THE CONCEPT OF 'RELATIONSHIP BETWEEN
VARIABLES' FROM MATHEMATICAL STATISTICS
Jan de Leeuw begins his discussion above of the definition of
"relationship between variables" with the concept of 'independ-
ence', as opposed to beginning with the concept of 'dependence'
or 'relationship'. Jan may have begun this way because it is a
standard way to begin. For example, Freund and Walpole present
the following familiar definition of "independence" of two vari-
ables in their popular mathematical statistics textbook:
If p(x,y) is the value of the joint probability distri-
bution of the discrete random variables X and Y at
(x,y), and p1(x) and p2(y) are the values of the mar-
ginal distributions of X and Y at x and y, X and Y are
*independent* if and only if
p(x,y) = p1(x) p2(y)
for all (x,y) within their range.
To give a corresponding definition for continuous random
variables, we simply substitute the word "density" for
the word "distribution" [and the word "continuous" for
the word "discrete"] (1987, p. 126).
(For comparability, I have reduced Freund and Walpole's original
definition from N variables to two variables and I have changed
the variable and function names to be consistent with those in
this post.)
Hogg and Craig, in their popular mathematical statistics text-
book, define "independence" the same way, although they use dif-
ferent wording and notation (1995, p. 101). Other textbooks also
give conceptually the same definition, and thus Freund and
Walpole's definition reflects a widely-held view of the concept
of 'independence of two variables'.
Freund and Walpole emphasize the concept of 'independence' and
give much less attention to the concept of 'dependence' or 'rela-
tionship'. Instead, "dependence" between two variables is simply
(and reasonably) implied as the negation of independence.
Thus we can define the concept of 'relationship between vari-
ables' in terms of the negation of Freund and Walpole's defini-
tion of "independence". A reasonable version of this is
DEFINITION: If p(x,y) is the value of the joint prob-
ability (density) function of the random variables X and
Y at (x,y), and p1(x) and p2(y) are the values of the
marginal probability (density) functions of X and Y at x
and y, there is a *relationship* between X and Y if and
only if
p(x,y) ~= p1(x) p2(y) [5]
for some (x,y) within their range.
Definition [5] is equivalent to [2] through [4] in the sense that
[5] will declare that a relationship exists between two compati-
ble variables if and only if [2] through [4] also declare that a
relationship exists -- see appendix B.
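(The following minimal Python sketch -- hypothetical probabili-
ties, for illustration only -- checks [5] for two discrete vari-
ables:

   import numpy as np

   # A joint probability table p(x,y); rows index the values of
   # X and columns index the values of Y:
   p = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

   p1 = p.sum(axis=1)          # marginal distribution of X
   p2 = p.sum(axis=0)          # marginal distribution of Y
   product = np.outer(p1, p2)  # p1(x) p2(y) at every (x,y)

   print(np.isclose(p, product))  # any False entry satisfies [5]

Because the factorization fails at some (x,y), [5] declares that
a relationship exists between X and Y.)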
Although [5] is equivalent to [2] through [4] in the sense above,
[5] is different from [2] through [4] (and [1]) in an important
sense -- [5] makes no significant distinction between the re-
sponse variable and the predictor variable, while the other defi-
nitions all make such a distinction. Specifically, definitions
[1] through [4] use the vertical bar to mean "given that", and
the response variable y always appears to the left of the verti-
cal bar, and the predictor variable x always appears to the right
of the vertical bar. (The vertical bar is implicit in [3].)
As discussed above, most empirical research projects (or logical
units of research projects) can be usefully viewed as studying
the relationship between a single response variable and one or
more predictor variables. Thus the distinction between the re-
sponse variable and the predictor variable(s) is important in
most empirical research projects. But [5] does not significantly
distinguish between these variables. Thus [5] has less direct
applicability to the use of statistics in empirical research than
[1] through [4].
Consider the issue of quantification: Note how Freund and
Walpole's definition of "independence" is a universally quanti-
fied statement, as indicated by the phrase "for all (x,y)" in the
definition. On the other hand [5], which is the negation of
Freund and Walpole's definition, is an existentially quantified
statement, as indicated by the phrase "for some (x,y)" in the
definition. Definitions [1] through [4] are also existentially
quantified statements, as is underscored by the primes on some of
the x's and y's in the definitions.
An advantage of using an existentially quantified definition is
that, in general, existentially quantified statements can (if
they are true) be verified in empirical research while univer-
sally quantified statements can almost never (even if they are
true) be directly verified. (Universally quantified statements
can be falsified -- see appendix D.) Universally quantified
statements cannot be verified because proper verification re-
quires an exhaustive search, which (due to resource limitations)
is almost always impossible.
(Because providing empirical support for a universally quantified
statement is almost always impossible, empirical researchers
rarely make universally quantified statements. In particular,
empirical researchers rarely try to EMPIRICALLY support the claim
that NO relationship exists between two given compatible vari-
ables -- that is, they rarely [if ever] try to empirically sup-
port the claim that the two variables are independent. They do
not attempt to support this claim because generally it cannot be
reasonably empirically supported. Instead, following the princi-
ple of parsimony, most experienced researchers simply formally
ASSUME that no relationship exists between a response variable
and one or more compatible predictor variables until unequivocal
empirical evidence is brought forward that allows us to reject
the "null" assumption of no relationship.
(Appendix E discusses the "conservation" laws of physical sci-
ence, which are an interesting exception to the main point of the
preceding paragraph.)
As noted above, non-statistics-majors are more likely to be im-
pressed by the practical side of statistics. Thus it makes sense
to introduce them to the existentially quantified definition of
"relationship between variables" (as defined by any of defini-
tions [1] through [7] in this post) instead of the universally
quantified definition of "independence". This helps non-
statistics-majors to see the practical use of statistics in em-
pirical research, which is mostly about relationships between
variables (and not about "independences").
On the other hand, if we are teaching statistics to students who
ARE majoring in statistics or mathematics, it is important to in-
troduce the fundamental universally quantified definition of "in-
dependence of N random variables", as discussed in appendix I.
ROBERT FRICK'S DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES"
Bob Frick wrote privately to me proposing another form of [1].
(I quote him here with his permission.) Referring to [1], he
writes
> I propose
>
> DEFINITION: There is a *relationship* between the vari-
> ables x and y if for at least one pair of values x'
> and x" of x
>
> E(y|x') ~= E(y|x"). [6]
>
> I think this definition is mathematically equivalent to your
> definition and better fits my intuitive understanding and the
> typical understanding of causality. I had to translate from
> your formulation to mine in order to understand and evaluate
> yours.
I agree with Bob that [1] and [6] are "mathematically equivalent"
in the sense that [1] will declare that a relationship exists be-
tween two compatible variables if and only if [6] also declares
that a relationship exists. Appendix F discusses the equivalence
of [1] and [6].
(Discussion at several places below focuses on CONTINUOUS [as op-
posed to discrete] response variables. This is because continu-
ous response variables are available in most areas of empirical
research and continuous variables generally carry substantially
more information in their values. Thus using a continuous re-
sponse variable usually enables a researcher to obtain better
knowledge of the relationship between the response variable and
the predictor variable[s] under study.)
Consider some properties of [1] and [6]:
1. Definition [1] is simpler than [6] in the sense that the right
side of [1] is an unconditional expected value while the right
side of [6] is a (more complicated) conditional expected
value.
2. Definition [6] directly reflects how the detection of rela-
tionships between variables is usually done in an important
case in empirical research -- the case in which the response
variable is continuous and the predictor variable is discrete,
with two values. This reflects the simplest standard experi-
mental design, which is usually best analyzed with the simplest
case of analysis of variance -- the one-way case with two levels
(also called the t-test). In this case we test whether a rela-
tionship exists between the variables by testing whether [6]
(not [1]) is satisfied. Statistical practice favors [6] here
over [1] because a research project properly based on [6] gener-
ally provides (other things being equal) a more powerful statis-
tical test of whether the sought-after relationship exists. (A
sketch of this test, and of the parameter test in point 3, ap-
pears after this list.)
3. The approach implied by [6] is generally not used to detect
relationships between variables when we have a continuous re-
sponse variable and a CONTINUOUS predictor variable. In this
case the test for the existence of a relationship is generally
a test of whether a parameter in a model equation has a cer-
tain "null" value. We know or believe that the parameter will
have the null value (typically zero) if no relationship is
present and will have a different value if a relationship IS
present. If we can reasonably reject the hypothesis that the
parameter has the null value, we can (tentatively) conclude
that a relationship exists between the variables. Statistical
practice favors a test of a parameter here instead of the ap-
proach implied by [6] because the parameter test provides
(other things being equal) a more powerful statistical test of
whether the sought-after relationship exists.
4. An approach (properly) based on [6] can generally give better
prediction or control capability than a similar approach based
on [1].
5. Although the approach implied by [6] is directly used SOME of
the time to test for relationships between variables in em-
pirical research, the approach implied by [1] is almost NEVER
directly used. Instead, forms that can be derived from [1]
(such as [6] or a test of a parameter) are used in actual
practice.
6. Consider the case in which humans informally study relation-
ships between variables in everyday life. In this case we are
usually not conscious of the concept of 'relationship between
variables'. For example, after several visits to a new bank a
person may observe, "The earlier in the morning I go to the
bank, the less time I have to wait to be served." ("Duration
of waiting time" is the response variable and "bank arrival
time" is the predictor variable.) In this case people seem
more often to BEGIN with an approach resembling [1] than one
resembling [6]. This may be because [1] is simpler and lends
itself at least as well as [6] to natural situations. Here,
we often begin without knowledge of the identity of the rele-
vant predictor variable, and thus without direct knowledge of
the values of the response variable when the predictor vari-
able is at two different values (x' and x"), although this
type of knowledge usually comes later. Instead, we initially
discover the relationship by noting that the response variable
deviates from its expected value when the predictor variable
is at a particular value (or in some value range). Thus [1]
seems more basic or more "natural" to me than [6].
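(The following minimal Python sketch -- simulated data and hypo-
thetical parameter values, for illustration only -- shows the two
tests discussed in points 2 and 3 above: the two-level t-test of
[6] and the test of a "null" parameter value in a model equation:

   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(2)

   # Point 2: continuous response, two-valued discrete predictor.
   group0 = rng.normal(5.0, 1.0, size=40)  # y-values at x = x'
   group1 = rng.normal(5.8, 1.0, size=40)  # y-values at x = x"
   t, p = stats.ttest_ind(group0, group1)  # tests E(y|x') = E(y|x")
   print("t-test p-value:", p)

   # Point 3: continuous response, continuous predictor.
   x = rng.uniform(0.0, 10.0, size=80)
   y = 1.0 + 0.5 * x + rng.normal(size=x.size)
   result = stats.linregress(x, y)         # tests the "null" slope 0
   print("slope p-value:", result.pvalue)

In each case a small p-value lets us [tentatively] reject the hy-
pothesis of no relationship.)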
The above discussion suggests that [1] and [6] both have advan-
tages. Thus in statistics courses in which the teacher decides
to introduce [1] or [6] to define "relationship between vari-
ables", I recommend introducing both definitions to students.
A DEFINITION OF "RELATIONSHIP BETWEEN VARIABLES" IN TERMS OF A
MATHEMATICAL FUNCTION
The discussion above refers to the idea of a parameter in a model
equation. The idea of a model equation suggests the following
definition of the concept of 'relationship between variables':
DEFINITION: There is a *relationship* between the vari-
ables x and y if the value of y can be expressed as a
non-constant mathematical function of the value of x. [7]
An "error" term e is generally included with the func-
tion, where e is usually viewed as being independent of
x. This is stated algebraically as
y = g(x) + e.
For maximum generality, the function g(x) is shown as a fully
general function. However, in any real study of a relationship
between variables the general function g is replaced by a spe-
cific mathematical function that is chosen from among the many
types that are available.
Definition [7] is closely related to the concept of 'expected
value' because the function g is almost always chosen so as to
"best" estimate E(y|x).
(If the response variable is continuous and if the conditional
distribution of its values is noticeably non-symmetric, which I
estimate occurs in less than four percent of empirical research
projects with continuous response variables, the median may be
used instead of the mean [expected value]. The same basic prin-
ciples apply, but g(x) estimates the median of the conditional
distribution of the response variable instead of the mean.)
The mathematical form of g is chosen mostly through analysis of
data obtained in empirical research, although theoretical consid-
erations sometimes also play a central role, especially in the
physical sciences.
The function g is usually a mathematical function in the strict
sense of the term "function". That is, g is a one-to-one or pos-
sibly many-to-one mapping between two sets, with no random ele-
ment involved. (That is, the mapping is not one-to-many.)
The random element in [7] is handled by the error term e (which
is usually represented by the lowercase Greek letter epsilon).
This term takes account of the (empirical) fact that invariably
in real situations the best function g cannot perfectly predict
the associated value of y from a value of x -- the e is the error
in the prediction. Researchers often determine the distribution
of e, but in any real situation the term itself has a different
unpredictable value every time an instance of the equation oc-
curs.
The error term in [7] provides two important conceptual benefits:
1. The error term enables the equation to satisfy the mathemati-
cal requirements of the equals sign.
2. The error term collects all the unaccounted-for variation in
the values of y in a single sensible place. (Some complicated
analyses use multiple error terms.)
Definition [7] is equivalent to [1] and [6] in the sense that [7]
will declare that a relationship exists between two compatible
variables if and only if [1] and [6] also declare that a rela-
tionship exists. Appendix F discusses the equivalence of [1],
[6], and [7].
(Definition [7] is not equivalent to definitions [2] through [5],
but can be made so by broadening it, as discussed in appendix G.)
Definition [7] is important because mathematical functions are
often used to represent relationships between variables in most
branches of the physical and biological sciences, and also (at a
more abstract and implicit level) in much research in the social
sciences. In the physical sciences the error term e in [7] is
usually omitted, but the same general principle of stating rela-
tionships between variables in the form of mathematical functions
is widely used.
Definition [7] refers to the concept of 'independence'. A
teacher presenting [7] to students can use the standard approach
exemplified above in Freund and Walpole's definition of "inde-
pendence of two random variables" to characterize 'independence'.
However, that approach appeals to the concept of 'distribution'.
Thus students must understand the statistical concept of 'distri-
bution' to fully understand [7].
If a teacher chooses to present definitions [1] and [6] to stu-
dents, and IF the students have sufficient mathematical ability,
I recommend that the teacher also present definition [7]. I rec-
ommend that the three definitions be presented in succession,
separated only by careful discussion of practical examples of ac-
tual relationships to reinforce each definition. Presenting the
definitions in succession helps students to attain a unified
sense of the various ways that relationships between variables
appear in life and empirical research.
(My experience suggests that most students cannot understand ANY
definition of the concept of 'relationship between variables'
without sufficient discussion of practical examples, with "suffi-
cient" depending on the level of the students.)
COMPARISON OF THE DEFINITIONS
The preceding material discusses seven definitions of the concept
of 'relationship between two variables'. Which definition is
preferred?
To simplify this question, the following discussion views defini-
tions [1], [6], and [7] as if they are the same definition. This
is reasonable because the three definitions are theoretically
equivalent, as noted in appendix F. The discussion refers to the
three definitions jointly as the "expected-value" definition of
the concept of 'relationship between variables'.
Similarly, the following discussion views definitions [2], [3],
[4], and [5] as if THEY are the same definition. This is reason-
able because [2] through [5] are theoretically equivalent, as
noted in appendix B. The discussion refers to [2] through [5]
jointly as the "distribution" definition of the concept of 'rela-
tionship between variables'.
The expected-value and distribution definitions are not equiva-
lent, as is illustrated by Jan de Leeuw's variance example: If
we consider the example with y in the role of the response vari-
able, the expected-value definition does not directly declare
that a relationship exists between x and y, but the distribution
definition does directly declare that a relationship exists.
Since the two definitions are not equivalent, which of them is
preferred?
It is reasonable to split this question into two more specific
questions:
- Which definition is preferred in the introductory statistics
course for non-statistics-majors?
- Which definition is preferred in general statistical discourse?
In determining the preferred definition, I assume we are not
Platonists. Thus neither the expected-value definition nor the
distribution definition is more "correct". This is because we do
not believe that some true Platonic concept of 'relationship be-
tween variables' exists somewhere, and we are trying to capture
the concept in the definition. Instead, we are free to CHOOSE a
definition as being "correct". Many readers will agree that a
reasonable approach to making this choice is to choose whichever
definition has more conceptual advantages.
Consider some features and advantages of the expected-value defi-
nition:
1. The expected-value definition is easier to understand than
the distribution definition because it does not require
mathematical understanding of the statistical concept of
'distribution (of the values of a variable)'.
2. Empirical researchers are generally much more interested in
directly predicting or controlling the values of the response
variable in an empirical research project (i.e., in predict-
ing or controlling expected value) than in predicting or con-
trolling the values of higher moments (e.g., variance) of the
response variable. And although situations arise (especially
in quality control) in which examples like Jan's are impor-
tant, I estimate that more than ninety-six percent of all em-
pirical research projects that study relationships with con-
tinuous response variables (as reported in the empirical re-
search literature) can be reasonably understood as viewing
relationships in terms of the expected value (or occasionally
in terms of the conditional median) of the response variable.
And usually, if a relationship is found between the VARIANCE
of the response variable and a predictor variable, this is
merely viewed as a NUISANCE. (The variance relationship is a
nuisance because heterogeneity of response variable variance
adds complexity to the analysis.)
3. Consistent with point 2, the expected-value definition is
(implicitly) used much more frequently than the distribution
definition to define the statistical tests that are performed
in empirical research to detect relationships between vari-
ables. In the case of a continuous response variable the
tests are almost always (effectively) tests of whether some
MEASURE OF CENTER or some PARAMETER of a model (both of which
are often linear functions of the [perhaps trimmed or
subsetted] values of the response variable) has some value,
or is different from some other fixed value, or is different
from some other empirically derived value or values. These
tests are thus effectively tests of the FIRST moment (possi-
bly with appropriate subsetting) of the values of the re-
sponse variable. Only rarely are the key tests performed on
other moments of the values of the response variable. Also,
tests that are in terms of the probability (density) function
of the values of the response variable are performed only in-
frequently. (Instances occur when the response variable is
discrete, as opposed to continuous but, as suggested above,
discrete response variables are used less often.)
4. Statistically knowledgeable empirical researchers often per-
form statistical tests for variance relationships. But when
they perform such tests they are usually DIRECTLY interested
in studying a relationship between variables as defined by
the expected-value definition. And they are only performing
the variance tests to assist in verifying that the underlying
assumptions of the statistical procedure being used are ade-
quately satisfied. Furthermore, statistically knowledgeable
empirical researchers almost never check whether the third or
higher moments of the response variable change as a function
of a predictor variable. This suggests that empirical re-
searchers generally view study of moments of the response
variable higher than the second as being of little interest
or value.
5. A function (transformation) is sometimes applied to the val-
ues of the response variable in the data analysis of an em-
pirical research project. However, if such a function is
used, the purpose is usually merely to stabilize the variance
of the response variable to satisfy assumptions of the sta-
tistical procedure being used -- not to support direct study
of higher moments or other similar study of the values of the
response variable.
6. Although the expected-value definition does not directly
cover certain cases (such as Jan's V(y|x) case), it covers
all these cases indirectly when the appropriate function is
applied to the values of the response variable, as suggested
by Herman's definition [4].
7. The terminology of the expected-value definition is consis-
tent with common language. For example, it is natural and
informative to report the results of an empirical research
project that found the result in Jan's example as "There is
no evidence of a relationship between x and y but there IS
good evidence of a relationship between x and V(y)."
8. The expected-value definition makes a distinction between
certain types of relationships between variables -- a dis-
tinction that definitions [2], [3], and [5] do not make.
(The distinction is also made by definition [4].) The dis-
tinction is in terms of the function f (which is usually
merely the identity function) that is applied to the values
of the response variable. In cases in which this function is
not the identity function, naming it helps one to understand
the relationship.
9. Empirical researchers are generally interested in minimizing
the (error) variance in the values of the response variable
in a research project. This is because minimizing variance
results in increased precision of prediction or control of
the values of the response variable, which is a widely pur-
sued general goal of empirical research. However, minimiza-
tion of variance is usually not pursued directly in empirical
research. Instead, minimization of variance comes as a sec-
ondary benefit from studying prediction or control of ex-
pected value through relationships between variables. That
is, usually a large part of the variability in the values of
the response variable in an empirical research project is as-
sumed to reflect the fact that this variable DEPENDS on nu-
merous other variables (many of which may be unknown), and
these other "influencing" variables may be varying (either
systematically or at random) within or between entities,
thereby causing some of the variation in the values of the
response variable. (Some of the variation in the response
variable is also due to measurement error, and some of the
variation may be "totally random".) Identifying the influ-
encing variables through studying relationships between vari-
ables in terms of the expected-value definition "removes" the
variation from the values of the response variable that can
be associated with these variables, thereby reducing the "er-
ror" variance in the values of the response variable, and
thereby increasing precision in prediction or control. That
is, researchers generally increase precision through studying
relationships between variables (and improving measurement
methods) -- NOT through direct efforts to somehow reduce
variance WITHOUT studying relationships between variables.
10. The expected-value definition is consistent with the distri-
bution definition. This is because the expected-value defi-
nition is not stated as "if and only if". The expected-value
definition gives only a SUFFICIENT condition for a relation-
ship between variables -- it does not give a necessary condi-
tion. (As noted above, this condition defines an empirically
large subset of the cases defined by the distribution defini-
tion.) Thus the expected-value definition leaves open the
possibility that other forms of "relationship" might also di-
rectly qualify, although we need not discuss this esoteric
point with non-statistics-majors.
Consider some features and advantages of the distribution defini-
tion:
1. The distribution definition identifies a class of relation-
ships between variables that the expected-value definition
does not directly identify. These are the relationships that
resemble Jan's V(y|x) case. (However, as noted, the ex-
pected-value definition does identify these cases if an ap-
propriate function is applied to the values of the response
variable.)
2. Unlike the expected-value definition, the distribution defi-
nition (with the exception of [4]) does not force one to look
for a function to deal with cases like Jan's V(y|x) case. If
the inequalities in [2], [3], or [5] are satisfied IN ANY
WAY, the distribution definition declares that a relationship
exists between the two variables. This idea is important
from a theoretical point of view. However, the idea is not
often directly applied in empirical research. This is be-
cause in empirical research it is generally easier to find an
appropriate function (if needed) and then to use the expected-
value definition than it is to study the entire distribution
of the values of the response variable. That is, researchers
generally focus on a key aspect of the distribution -- usually
the value around which it is "centered", which is (perhaps after
a transformation) best represented by the expected value. Ex-
perience has shown that studying the expected value (or occa-
sionally some other measure of central tendency) of the response
variable, while keeping an eye on the spread, is an efficient
way of reducing the concepts to simple yet generally sufficient
principles.
3. The distribution definition reflects the Bayesian approach to
the study of relationships between variables. This approach
is reasonably viewed as focusing on the DISTRIBUTION of the
values of the response variable (as opposed to focusing
merely on the EXPECTED VALUE of the response variable). Re-
searchers using the Bayesian approach study the relationship
between a response variable (which may be a parameter) and
zero or more predictor variables by inferring the "posterior"
distribution of the values of the response variable. They
make this inference on the basis of Bayes' theorem and
(a) the values of the response variable and predictor vari-
able(s) (if any) obtained from the entities in the sample
in the research project and
(b) the "prior" distribution of the values of the response
variable (possibly conditioned on the values of the pre-
dictor variables).
Thus the distribution definition directly mirrors the
Bayesian approach. This is a crucial advantage of the dis-
tribution definition if one is using the Bayesian approach.
(A small sketch of a posterior inference appears after this
list.)
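(The following minimal Python sketch -- a hypothetical beta-
binomial example, for illustration only -- shows the Bayesian
focus on the posterior DISTRIBUTION of the values of the re-
sponse variable [here a parameter]:

   from scipy import stats

   prior = stats.beta(2, 2)      # the "prior" distribution

   successes, trials = 14, 20    # data from the sample
   posterior = stats.beta(2 + successes, 2 + trials - successes)

   print("prior mean:    ", prior.mean())
   print("posterior mean:", posterior.mean())
   print("central 95% interval:", posterior.interval(0.95))

The analysis reports the entire posterior distribution [here a
beta distribution with parameters 16 and 8], not merely a single
expected value.)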
I am unable to think of other significant features or advantages
of the distribution definition. If readers see other features or
advantages of either definition, I hope they will present them to
this debate.
WHICH DEFINITION IS PREFERRED?
Having considered some features and advantages of the two defini-
tions of the concept of 'relationship between variables', let us
now consider which definition is preferred.
First, which definition is preferred for an introductory statis-
tics course for non-statistics-majors? For such a course, if the
teacher elects to present a formal definition of the concept of
'relationship between variables', I recommend emphasizing the
expected-value definition. I base this on my beliefs that (a)
the expected-value approach is easier to understand, and (b) the
expected-value approach mirrors the methods statisticians and re-
searchers usually use to detect and study relationships between
variables in empirical research.
On the other hand, I recommend emphasizing the distribution defi-
nition if a teacher elects to teach the Bayesian approach in a
statistics course. The distribution definition is preferred in
this case because, as noted, it directly mirrors the Bayesian ap-
proach. (However, I recommend against teaching the Bayesian ap-
proach in an introductory statistics course for non-statistics
majors -- see appendix J.)
Finally, in a statistics course for students who ARE majoring in
statistics or mathematics, or in a statistics course for students
who have sufficient statistical experience, or in general statis-
tical discussion, I believe the preferred definition of "rela-
tionship between variables" should be at the discretion of the
instructor or participants. Reasonable criteria for making the
choice are that the preferred definition for a particular discus-
sion should
1. maximize understanding and
2. provide optimal support for the intended analysis.
In addition to being preferred in Bayesian cases, the distribu-
tion definition can better satisfy the two criteria in some non-
Bayesian cases, especially in some theoretical and mathematical
cases. For example, the technical discussion in appendix D ap-
peals to the distribution definition.
On the other hand, in many other non-Bayesian cases the expected-
value definition seems superior. For example, "standard" analy-
sis of variance seems better viewed in terms of the expected-
value definition. This is because in standard analysis of vari-
ance the resulting p-values are almost always reasonably viewed
as testing for relationships between variables in terms of de-
tecting differences between means of the values of the response
variable -- standard analysis of variance does not (directly)
test anything about the higher moments of the values of the re-
sponse variable. Appendix K further discusses this point.
GENERALIZATION OF THE DEFINITIONS
Definitions [1] through [7] are all definitions of a relationship
between a SINGLE response variable and a SINGLE predictor vari-
able. Appendix H discusses the important issue of generalizing
the definitions to situations with multiple response variables
and multiple predictor variables.
MAIN POINTS
The concept of 'relationship between variables' can be reasonably
defined in terms of the concept of 'expected value' and in terms
of the concept of 'univariate distribution'. The two definitions
are not equivalent. The expected-value definition identifies a
large subset of the cases identified by the distribution defini-
tion. The expected-value definition indirectly identifies the
remaining cases.
The expected-value definition is easier to understand and has
several other significant advantages over the distribution defi-
nition. And for the introductory statistics course for non-
statistics-majors the expected-value definition appears to have
no serious disadvantages. Thus I recommend that a teacher empha-
size the expected-value definition in an introductory statistics
course for non-statistics majors if the teacher elects to present
a formal definition of the concept of 'relationship between vari-
ables'.
The distribution definition is preferred when the Bayesian ap-
proach is used and in some theoretical and mathematical discus-
sions.
The easy-to-understand concept of 'relationship between vari-
ables' is a central unifying concept of both the field of statis-
tics and empirical research. A key use of the concept is to as-
sist researchers in accurate prediction and control. Thus I rec-
ommend that the introductory statistics course for non-statistics-
majors focus on the study of relationships between variables in
empirical research as a means to accurate prediction and control.
This focus is important whether the concept of 'relationship' is
formally defined or is instead informally characterized in terms
of practical examples.
Don Macnaughton
-------------------------------------------------------
Donald B. Macnaughton MatStat Research Consulting Inc
[EMAIL PROTECTED] Toronto, Canada
-------------------------------------------------------
APPENDICES
The appendices to this post are of less general interest, so I
give only their titles and links here:
Appendix A: Is The Concept of a "Random" Variable Necessary in
the Definition of "Relationship Between Variables"?
- at http://www.matstat.com/teach/p0045.htm#a
Appendix B: Equivalence of Definitions [2], [3], [4], and [5]
- at http://www.matstat.com/teach/p0045.htm#b
Appendix C: The Symmetry of Definitions of "Relationship Between
Variables"
- at http://www.matstat.com/teach/p0045.htm#c
Appendix D: Verification and Falsification in the Study of Rela-
tionships Between Variables
- at http://www.matstat.com/teach/p0045.htm#d
Appendix E: A Case When Researchers Do Discuss Independence of
Variables
- at http://www.matstat.com/teach/p0045.htm#e
Appendix F: Equivalence of Definitions [1], [6], and [7]
- at http://www.matstat.com/teach/p0045.htm#f
Appendix G: Rewording [7] to Be Equivalent to the Distribution
Definition
- at http://www.matstat.com/teach/p0045.htm#g
Appendix H: Generalization of [1] Through [8]
- at http://www.matstat.com/teach/p0045.htm#h
Appendix I: The Importance of the Concept of 'Independence of
Variables'
- at http://www.matstat.com/teach/p0045.htm#i
Appendix J: Should the Introductory Statistics Course Teach the
Bayesian Approach?
- at http://www.matstat.com/teach/p0045.htm#j
Appendix K: Do Analysis Of Variance F-Tests Test Variances?
- at http://www.matstat.com/teach/p0045.htm#k
The full essay is at
http://www.matstat.com/teach/p0045.htm
REFERENCES
American Statistical Association. 2002. "Curriculum guidelines
for undergraduate programs in statistical science." Available
at http://www.amstat.org/education/Curriculum_Guidelines.html
Freund, J. E. and Walpole, R. E. 1987. _Mathematical statistics._
4th ed. Englewood Cliffs, NJ: Prentice-Hall.
Hogg, R. V. and Craig, A. T. 1995. _Introduction to mathematical
statistics._ 5th ed. Englewood Cliffs, NJ: Prentice Hall.
Macnaughton, D. B. 1997a. "Re: How should we motivate students in
intro stat? (Response to comments by John R. Vokey)." Posted
to EdStat and sci.stat.edu on April 6, 1997. Available at
http://www.matstat.com/teach/p0024.htm
---- 1997b. "The entity-property-relationship approach to statis-
tics: An introduction for students." Available at
http://www.matstat.com/teach/
---- 1998a. "Re: Eight features of an ideal introductory statis-
tics course. (Response to comments by Gary Smith)." Posted to
EdStat and sci.stat.edu on November 23, 1998. Available at
http://www.matstat.com/teach/p0036.htm
---- 1998b. "Eight features of an ideal introductory statistics
course." Available at http://www.matstat.com/teach/
---- 1999a. "Response to comments by Herman Rubin." Posted to
EdStat and sci.stat.edu on May 16, 1999. Available at
http://www.matstat.com/teach/p0041.htm
---- 1999b. "The introductory statistics course: The entity-
property-relationship approach." Available at
http://www.matstat.com/teach/
Moore, D. S. 1997a. "New pedagogy and new content: The case of
statistics" (with discussion), _International Statistical
Review,_ 65, 123-165.