(This post is formatted for viewing with a fixed-width font, such 
as "Courier".)

Quoting a 99/5/9 post of mine, Bob Hayden writes (on 99/5/9)

> ----- Forwarded message from Donald Macnaughton -----
>>
        < snip >
>> It is not necessary to bring formal statistical procedures
>> into the discussion to discuss relationships between vari-
>> ables.  I recommend that teachers capitalize on this fact and
>> give students a strong sense of the concept of a relationship
>> between variables before introducing ANY formal statistical
>> procedures.   
>>
        < snip >
>>
> ----- End of forwarded message from Donald Macnaughton -----      
>
> Without getting too deeply into the main issue here of univari-
> ate versus multivariate, I would like to comment on a couple of
> details.
>
> I think the relationship between a measurement variable and a
> categorical variable is best visualized with parallel boxplots
> -- one for each category -- on the same scale.  Indeed, such
> plots are the main reason to learn boxplots.  

Many readers will agree that plots are essential tools for under-
standing relationships between variables.  Four standard types of 
plot for illustrating the type of relationship Bob describes are

- parallel dot plot

- parallel boxplot

- graph (perhaps with standard-error-of-the-mean bars) and

- parallel stem-and-leaf plot.

To help with discussion of Bob's points, I show the same data 
plotted in each type of plot in a figure on a web page.  (The web 
page also contains the text in this post.)  The figure is at

           http://www.matstat.com/teach/p0043.htm#g

    FIGURE CAPTION:  Four types of parallel plot, each re-
    flecting exactly the same data.  The plots are called 
    (clockwise from the top left) parallel dot plot, parallel 
    boxplot, mean graph with standard-error-of-the-mean bars, 
    and parallel stem-and-leaf plot.  As can be seen on both 
    the parallel dot plot and the parallel stem-and-leaf 
    plot, the counts of the number of values of the response 
    variable available for the three values of the predictor 
    variable are (from left to right) 25, 26, and 24.  

(Appendix A describes how to obtain a higher-resolution copy of 
the figure.) 

The figure reflects a simple empirical research project in which 
a single "discrete" predictor variable is observed at or manipu-
lated through three values in the research entities (or in the 
entities' "environment") and the values of a single "continuous" 
response variable are observed in the same entities.  

(Appendix B discusses the distinction between discrete and con-
tinuous variables.)

(I could have searched data archives to find an appropriate da-
taset on which to base the figure.  However, to save time and to 
get exactly what I wanted, I simply used the SAS normal random 
number generator to make up the values of the response variable 
in the figure.  I specified nominal means of 28, 32, and 36 for 
the three groups and a nominal standard deviation [within each 
group] of 9.)
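
(As a rough illustration of this step, here is a minimal sketch 
in Python rather than SAS.  It is not the program used for the 
figure, so the values it produces will differ from those plotted.  
The function name, the seed, and the use of Python's random 
module are assumptions made for the example.  The group counts of 
25, 26, and 24 come from the figure caption.)

```python
import random


def simulate_groups(means=(28, 32, 36), sd=9, counts=(25, 26, 24), seed=1999):
    """Return one list of simulated response values per group.

    The means and sd are the nominal values given in the post; the
    seed is an arbitrary choice that makes the draw repeatable.
    """
    rng = random.Random(seed)
    return [[rng.gauss(mu, sd) for _ in range(n)]
            for mu, n in zip(means, counts)]


groups = simulate_groups()
for level, values in enumerate(groups, start=1):
    # Print the group level, its count, and its observed mean.
    print(level, len(values), round(sum(values) / len(values), 1))
```

(Because the values are random draws, the observed group means 
will generally differ somewhat from the nominal means.)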

If I were presenting the situation illustrated in the figure to 
students, I would make it very concrete, perhaps in part as fol-
lows:  The research entities are 75 AIDS patients who were ran-
domly assigned to three groups.  The predictor variable reflects 
three levels of a new drug that were (in double-blind fashion) 
administered to the three groups of patients -- a different level 
to each group.  (To increase the power of the statistical tests, 
one of the levels of the drug was "zero".)  The response variable 
is an appropriate measure of the healthiness of the patients af-
ter six weeks of treatment with the drug.  

(I would tell students that in real AIDS research only two levels 
of the drug would normally be used because, when appropriate lev-
els are chosen, this also helps increase the power of the statis-
tical tests.)

Also, if I were presenting the situation illustrated in the fig-
ure to students, I would carefully discuss the important implica-
tions of the figure for the treatment of AIDS patients, including 
how the implications are derived and the main caveats.

                            *   *   *

The four plots in the figure are quite different from each other, 
even though they all reflect exactly the same data.  What are the 
advantages and disadvantages of each type of plot?

Consider the parallel dot plot.  Dot plots (Tukey 1977, p. 50; 
Wilkinson 1999) have the advantage that they are closer to the 
raw data than the other three types of plots -- dot plots picto-
rially reflect the exact tabled values of both the response and 
predictor variables for each entity under study in the research.  
Because dot plots are close to the raw data, students find them 
easy to understand.  

Parallel dot plots can be easily drawn by any software that can 
draw scatterplots.  (If many data values are present, it is help-
ful to slightly offset the dots that lie atop one another, as 
shown in the figure.  This offsetting is unfortunately not avail-
able as a simple option in most plotting software, so the user 
must do it manually or write a program to do it semi-
automatically.  Appendix C discusses some offsetting algorithms.)
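
(One simple way to do the offsetting can be sketched as follows.  
This is an illustrative Python sketch under stated assumptions, 
not one of the algorithms from Appendix C:  values are grouped 
into bins of a fixed width, and dots that share a bin are spread 
alternately right and left of the group's horizontal position.  
The function name and the parameters bin_width and step are 
hypothetical.)

```python
from collections import defaultdict


def offset_dots(values, center, bin_width=1.0, step=0.05):
    """Return (x, y) pairs in which dots sharing a bin are spread around center."""
    bins = defaultdict(list)
    for y in values:
        # Values that round to the same bin are treated as overlapping.
        bins[round(y / bin_width)].append(y)

    points = []
    for ys in bins.values():
        for i, y in enumerate(sorted(ys)):
            # i = 0, 1, 2, 3, 4 gives offsets of 0, +1, -1, +2, -2 steps.
            k = (i + 1) // 2
            sign = 1 if i % 2 else -1
            points.append((center + sign * k * step, y))
    return points
```

(Here center is the horizontal position of one group, for example 
1, 2, or 3.  The resulting pairs can be passed to any routine 
that draws scatterplots.)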

Consider the parallel boxplots in the figure and consider any one 
of the three boxplots.  To understand this boxplot a student must 
understand the notion of the quantiles of a distribution of nu-
meric values (in particular, median and quartile) and a conven-
tion that defines the length of the whiskers (Tukey 1977, pp. 39-
53).  Although these technical concepts are not complicated, they 
make boxplots harder for students to understand than dot plots.

Boxplots have the advantage over the other types of plots that 
they highlight outliers -- points that lie well away from the 
other points on the plot.  For example, note the solitary outlier 
in the upper tail of the rightmost boxplot in the figure.
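
(The quantities behind a single boxplot can be sketched in Python 
as follows.  The sketch assumes the common convention in which 
each whisker runs to the most extreme value lying within 1.5 
interquartile ranges of the box, and values beyond that are 
marked as outliers.  Software packages compute quartiles in 
slightly different ways, so the results may differ a little from 
Tukey's hinges.  The function name and the parameter k are 
hypothetical.)

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Return the quartiles, whisker ends, and outliers for one boxplot."""
    data = sorted(values)
    # quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr

    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value inside the fences
        "whisker_high": inside[-1],  # largest value inside the fences
        "outliers": outliers,
    }
```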

Consider the graph in the lower-right quadrant of the figure.  
Graphs showing the mean (or median) values of the response vari-
able for each value of the predictor variable (possibly with 
standard-error-of-the-mean bars) are often used in reports in the 
empirical research literature and in the popular press.  Like 
boxplots, graphs showing the mean or median with error bars are 
harder for students to understand than dot plots because these 
graphs are based on technical concepts (i.e., a measure of the 
central tendency of a distribution and a measure of the spread). 

Furthermore, graphs with standard-error-of-the-mean bars hide the 
extent of the distribution because (as dictated by the formula 
for the standard error of the mean) the length of each bar is 
inversely proportional to the square root of the number of values 
of the response variable available for the given value of the 
predictor variable.  
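
(The formula in question is the standard deviation divided by the 
square root of the number of values.  The following Python sketch 
shows the effect of the count on the bar; the function name is 
hypothetical, and the counts other than 24, 25, and 26 are chosen 
only for illustration.)

```python
import math
import statistics


def standard_error_of_mean(values):
    """Sample standard deviation divided by the square root of the count."""
    return statistics.stdev(values) / math.sqrt(len(values))


# With the same nominal standard deviation, more values give a shorter bar.
nominal_sd = 9
for n in (24, 25, 26, 100):
    print(n, round(nominal_sd / math.sqrt(n), 2))
```

(With the nominal standard deviation of 9, a group of 25 values 
gives a standard error of 9/5 = 1.8, while a group of 100 values 
would give 0.9, even though the spread of the individual values 
is the same.)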

On the other hand, graphs with standard-error-of-the-mean bars 
are useful if we wish to focus on the "average" relationship be-
tween the two variables under study.  We can thus focus on a nar-
rower range of values of the response variable, as is reflected 
by the difference between the vertical axis scale on the plot in 
the lower-right quadrant of the figure and the vertical axis 
scales on the two plots in the upper half of the figure. 

Furthermore, graphs with standard-error-of-the-mean bars are use-
ful because they enable an experienced researcher to quickly per-
form a "visual t-test".  This gives one a visual confirmation of 
what takes place mathematically in the t-test.  Appendix D de-
scribes the visual t-test.  

(Standard-error-of-the-mean bars enable a visual t-test because 
the bars are scaled to reflect the number of values of the re-
sponse variable available for a given value of the predictor 
variable.  Boxplots cannot be used for visual t-tests because 
they are not so scaled.)
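
(To show why this scaling matters, the following Python sketch 
computes a two-sample t statistic directly from two group means 
and their standard errors.  It uses the Welch, or unequal-
variance, form of the statistic, which is an assumption made for 
the example.  It is not the visual procedure of Appendix D, and 
the function name is hypothetical.)

```python
import math
import statistics


def welch_t(a, b):
    """Difference in means divided by the combined standard error."""
    se_a = statistics.stdev(a) / math.sqrt(len(a))
    se_b = statistics.stdev(b) / math.sqrt(len(b))
    difference = statistics.mean(a) - statistics.mean(b)
    return difference / math.sqrt(se_a ** 2 + se_b ** 2)
```

(The statistic depends on the data only through the two means and 
the two standard errors, which are exactly the quantities drawn 
on a graph with standard-error-of-the-mean bars.)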

Both parallel boxplots and graphs have important advantages over 
parallel dot plots:  Boxplots and graphs SUMMARIZE the univariate 
distribution of the values of the response variable for a given 
value of the predictor variable.  Thus boxplots and graphs leave 
out detail in the corresponding parallel dot plot that is not 
needed to see the main pattern.  Also, boxplots and graphs are 
often easier to draw and generally take up less horizontal space 
on a page than dot plots.

Although in certain situations boxplots and graphs have advan-
tages over dot plots, students should learn that before they use 
a summary plot they should study a dot plot of the raw data to 
ensure that the summary plot is not hiding some important feature 
of the distribution of the values, as illustrated by Tukey (1977, 
pp. 49-50).

Consider the parallel stem-and-leaf plot in the figure.  This 
type of plot is useful when we need to display details of the ac-
tual values of a variable (Tukey 1977, pp. 6-16).  On the other 
hand, when these details are not needed, this type of plot has a 
significant disadvantage:  The extra textual detail distracts the 
viewer from the overall sense of the distribution of the values.  
The overall sense is often more important than the mostly unsub-
stantial specific numerical differences that are reflected in the 
digits in the "leaves" of the plot.

Also, stem-and-leaf plots are inferior to dot plots at highlight-
ing gaps in the distribution of a set of values.  This can be 
seen by studying the gaps in the dot plot and stem-and-leaf plot 
in the figure, especially the gap for the outlier in the upper 
tail when the predictor variable is at level 3.
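
(The construction of a stem-and-leaf display can be sketched in 
Python as follows.  The sketch builds a single display rather 
than a parallel one, and it assumes non-negative values that are 
rounded to whole numbers, with the tens as stems and the units as 
leaves.  The function name is hypothetical.)

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return display lines with tens as stems and units digits as leaves."""
    leaves = defaultdict(list)
    # Assumes non-negative values; each is rounded to a whole number.
    for v in sorted(round(x) for x in values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))

    lines = []
    # Stems with no leaves are still listed, so gaps appear as empty rows.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem:>2} | {''.join(leaves[stem])}")
    return lines
```

(A gap in the values appears here only as an empty row or a short 
run of leaves, whereas on a dot plot the same gap appears as 
empty space drawn to scale.)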

Appendix E discusses some other approaches to displaying the data 
in the figure.  

Because I believe dot plots are the easiest of the various types 
of plots for students to understand, I recommend that discussion 
of parallel plots in the introductory statistics course begin 
with parallel dot plots.  I recommend that this discussion be 
followed by discussion of parallel boxplots and graphs because 
the latter two types of plots are often used in reports of em-
pirical research.  

                            *   *   *

Bob's example studies a relationship between variables in which 
the response variable is continuous, but the predictor variable 
is discrete.  Bob may be suggesting that we use this type of ex-
ample as the FIRST detailed example of a relationship between 
variables in an introductory statistics course.  However, other 
types of example are also possible.  In particular, instead of 
using a discrete predictor variable we could use a continuous 
one.  Which type of relationship is best for the first detailed 
example of a relationship between variables at the beginning of 
an introductory course?

I recommend that the first detailed example of a relationship use 
response and predictor variables that are BOTH CONTINUOUS for the 
following reasons:

- To facilitate student understanding, the first example of a re-
  lationship should be as simple as possible.  This suggests us-
  ing an example of an observational research project as opposed 
  to an example of an experiment.  This is because with experi-
  ments students must understand the concept of random assignment 
  and the concept of "manipulation" of the values of a predictor 
  variable.  These concepts are not needed if we use an example 
  of an observational research project.

- It is desirable (when possible) to use continuous variables in 
  empirical research because a continuous variable almost always 
  carries more information in its values than a discrete variable 
  measuring the same property.  (An important exception is that 
  the "manipulated" variables in experiments are almost always 
  discrete because appropriately used discrete manipulated vari-
  ables provide more powerful statistical tests.)

- Many examples of observational research projects are available 
  that have both a continuous response variable and a continuous 
  predictor variable.

These points suggest that the first detailed example of study of 
a relationship between variables in an introductory course should 
be an example of an observational research project that studies 
the relationship between two continuous variables.  

I recommend the following example:  The response variable is the 
mark (say, out of 100) that each student obtained in a particular 
course of study.  The predictor variable is the total amount of 
time (in minutes) each student spent working on the course during 
the term, as tracked by student time diaries.  You can pique stu-
dent curiosity by using the data for the students in the preced-
ing term of your present course.  Appendix F discusses the logis-
tics of tracking student time spent on a course.

Studying the relationship between study-times and course-marks is 
effective because this relationship is of serious direct interest 
to most students.  Also, the example provides an easily under-
stood basis for discussing several important general concepts of 
statistics and empirical research such as measurement accuracy, 
weak relationships between variables, alternative explanations, 
the need for hypothesis testing about the presence of a relation-
ship, causation, multiple causation, observational versus experi-
mental research, and bivariate regression.

In an introductory course that follows the recommended approach 
and begins with an example with two continuous variables, the 
first graphic that students see is a scatterplot rather than a 
parallel plot.  After students understand how scatterplots illus-
trate the relationship between two continuous variables, we can 
THEN introduce the parallel dot plot as a special type of scat-
terplot that illustrates a new type of relationship between vari-
ables -- a relationship in which the predictor variable is no 
longer continuous, but is instead discrete.

                            *   *   *

Let us return to Bob's comments.  Recall that he says above that 
certain relationships between variables are best visualized with 
parallel boxplots.  He continues

> However, I see many texts that focus on the mechanics of con-
> structing a single boxplot, but then never go on to use them to
> visually compare several groups.  Perhaps this is the extreme
> in being adamantly univariate.  

I agree.


> On the other hand, I do think it is useful for students to
> learn to make boxplots without a computer, and for purposes of
> teaching this, there is an advantage in concentrating on one
> boxplot at a time.  

I agree that students can best understand boxplots if they con-
centrate on one boxplot at a time.  However, as discussed above, 
if a teacher wishes to use a discrete predictor variable in the 
first detailed example of a relationship, I recommend NOT start-
ing with boxplots, but with dot plots.  Under this approach I be-
lieve it is not necessary to begin with discussion of a dot plot 
of a single distribution.  Instead, after introducing the concept 
of a relationship between variables (which is what all the paral-
lel plots illustrate), we can immediately introduce a parallel 
dot plot to students as a useful tool for illustrating certain 
relationships.


> HOWEVER, as soon as the students understand what a boxplot IS,
> you can immediately put the boxplots to good use by having a
> computer generate parallel boxplots comparing several groups. 

As noted, I agree with Bob that parallel plots (dot plots, box-
plots, or graphs) are fundamental tools for illustrating certain 
relationships between variables.  However, an issue on which Bob 
and I may disagree concerns the ORDER in which a teacher should 
introduce the ideas of

(a) relationships between variables and 

(b) parallel plots. 

For students who are not majoring in statistics or mathematics, I 
recommend introducing relationships between variables FIRST, be-
fore we introduce individual or parallel plots (or scatterplots).  
On the other hand, Bob may be recommending that we introduce re-
lationships between variables SECOND, after we have introduced 
individual (and possibly parallel) plots.

Clearly, the approach of introducing individual or parallel uni-
variate plots (or scatterplots) before we introduce relationships 
between variables has SOME appeal.  In particular, if we follow 
this approach, when the time comes in the course to illustrate a 
relationship between variables with plots the students will al-
ready be familiar with the plots.

However, as I discuss elsewhere

- almost all the commonly used statistical procedures can be rea-
  sonably viewed as procedures for studying relationships between 
  variables (1999, sec. 4.3) and

- almost all formally reported empirical research projects can be 
  reasonably viewed as studying relationships between variables 
  (1999, app. B).

Thus the concept of 'relationship between variables' unifies al-
most all statistical procedures and almost all empirical research 
projects.  Therefore, I recommend that teachers center the intro-
ductory statistics course on the fundamental unifying concept of 
'relationship between variables'.

I illustrate in two papers how a teacher can easily introduce the 
concept of 'relationship between variables' in an introductory 
course without having to first cover univariate plots (1996, 
1999).  The 1999 paper also discusses how concepts related to 
univariate distributions are boring for students because the con-
cepts have no obvious practical value (sec. 6.9).

In view of these points, I recommend introducing relationships 
between variables first.  However, shortly after introducing re-
lationships between variables, I recommend that teachers intro-
duce the various types of plots that help us to ILLUSTRATE rela-
tionships between variables.  Such plots are essential tools for 
understanding relationships.

                            *   *   *

In my 99/5/9 post I discuss why I believe teachers continue to 
discuss univariate distributions at the beginning of introductory 
statistics courses even though it is no longer necessary to dis-
cuss this topic.  As part of that discussion I say

>> In the past, before the arrival of good statistical computing
>> packages, a person performing a statistical analysis had to
>> understand the mathematics of statistics in order to carry out
>> the (necessarily manual) computations.  (It is almost impossi-
>> ble to perform statistical computations manually if one does
>> not properly understand them.)

Quoting this passage Bob writes

> I would have to disagree that carrying out statistical computa-
> tions "by hand" requires or demonstrates statistical under-
> standing.  It only demonstrates that the steps in the computa-
> tion have been mastered.  Computers grind out statistical com-
> putations all the time without understanding them.  Programmers
> implement statistical formulas all the time with little or no
> understanding of why anyone wants to calculate this or what it
> means.  In the days before students mindlessly pushed buttons
> on their calculators, they mindlessly pushed pencils across
> pages of paper. 

I agree with Bob that some people learn to perform statistical 
computations without understanding what they are doing -- my 
point above does not contradict this point.  My point is that in 
the days before we had good computer software to perform statis-
tical computations, if one wished to perform a responsible sta-
tistical analysis, one had to understand the underlying mathemat-
ics.  This was necessary to ensure that the computations were 
performed correctly.  

Nowadays, as Bob implies, the need for understanding is still 
very much present.  But for students who are not majoring in sta-
tistics or mathematics, it is no longer necessary to attain 
MATHEMATICAL understanding.  This is because a computer can do 
all the standard mathematical computations of statistics, and 
generally do them very well.  What students need instead of 
mathematical understanding is "conceptual" understanding.

As I discuss in the 1999 paper, I believe we can give students a 
thorough conceptual understanding of the role of the field of 
statistics by showing them that statistics helps us to study 
variables and relationships between variables as a means to accu-
rate prediction and control.  A student need not understand the 
underlying mathematics of statistics to understand these simple 
ideas.

-------------------------------------------------------
Donald B. Macnaughton   MatStat Research Consulting Inc
[EMAIL PROTECTED]      Toronto, Canada
-------------------------------------------------------
 

APPENDICES AND REFERENCES

Because the appendices to this post are of less general interest, 
I give only their titles here:

Appendix A: How To Obtain a Higher-Resolution Copy of the Figure
   - this appendix is at 
        http://www.matstat.com/teach/p0043.htm#a

Appendix B: Continuous Versus Discrete Variables
   - this appendix is at 
        http://www.matstat.com/teach/p0043.htm#b

Appendix C: Offsetting Overlapping Points on Dot Plots
   - this appendix is at 
        http://www.matstat.com/teach/p0043.htm#c

Appendix D:  The Visual t-Test
   - this appendix is at 
        http://www.matstat.com/teach/p0043.htm#d

Appendix E: Other Methods for Displaying the Data in the Figure
   - the following methods are discussed:  parallel histograms, 
     comparison circles, diamond plots, and violin plots
   - this appendix is at 
        http://www.matstat.com/teach/p0043.htm#e

Appendix F: The Logistics Of Tracking Student Time Spent On A
            Course
   - this appendix is at 
        http://www.matstat.com/teach/p0043.htm#f


The full post with the low-resolution figure, appendices, and 
references is at 
 
          http://www.matstat.com/teach/p0043.htm




