The following article may be of interest to some of you who are trying to
get across the notion of reliability, particularly those who are teaching
H.S. or young college students who have recently gone through high-stakes
achievement/competency testing programs.  You can also download directly
from the New York Times web site at 

       http://www.nytimes.com/2000/09/13/national/13LESS.html
        

New York Times, September 13, 2000

LESSONS
How Tests Can Drop The Ball

By RICHARD ROTHSTEIN

MIKE PIAZZA, batting .332, could win this year's Most Valuable Player
award. He has been good every year, with a .330 career average, twice a
runner-up for M.V.P. and a member of each All-Star team since his
rookie season.

The Mets reward Piazza for this high achievement, at the rate of $13
million a year.

But what if the team decided to pay him based not on overall
performance but on how he hit during one arbitrarily chosen week? How
well do one week's at-bats describe the ability of a true .330 hitter?

Not very. Last week Piazza batted only .200. But in the second week of
August he batted .538. If you picked a random week this season, you
would have only a 7-in-10 chance of choosing one in which he hit .250
or higher.
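
For those teaching with this article, the weekly-average variability is
easy to simulate. Here is a minimal Python sketch, assuming roughly 25
at-bats per week and treating each at-bat as an independent coin flip
with success probability .330 (both are simplifications; real weeks
vary in length, and at-bats are not independent):

    import random

    random.seed(1)                     # reproducible classroom demo
    TRUE_AVG = 0.330                   # Piazza's "true" hitting ability
    AT_BATS_PER_WEEK = 25              # assumed; real weeks vary
    N_WEEKS = 100_000                  # number of simulated weeks

    def weekly_average():
        """Batting average over one simulated week of at-bats."""
        hits = sum(random.random() < TRUE_AVG
                   for _ in range(AT_BATS_PER_WEEK))
        return hits / AT_BATS_PER_WEEK

    weeks = [weekly_average() for _ in range(N_WEEKS)]
    p_250 = sum(avg >= 0.250 for avg in weeks) / N_WEEKS
    print(f"P(weekly average >= .250) = {p_250:.2f}")

Under these assumptions the simulated probability comes out somewhat
above the article's 7-in-10 figure, which is based on Piazza's actual
weekly splits; real weeks carry extra variation that the coin-flip
model ignores, which is itself a point worth making in class.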

Are standardized-test scores, on which many schools rely heavily to
make promotion or graduation decisions, more indicative of true ability
than a ballplayer's weekly average?

Not really. David Rogosa, a professor of educational statistics at
Stanford University, has calculated the "accuracy" of tests used in
California to abolish social promotion. (New York uses similar tests.)

Consider, Dr. Rogosa says, a fourth-grade student whose "true" reading
score is exactly at grade level (the 50th percentile). The chances are
better than even (58 percent) that this student will score either above
the 55th percentile or below the 45th on any one test.
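
Rogosa's "accuracy" figures can be approximated from the classical test
theory model, in which an observed score is a true score plus random
error and the test's reliability fixes the error variance. A sketch in
Python, assuming normally distributed scores and a reliability of .95
(an assumed figure; the column does not report the published one):

    from statistics import NormalDist

    std_normal = NormalDist()              # work in z-score units
    RELIABILITY = 0.95                     # assumed, not from the column
    error_sd = (1 - RELIABILITY) ** 0.5    # SD of the measurement error

    # A true score at the 50th percentile is z = 0, so the observed
    # score equals the error alone; it leaves the 45th-55th percentile
    # band whenever the error crosses the band's edges.
    z_55 = std_normal.inv_cdf(0.55)        # upper band edge, about 0.126
    p_outside = 2 * (1 - std_normal.cdf(z_55 / error_sd))
    print(f"P(outside the 45th-55th band) = {p_outside:.2f}")

Under these assumptions the probability comes out at about .57, close
to the 58 percent quoted above. Note what that implies: a reliability
of .95 sounds nearly perfect, yet single scores remain this noisy.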

Results for students at other levels of true performance are also
surprisingly inconsistent. So if students are held back, required to
attend summer school or denied diplomas largely because of a single
test, many will be punished unfairly.

About half of fourth-grade students held back for scores below the 30th
percentile on a typical reading test will actually have "true" scores
above that point. On any particular test, nearly 7 percent of students
with true scores at the 40th percentile can be expected to fail,
scoring below the 30th percentile.
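
Both figures in that paragraph can be explored with a short simulation
under the same model: draw a population of true scores, add measurement
error, and see who lands below the cutoff. Again the reliability is an
assumed .95, and this toy normal model will not reproduce Rogosa's
exact numbers, which come from the publisher's actual score
distributions:

    import random
    from statistics import NormalDist

    random.seed(2)
    std_normal = NormalDist()
    RELIABILITY = 0.95                    # assumed, as above
    TRUE_SD = RELIABILITY ** 0.5          # observed scores then have SD 1
    ERROR_SD = (1 - RELIABILITY) ** 0.5
    N = 200_000

    cut_obs = std_normal.inv_cdf(0.30)    # cutoff on observed scores
    cut_true = cut_obs * TRUE_SD          # same percentile, true scores

    trues = [random.gauss(0, TRUE_SD) for _ in range(N)]
    scores = [t + random.gauss(0, ERROR_SD) for t in trues]

    # Among students held back (observed score below the cutoff), what
    # fraction actually have true scores above the cutoff?
    held = [(t, s) for t, s in zip(trues, scores) if s < cut_obs]
    wrong = sum(t > cut_true for t, _ in held) / len(held)
    print(f"held back despite true score above cutoff: {wrong:.0%}")

    # Chance that a student with a true score at the 40th percentile
    # scores below the 30th-percentile cutoff on a single test.
    t40 = std_normal.inv_cdf(0.40) * TRUE_SD
    p_fail = std_normal.cdf((cut_obs - t40) / ERROR_SD)
    print(f"true-40th-percentile student fails: {p_fail:.0%}")

With these settings the simulated error rates come out smaller than
Rogosa's published ones; lowering the assumed reliability moves them
up. Either way the qualitative lesson survives: a hard cutoff applied
to a single noisy score misclassifies many students near the line.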

Are Americans prepared to require large numbers of students to repeat a
grade when they deserve promotion?

Professor Rogosa's analysis is straightforward. He has simply converted
technical reliability information from test publishers (Harcourt
Educational Measurement, in this case) to more understandable
"accuracy" guides.

Test publishers calculate reliability by analyzing thousands of student
tests to estimate chances that students who answer some questions
correctly will also answer others correctly. Because some students at
any performance level will miss questions that most students at that
level get right, test makers can estimate the reliability of each
question and of an entire test.
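
The internal-consistency idea described here is commonly summarized by
a statistic such as Cronbach's alpha (the column names no statistic;
alpha is one standard choice, shown here on made-up data):

    # Cronbach's alpha from a students-by-items matrix of 0/1 answers.
    from statistics import pvariance

    def cronbach_alpha(responses):
        """responses: one list per student, one 0/1 entry per item."""
        k = len(responses[0])                        # number of items
        item_vars = [pvariance([row[i] for row in responses])
                     for i in range(k)]
        total_var = pvariance([sum(row) for row in responses])
        return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

    toy = [                  # fabricated responses, illustration only
        [1, 1, 1, 0, 1],
        [1, 0, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 0],
    ]
    print(f"alpha = {cronbach_alpha(toy):.2f}")      # about 0.65 here

Commercial achievement tests typically report reliabilities near .90
or higher; the column's point is that even such values leave individual
percentile ranks uncertain.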

Typically, districts and states use tests marketed as having high
reliability. Yet few policy makers understand that seemingly high
reliability assures only rough accuracy: for example, that true 80th
percentile students will almost always have higher scores than true
20th percentile students.

But when test results are used for high-stakes purposes like promotion
or graduation decisions, there should be a different concern: How well
do they identify students who are truly below a cutoff point like the
30th percentile? As Dr. Rogosa has shown, a single test administration
may do a poor job of this.

Surprisingly, there has not yet been a wave of lawsuits by parents of
children penalized largely because of a single test score. As more
parents learn about tests' actual accuracy, litigation regarding
high-stakes decisions is bound to follow. Districts and states will
then have to abandon an unfair reliance on single tests to evaluate
students.

When Mike Piazza comes to bat, he may face a pitcher who fools him more
easily than most pitchers do, or fools him more easily on that day.
Piazza may not have slept well the night before, the lights may bother
him, or he may be preoccupied by a problem at home. On average, over a
full season, the distractions do not matter much, and the Mets benefit
from his overall ability.

Likewise, when a student takes a test, performance is affected by
random events. He may have fought with his sister that morning.  A test
item may stimulate daydreams not suggested by items in similar tests,
or by the same test on a different day. Despite a teacher's warning to
eat a good breakfast, he may not have done so.

If students took tests over and over, average accuracy would improve,
just as Mike Piazza's full-season batting average more accurately
reflects his hitting prowess. But school is not baseball; if students
took tests every day, there would be no time left for learning.
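
The gain from repeated testing can be made precise with the
Spearman-Brown formula, which gives the reliability of the average of
n parallel test administrations. A quick sketch, again with an assumed
single-test reliability of .95:

    def spearman_brown(reliability, n):
        """Reliability of the average score over n parallel forms."""
        return n * reliability / (1 + (n - 1) * reliability)

    for n in (1, 2, 5, 10):
        print(f"{n:2d} administrations: "
              f"reliability {spearman_brown(0.95, n):.3f}")

Averaging drives the error down roughly like one over the square root
of n, which is exactly why a full season's batting average is
trustworthy when one week's is not.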

So when making high-stakes decisions, like whether students should be
promoted or attend summer school, giving great weight to a single test
is not only bad policy but extraordinarily unfair. Courts are unlikely
to permit it much longer.

Copyright 2000 The New York Times Company


