There is presumably a reason (you haven't told us otherwise, anyway)
for using 4 items instead of the entire original scale of 20-some.
Might be useful to state that reason somewhere.
Consult your favourite measurement textbook: there's a formula
(Spearman-Brown) for estimating the reliability of a measure as a
function of the number of items, given an empirical estimate of the
reliability of a measure made up of a given number of items. Starting
with the known reliability of the full scale (and you _should_ know
exactly both the number of items on that scale and its published
reliability -- if in presenting your results it became evident that you
DIDN'T have that information cold, _I_ would have been a bit frosty
with you had I been either on your committee or invited to be present
and comment!), estimate the reliability to be expected of a measure
consisting of only 4 items. My guess is that the estimated reliability
would not exceed the 0.3 you report empirically.
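The Spearman-Brown calculation described above is easy to sketch. The numbers below are illustrative assumptions, not the poster's actual figures: a hypothetical full-scale reliability of 0.85 for a 20-item scale, shortened to 4 items.

```python
def spearman_brown(rho, ratio):
    """Spearman-Brown prophecy formula: predicted reliability when a
    scale's length is multiplied by `ratio` (ratio < 1 shortens it)."""
    return ratio * rho / (1 + (ratio - 1) * rho)

# Assumed numbers (the poster did not report the published reliability):
full_rho = 0.85      # hypothetical reliability of the full 20-item scale
ratio = 4 / 20       # keeping only 4 of the 20 items
print(round(spearman_brown(full_rho, ratio), 3))
```

Under these assumed inputs the predicted reliability of the 4-item version is about 0.53 -- the point being that one can compute, rather than guess, what alpha a shortened scale "should" show.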
On Tue, 7 Dec 1999, Magill, Brett wrote:
> Just wanted people's thought on the following:
>
> I am a graduate student in sociology studying individual's perceptions
> of control (locus of control) using existing data.
By "existing data" do you mean that you're conducting a secondary
analysis of data originally collected for some other purpose? If that be
the case, the decision to include only four items must have predated your
association with the data; I fail to see why you would be criticized for
summing them, since the choice of "four" was not yours. And if as I
suspect the reliability you report is about what one would expect, it's
not at all clear why "your peers" would bother to criticize (unless
they're addicted to nit-picking, in which case there's hardly any reason
to pay them any attention).
You also do not mention whether the summing of these four items
produced a variable that showed some encouraging utility. I'm inclined
to suspect you wouldn't be raising these questions if it didn't, though.
The fact that a measure (however apparently "unreliable" it may be)
shows a useful validity is itself evidence that the measure is useful.
[One should bear in mind also that reliability is not, strictly
speaking, a characteristic merely of a measure -- it's a characteristic
of the measure IN THE CONTEXT IN WHICH IT WAS USED. That is, it
reflects BOTH some feature(s) of the measure AND some feature(s) of the
sample of respondents on which the measure was used. Use the same
measure on a different population and you'll get a different value of
alpha. (And different validity as well, but that's another story.)]
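Since alpha is recomputed from each sample's data, it is worth being explicit about what is being computed. A minimal sketch of Cronbach's alpha from a respondents-by-items matrix (the data here are made up for illustration):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: items is a 2-D array, rows = respondents,
    columns = items (already reverse-scored where necessary)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)         # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from 6 people on 4 Likert-type items:
data = np.array([[4, 3, 4, 5],
                 [2, 2, 1, 2],
                 [5, 4, 5, 4],
                 [3, 3, 2, 3],
                 [1, 2, 2, 1],
                 [4, 5, 4, 4]])
print(round(cronbach_alpha(data), 2))
```

Run the same function on item responses from a different population and the item variances and sum-score variance change, so alpha changes with them -- which is the point of the bracketed remark above.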
> The data set includes four items to measure this construct which were
> taken from a larger scale of more than twenty, the larger scale
> reaching an acceptable level of reliability (I do not know the exact
> level, but it is a widely researched and used instrument) ...
(As remarked above, you SHOULD know. Exactly. And you should
know something about the population on a sample of which the published
reliability was based. YOUR population might or might not be comparable
-- and if not, the unlikeness might naturally tend toward low alpha
values.)
> ... in previous research. The four items that were included were
> selected as the best measures of the construct based on empirical
> evidence (item-total correlations, factor analysis).
> In my own research, I used these items and decided to sum responses
> across these four likert-type items. However, the Alpha reliability is
> very low 0.30 (items were reverse scored as necessary and coding was
> double-checked).
This may not be unreasonably low...
> I defended the decision to sum the items, despite the low Alpha, based
> on the fact that they were selected from a larger set of items which
> are internally consistent. In presenting my findings, I was heavily
> criticized for this decision.
	Who were the critics?  Below you refer to "my peers":  fellow 
graduate students?  (They tend to be the worst critics.  Not enough 
experience to have a sense of proportion about things, and nothing to lose 
by being hypercritical;  they may even perceive themselves to be "making 
points" in some sense.)  Faculty members?  (What were the reactions of your 
own committee members, both to your presentation and to the criticisms 
that arose?)  What alternatives, if any, were offered, and how were they 
justified? (If none, I'd be inclined to suspect the critics of expressing
unhappiness that the research reported wasn't the research they wish had
been carried out. Those criticisms one can ignore. It is not proper to
criticize an orange for not being an apple.)
> Now, I could use individual items and a procedure such as logistic
> regression (I was using GLM before with this scale as the dependent and
> a sample of better than 5000) without changing my conclusions (I ran
> logistic models anticipating the criticism), however I was not
> convinced that this is necessary.
Doubtless there are all sorts of things you _could_ do. But to
what end? Is the question you want to ask of the universe of discourse
answered, at least in part, by the analyses you chose to report? Would
that question (or those questions) be any better answered by another
procedure? Would interpretationsof the results of another procedure be
more readily understood by your audience? [One has suspicions...]
> My question is, is summing these items defensible or at least as
> defensible as summing any set of likert-type items to produce a single
> score?
Seems to me the problem hinges on the small (tiny, really) number
of items. If that small number is enough to produce interesting results,
that fact is itself interesting. (Given your alpha of 0.3, it might be
interesting to estimate the number of like items required to generate an
alpha of, say, 0.8 or 0.9. Or whatever value is considered "respectable"
by the fashion-setters of your colleagues. That could be presented as
the first step in the design of a follow-up study to confirm and extend
the present results (assuming, of course, that they're worth confirming
and extending).)
If, contrary to what I thought I understood, your research was
devised and carried out by you (instead of making use of existing data),
it may be fair game to fault you for using so few items. As remarked
above, you could have predicted that the measured reliability would be
low, and should have addressed that potential problem in the proposal,
either by using more items than 4, OR by addressing the intrinsic nature
of "reliability" and its ramifications in this context.
> Where could I find support for what I am doing if it is (clearly
> my peers won't just take my word for it)?
Perhaps some of the previous remarks will suggest avenues you
might pursue ...
-- DFB.
------------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
MSC #29, Plymouth, NH 03264 603-535-2597
184 Nashua Road, Bedford, NH 03110 603-471-7128