I have some two-stage population surveys that I need to analyze. If you'll 
bear with me, I'll work through the possible designs from simple to 
complex. This is to achieve clarity for myself and not waste your time. I 
hope some less proficient readers will benefit, too. I'll assume ignorable 
nonresponse.

The practical issue is whether it is better to use:
1. S-PLUS (Schafer's software) for multiple imputation, or
2. SUDAAN, which is specialized for analysing two-stage surveys (but does 
not address non-response in general).

"Ordinary" survey analysis software accounts for:
a. stratified sampling in the first or second stage
b. differential sampling probabilities within strata
c. clustering
d. a two-stage design (SUDAAN only)

As far as I know, multiple imputation gives you the same capabilities, and 
in addition it explicitly specifies a non-response model and accounts for 
non-response. I don't know how it handles clustering, since my data have 
none.

In the first stage, a simple random sample is drawn from a finite 
population. A postal questionnaire is sent out and returned. Let X be 
covariates that are observed for everyone (e.g., age and gender) but are 
not design variables. Let Y be the questionnaire data that potentially have 
missing values (e.g., wheezing 0/1, asthma 0/1). Let R indicate response 
(0/1).

Case 1: full response from everybody in the sample
Relative to a census, the available data are MCAR, and a complete-data 
analysis is valid.

Case 2: some non-response, fully random
Relative to case 1 (full data), the data are MCAR, and a complete-data 
analysis is valid. However, you will lose power due to the exclusion of 
subjects with only X observed. If you use a multiple imputation approach, 
you do not throw away the X values that you do know. This will provide some 
increase in precision.

Case 3: some non-response, related to X and only X
I.e., the data are missing at random within categories of X.
Multiple imputation is appropriate, with a simple model that uses only X, R 
and Y to impute.
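
To make case 3 concrete, here is a minimal sketch in Python (not S-PLUS, 
and not Schafer's software). It assumes a single categorical X and a binary 
Y, imputes within categories of X with an approximate Bayesian bootstrap, 
and pools the prevalence estimates with Rubin's rules. The toy data, the 
category names and the function names are all made up for illustration.

```python
import random
from statistics import mean, variance

def abb_impute(records, rng):
    """Fill in missing y within each category of x (approximate Bayesian bootstrap).

    Assumes every category of x has at least one observed y.
    """
    observed_by_x = {}
    for x, y in records:
        if y is not None:
            observed_by_x.setdefault(x, []).append(y)
    # Resample the observed values first, so that uncertainty about the
    # donor pool itself is carried into the imputations.
    donors = {x: [rng.choice(obs) for _ in obs] for x, obs in observed_by_x.items()}
    return [(x, y if y is not None else rng.choice(donors[x])) for x, y in records]

def prevalence(completed):
    """Point estimate and simple-random-sampling variance of the prevalence of y."""
    ys = [y for _, y in completed]
    p = mean(ys)
    return p, p * (1 - p) / len(ys)

def pool_rubin(estimates, variances):
    """Rubin's rules: pooled estimate and total variance over m imputations."""
    m = len(estimates)
    between = variance(estimates)  # sample variance, denominator m - 1
    return mean(estimates), mean(variances) + (1 + 1 / m) * between

# Hypothetical toy sample: response depends on x only (case 3).
rng = random.Random(1)
records = []
for x, p_wheeze, p_respond in [("young", 0.10, 0.8), ("old", 0.30, 0.5)]:
    for _ in range(200):
        y = 1 if rng.random() < p_wheeze else 0
        records.append((x, y if rng.random() < p_respond else None))

results = [prevalence(abb_impute(records, rng)) for _ in range(5)]
estimate, total_var = pool_rubin([q for q, _ in results], [u for _, u in results])
print(round(estimate, 3), round(total_var, 5))
```

A real analysis would use a proper imputation model; the sketch only shows 
where X enters (the donor pools) and how the 5 completed data sets are 
summarized.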

Now, in a second stage, a simple random sample is drawn from the 
*non-responders* in the first stage. Let I indicate whether or not a person 
is selected for the follow-up.

Case 4: some non-response in the follow-up stage, related to X and only X
Multiple imputation is appropriate, with a simple model that uses X, R, I 
and Y to impute.
Now, do I need to "weight up" the selected respondents? In "ordinary" 
survey-analysis software you would have to. But as I understand it, you 
impute the answers for the non-selected persons instead of weighting the 
selected people. So I think I don't need the weights. Case 4 is what often 
happens in surveys, and so should be of some interest.
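
For comparison, this is roughly what the weighting alternative amounts to. 
It is a minimal sketch with made-up numbers, and it assumes for simplicity 
that everyone selected for follow-up responds (so it does not cover the 
follow-up non-response of case 4): each followed-up person stands in for 
the first-stage non-responders who were not selected.

```python
def weighted_prevalence(first_stage_y, followup_y, n_nonresponders):
    """Prevalence in which the follow-up sample represents all non-responders."""
    weight = n_nonresponders / len(followup_y)  # inverse of the follow-up fraction
    numerator = sum(first_stage_y) + weight * sum(followup_y)
    denominator = len(first_stage_y) + weight * len(followup_y)
    return numerator / denominator

# Hypothetical numbers: 600 responders (10% wheeze), 400 non-responders,
# of whom 100 are followed up (20% wheeze), so each of them weighs 4.
first_stage_y = [1] * 60 + [0] * 540
followup_y = [1] * 20 + [0] * 80

weighted = weighted_prevalence(first_stage_y, followup_y, n_nonresponders=400)
unweighted = (sum(first_stage_y) + sum(followup_y)) / (len(first_stage_y) + len(followup_y))
print(weighted, unweighted)
```

By hand, the weighted estimate is (60 + 4 * 20) / 1000 = 0.14, against 
80 / 700 if the followed-up people are simply pooled with the responders.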

Another scenario is one in which you do an inexpensive questionnaire on 
the entire sample, and do biological measurements on a subsample. As long 
as the follow-up sampling is simple random, case 4 applies. However, 
sometimes you want to draw a stratified sample depending on the answers in 
the previous questionnaire:

Case 5: instead of a simple random sample drawn from the non-responders, 
draw a _stratified sample_ with differential sampling probabilities, 
depending on Y.  E.g., select 50% of the wheezers and 10% of the 
non-wheezers. Let RR be the variable indicating response in the clinical 
followup, and let YY be the new data that you get (e.g. lung function).
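
A minimal sketch of the case 5 selection step, assuming a binary Y 
(wheezing) that is known for the people being sampled. The 50% and 10% 
fractions are the ones from the example above; the population sizes and 
names are hypothetical. Each selected person gets the inverse of his or her 
selection probability as a design weight, which is what "ordinary" survey 
software would use.

```python
import random

SELECTION_PROB = {1: 0.50, 0: 0.10}  # wheezers, non-wheezers

def select_for_followup(y_values, rng):
    """Return (index, design weight) pairs for the people drawn into the follow-up."""
    selected = []
    for i, y in enumerate(y_values):
        if rng.random() < SELECTION_PROB[y]:
            selected.append((i, 1.0 / SELECTION_PROB[y]))
    return selected

rng = random.Random(42)
y_values = [1] * 100 + [0] * 900  # hypothetical: 100 wheezers, 900 non-wheezers
sample = select_for_followup(y_values, rng)

# Weighted counts estimate the sizes of the groups the follow-up was drawn from.
estimated_total = sum(w for _, w in sample)
estimated_wheezers = sum(w for i, w in sample if y_values[i] == 1)
print(len(sample), estimated_total, estimated_wheezers)
```

Wheezers end up with weight 2 and non-wheezers with weight 10. In the 
imputation approach, Y would instead have to enter the imputation model for 
YY.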

This is the case that I need to crack now. Conceptually, you would imagine 
two stages:
a) impute Y depending on X and R. Thus, you fill out the missing data in 
the first stage. Then:
b) impute YY depending on X, R, Y (imputed and real), I and RR. Thus, you 
fill out the missing data in the second stage.

However, if you do steps a) and b) only once, you'll lose the uncertainty 
that comes with imputation. You could do 5 imputations in step a), and for 
each of these do 5 imputations in step b). You'd end up with 25 data sets, 
and I'm not sure how to summarize them. Does anybody have any good 
suggestions for case 5? Can you do it all in a single model? This design is 
quite common in my field, and so should be of some general interest.
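
To make the bookkeeping concrete, here is a minimal sketch of the 5 x 5 
scheme in Python. Everything in it is an assumption for illustration: the 
toy data, the lung function values, the function names, and the use of an 
approximate Bayesian bootstrap that ignores X. It only lays out the 25 
estimates as 5 nests of 5 and separates the variation between nests from 
the variation within nests; it does not settle how the 25 should be 
combined.

```python
import random
from statistics import mean, variance

def abb_draw(observed, n_missing, rng):
    """Approximate Bayesian bootstrap: resample donors, then draw imputations."""
    donors = [rng.choice(observed) for _ in observed]
    return [rng.choice(donors) for _ in range(n_missing)]

def impute_y(records, rng):
    """Step a): fill in missing Y (ignoring X for brevity)."""
    observed = [r["y"] for r in records if r["y"] is not None]
    n_missing = sum(r["y"] is None for r in records)
    draws = iter(abb_draw(observed, n_missing, rng))
    return [dict(r, y=r["y"] if r["y"] is not None else next(draws)) for r in records]

def impute_yy(records, rng):
    """Step b): fill in missing YY from donors with the same (real or imputed) Y."""
    draws = {}
    for y in (0, 1):
        observed = [r["yy"] for r in records if r["y"] == y and r["yy"] is not None]
        n_missing = sum(r["y"] == y and r["yy"] is None for r in records)
        draws[y] = iter(abb_draw(observed, n_missing, rng))
    return [dict(r, yy=r["yy"] if r["yy"] is not None else next(draws[r["y"]]))
            for r in records]

# Hypothetical toy data: Y observed for responders only, YY measured only for
# responders selected with probability 0.5 (wheezers) or 0.1 (non-wheezers).
rng = random.Random(7)
records = []
for _ in range(500):
    y = 1 if rng.random() < 0.2 else 0
    responded = rng.random() < 0.7
    selected = responded and rng.random() < (0.5 if y else 0.1)
    yy = rng.gauss(80 if y else 100, 10) if selected else None
    records.append({"y": y if responded else None, "yy": yy})

M, N = 5, 5  # 5 imputations of Y, and 5 imputations of YY within each
grid = []
for _ in range(M):
    stage_a = impute_y(records, rng)
    grid.append([mean(r["yy"] for r in impute_yy(stage_a, rng)) for _ in range(N)])

nest_means = [mean(row) for row in grid]
between_nests = variance(nest_means)
within_nests = mean(variance(row) for row in grid)
print(round(mean(nest_means), 2), between_nests, within_nests)
```

The point of the layout is that the 25 data sets are not exchangeable: the 
5 within a nest share the same imputed Y, so any summary has to treat the 
two sources of variation separately.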

The bottom line is: do I need SUDAAN? I really don't think so.

Yours gratefully,

Jan Brogger
PhD student, Respiratory Epidemiology Group, University of Bergen, Norway
