Hi,

I was wondering whether anyone could help me with the following. As part of my
MSc I am applying several stochastic and deterministic missing-data methods to
an example data set, with approximately 5%, 10% and 20% of the data removed at
random; each percentage gives 5 data sets, so 15 in all. The basic methods -
complete-case analysis, available-case analysis, unconditional mean imputation,
single-predictor conditional mean imputation and Buck's method (1960) - have
worked with varying degrees of success, all of which are expected/explainable.
I mention this because I believe it means my problem is not inherent in the
data sets.

The problem is as follows. I run EM in SAS and substitute the MLEs obtained
into the standard regression formulae (given by Little 1992, JASA 87), then use
those formulae to predict the y values (only x values were removed originally).
Most of the 15 data sets work very well, giving residuals highly comparable to
those from the regression on the complete data set. However, one does not (one
of the 10% cases), and this case happens (coincidentally??) to be one where EM
takes a long time to converge compared with the others. Further investigation
revealed three cases (the one above plus two 20% cases, both of which also
converge slowly) where EM behaves strangely: -2logL decreases at first, then at
some point jumps up for no apparent reason, and from there either decreases or
increases until convergence is declared.
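
In case it helps to see exactly what I mean by "substituting the MLEs into the
regression formulae", the sketch below shows the calculation written in PROC
IML. The mu and Sigma values are placeholders only - in my runs they come from
the OUTEM data set, with the variables ordered x1-x6, y - so please read it as
an illustration of the formulae rather than my exact code.

proc iml;
  /* Placeholder MLEs for illustration; the real values come from the EM output, */
  /* with the variables in the order x1-x6, y.                                   */
  mu    = {0, 0, 0, 0, 0, 0, 0};     /* EM estimate of the mean vector       */
  Sigma = I(7);                      /* EM estimate of the covariance matrix */

  p     = nrow(mu) - 1;              /* number of predictors (x1-x6)         */
  Sxx   = Sigma[1:p, 1:p];           /* covariances among the x's            */
  Sxy   = Sigma[1:p, p+1];           /* covariances of the x's with y        */
  beta  = solve(Sxx, Sxy);           /* slopes: inv(Sxx) * Sxy               */
  alpha = mu[p+1, 1] - t(mu[1:p, 1]) * beta;  /* intercept: mu_y - mu_x'beta */
  print alpha beta;
quit;

The predicted y for each observation is then alpha + x*beta, and these
predictions are what I compare with the complete-data regression.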

My question is whether anyone has met such behaviour (which I understood was
not possible with EM, since the observed-data likelihood should never decrease
from one iteration to the next, i.e. -2logL should never jump up), or has any
advice on how to proceed: alternative code, ways to discover the cause, or
anything else. Sorry this is so long, but I hope I have explained my problem
clearly. Also, if this is SAS's way of saying that convergence is not possible,
why do two of the three cases still result in very good regression formulae?

Many thanks for any help,
Fay Hosking (nee Hughes)
Statistics MSc Student
School of Mathematical and Statistical Sciences
University of Natal, South Africa

If anyone is interested, the values below are the -2logL traces for the three
cases. The first line of each runs from the starting value to the point where
the oddity occurs, and the second line shows the behaviour from there to
convergence (iteration numbers in brackets).

Case 1 (20% missing, regression results reasonable)
(0) 1394.38905 decreases to 998.301458 (239)
(240) 1032.937006 increases to 1032.937863 (397)

Case 2 (20% missing, regression results reasonable)
(0) 1393.69312 decreases to 1012.435198 (170)
(171) 1042.160788 increases to 1042.202897 (279)

Case 3 (10% missing, regression results not comparable - at all!)
(0) 1450.236445 decreases to 996.23881 (234)
(235) 1036.367008 decreases to 1036.366283 (374)
----- Perhaps it's worth mentioning that the mean vector and
variance-covariance matrix from this case do not appear to be badly estimated
individually, but in combination (in the regression formulae) they clearly are.
The other 13 data sets all converge in 210 iterations or fewer - is this
important??

The SAS code is very short, but the relevant section is:

proc mi data = array nimpute = 0 noprint;   /* nimpute=0: EM only, no imputations     */
  em itprint converge = 1E-8                /* itprint: -2logL at each iteration      */
     outem = outem maxiter = 400;           /* outem: MLE mean and covariance matrix  */
  var x1 x2 x3 x4 x5 x6 y;
run;

proc print data = outem;
     title 'EM estimates';
run;
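
If it is of any use, the direction I was thinking of taking next is to rerun
the problem cases with a higher iteration limit and write the estimates out at
every iteration, to see whether the means and covariances jump at the same
iteration as -2logL does. I believe the EM statement has an OUTITER= option for
this, but I have not used it before, so the call below (and the emtrace data
set name) is my reading of the documentation rather than tested code:

proc mi data = array nimpute = 0 noprint;
  /* OUTITER= (assumed option) should save the estimates from every EM iteration */
  em itprint converge = 1E-8 maxiter = 600 outem = outem outiter = emtrace;
  var x1 x2 x3 x4 x5 x6 y;
run;

proc print data = emtrace;
  title 'EM estimates by iteration';
run;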
