Hi,

I was wondering whether anyone could help me with the following: as part of my MSc I 
am applying several different stochastic and deterministic missing-data methods to an 
example data set, from which approximately 5%, 10% or 20% of the values have been 
removed at random, with five data sets generated at each percentage, so 15 in all. 
The basic methods of complete-case analysis, available-case analysis, unconditional 
mean imputation, one-predictor conditional mean imputation and Buck's method (1960) 
have worked with varying degrees of success, all of which is expected/explainable - I 
mention this because I believe it shows that my problem is not inherent in the data sets. 
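(For reference, deleting roughly 10% of the x values completely at random amounts to a 
data step along the following lines - this is only a sketch, the data set name 
'complete' and the seed are placeholders, and repeating it with different seeds gives 
the five data sets at each percentage:)

data miss10;
   set complete;                 /* placeholder name for the full data set */
   array xs {6} x1-x6;
   do j = 1 to 6;
      if ranuni(0) < 0.10 then xs{j} = .;   /* roughly 10% MCAR in each x */
   end;
   drop j;
run;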

The problem is as follows: I ran EM in SAS, substituted the MLEs obtained into the 
standard regression formulae (given by Little, 1992, JASA 87), and then used these 
formulae to predict the y values (only x values were removed originally). Most of the 
15 data sets work very well, giving residuals highly comparable to those from the 
regression analysis of the complete data set. However, one does not (one of the 10% 
cases), and this case happens (coincidentally??) to be one where EM takes a long time 
to converge compared with the others. Further investigation revealed three cases (the 
one above and two of the 20% cases, both of which also converge slowly) where EM 
behaves strangely: -2logL decreases at first, then at some point (for no apparent 
reason) jumps up, and then proceeds to either decrease or increase until convergence 
is declared.
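
To be concrete about the regression step: substituting the MLEs into the regression 
formulae amounts to taking beta = Sxx^(-1)*Sxy and intercept = mean(y) - mean(x)'*beta 
from the EM mean vector and covariance matrix. In PROC IML this is roughly the 
following - a sketch only, which assumes the outem data set has the usual 
_TYPE_='MEAN'/'COV' layout with the variables in the order x1-x6, y:

proc iml;
   /* Sketch: regression of y on x1-x6 from the EM estimates.          */
   /* Assumes outem has one _TYPE_='MEAN' row and _TYPE_='COV' rows    */
   /* in the order x1-x6, y (this should be checked against outem).    */
   use outem;
   read all var {x1 x2 x3 x4 x5 x6 y} into mu    where(_TYPE_="MEAN");
   read all var {x1 x2 x3 x4 x5 x6 y} into Sigma where(_TYPE_="COV");
   close outem;
   Sxx   = Sigma[1:6, 1:6];          /* covariance of the x's          */
   Sxy   = Sigma[1:6, 7];            /* covariance of the x's with y   */
   beta  = solve(Sxx, Sxy);          /* slopes                         */
   beta0 = mu[,7] - mu[,1:6]*beta;   /* intercept                      */
   print beta0 beta;
quit;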

My question is whether anyone has met such behaviour (which I understood was not 
possible with EM), or whether anyone has any advice on how to proceed further - 
alternative code, ways of discovering the cause, or anything else. Sorry this is so 
long, but I hope I have explained my problem. Also, if this is SAS's way of saying 
that convergence is not possible, then why do the values for two of the cases still 
result in very good regression formulae?

Many thanks for any help,
Fay Hosking (nee Hughes)
Statistics MSc Student
School of Mathematical and Statistical Sciences
University of Natal, South Africa

If you are interested, the values below are for the three cases: the first line gives 
the -2logL behaviour from the starting value up to the point at which the oddity 
occurs, and the second line gives the behaviour from there to convergence (iteration 
numbers in brackets).

Case 1 (20% missing, regression results reasonable)
(0) 1394.38905 decreases to 998.301458 (239)
(240) 1032.937006 increases to 1032.937863 (397)

Case 2 (20% missing, regression results reasonable)
(0) 1393.69312 decreases to 1012.435198 (170)
(171) 1042.160788 increases to 1042.202897 (279)

Case 3 (10% missing, regression results not comparable - at all!)
(0) 1450.236445 decreases to 996.23881 (234)
(235) 1036.367008 decreases to 1036.366283 (374)
----- Perhaps it's worth mentioning that the mean vector and variance-covariance 
matrix from this case do not appear, individually, to be badly estimated, but in 
combination (in the regression formulae) they clearly are.
The other 13 data sets all converge in 210 iterations or fewer - is this important??

The SAS code is very short, but the relevant section is:

proc mi data = array nimpute = 0 noprint;
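  /* EM only (NIMPUTE=0, no imputed data sets): ITPRINT prints -2logL  */
  /* at each iteration and OUTEM= saves the final ML estimates of the  */
  /* mean vector and covariance matrix.                                */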
  em itprint converge = 1E-8 outem = outem maxiter = 400;
  var x1 x2 x3 x4 x5 x6 y;
run;

proc print data = outem;
     title 'EM estimates';
run;
