I think it is more of an implementation issue than a methodological one. The multivariate normal method needs an initial draw of the parameters (the mean and the covariance matrix), then draws the missing values conditional on the observed data and the drawn parameters, and thus begins the Gibbs cycle. If we have the ability to set some covariances to zero and draw the others in the initial draw, then we should be OK. If you were to use WinBUGS, for example, and provide the initial values of the parameters, then you should be OK. Probably you can do the same in PROC MI as well.
Raghu

From: Impute -- Imputations in Data Analysis [[email protected]] on behalf of Paul von Hippel [[email protected]]
Sent: Tuesday, September 20, 2011 2:17 PM
To: [email protected]
Subject: Re: Imputing panel data, constraining correlations at long lags

Raghu -- This is a great suggestion, thank you!

What surprises me here is the suggestion that a chained-equation approach can solve this problem, but a multivariate normal approach cannot. I had thought the two were equivalent for normal data. It seems like a substantial advantage if the chained-equation approach can handle more difficult patterns of missingness. Can you say a little about what gives the chained-equation approach this advantage?

Best wishes,
Paul von Hippel

On Tue, Sep 20, 2011 at 12:12 PM, Raghunathan, Trivellore <[email protected]> wrote:

There are two possible ways to conceptualize this problem and use one of the MI software packages. Suppose that R stands for reading and M stands for math; F, W, and S stand for Fall, Winter, and Spring; and the number stands for the year.

Option 1: Arrange the data as

Subject-A  RF1 RW1 RS1 MF1 MW1 MS1 RF2 RW2 RS2 MF2 MW2 MS2
Subject-B  RF2 RW2 RS2 MF2 MW2 MS2 RF3 RW3 RS3 MF3 MW3 MS3
Subject-C
Subject-D

This approach will create an n x 72 completed data matrix. You can drop the imputations in the non-administered portion of the data set for some analyses or retain them, especially in cross-sectional analysis. The partial correlation between ab1 and cd3 will be practically zero when IVEware is used. We have tested this by using IVEware on a "file-matching" pattern of missing data.

Option 2: Though I am not sure, one may be able to use the following structure under some assumptions:

Subject A:  RF1 RW1 RS1 MF1 MW1 MS1 RF2 RW2 RS2 MF2 MW2 MS2  Year=1.5
Subject B:  RF2 RW2 RS2 MF2 MW2 MS2 RF3 RW3 RS3 MF3 MW3 MS3  Year=2.5
Subject C:  RF3 RW3 RS3 MF3 MW3 MS3 RF4 RW4 RS4 MF4 MW4 MS4  Year=3.5

Use year as a covariate and possibly some interactions.
This assumes that the regression relationships are stable over time and that the residual covariance matrix is block diagonal with a common 12 by 12 block. My own preference is to use Option 1 if the sample size is large and Option 2 if the sample size is small. Interesting problem.

Raghu

From: Impute -- Imputations in Data Analysis [[email protected]] on behalf of Paul von Hippel [[email protected]]
Sent: Tuesday, September 20, 2011 10:53 AM
To: [email protected]
Subject: Re: Imputing panel data, constraining correlations at long lags

Thanks, I thought a little about this. It's not obvious to me what the prior would be. Any recommendations?

On Tue, Sep 20, 2011 at 9:41 AM, Juned Siddique <[email protected]> wrote:

Hi Paul,

If you use a Bayesian approach like Proc MI for the problem below, the posterior correlation between wave 1 and 3 is just the prior correlation. So one approach might be to use an informative prior for the covariance matrix, which you can do in Proc MI.

-Juned

From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Paul von Hippel
Sent: Tuesday, September 20, 2011 8:21 AM
To: [email protected]
Subject: Re: Imputing panel data, constraining correlations at long lags

Thanks, Dave. You've come up with a nicely simplified version of my problem.

Suppose I had only three waves of data, with every subject missing either wave 1 (your pattern A) or wave 3 (your pattern B). Ordinarily I would put the data in wide format --

A  O1 O2 M3
B  M1 O2 O3

-- and impute using a multivariate normal model. However, I don't think that would work in this case because the MVN model would want to estimate the correlation between wave 1 and wave 3, and there are no cases where both wave 1 and wave 3 are observed.
However, if I could tell the software that this was, say, an AR(1) process -- or, equivalently, that the partial correlation between waves 1 and 3 is zero -- I'd be in business. This could be done using MVN software that allowed me to impose constraints on the covariance matrix, or imputation software for serially correlated data. Does such software exist?

Best,
Paul

________________________________
From: David Judkins <[email protected]>
To: [email protected]
Sent: Tuesday, September 20, 2011 7:25 AM
Subject: Re: Imputing panel data, constraining correlations at long lags

Paul,

This sounds pretty challenging. Reminds me of Andrew Gelman's JSM talk and 1998 JASA paper on imputation of questions not asked. It also reminds me of a remark some speaker made this year at JSM about almost all natural processes being Markov chains. Not sure I buy that, but I think he meant that if you have a rich enough state vector, then one past observation is all you need. Of course, that would be trivially true if the state vector contained lagged latent values. In this case, I doubt your state vector is rich enough to compensate for the brevity of the student-level time series, but I guess you have to work with what you have.

Whatever you do I imagine will involve a lot of custom programming. However, you might be able to use Raghu's IVEware on a series of specially reshaped versions of your data. For example, to impute year 3 for subject A and year 1 for subject B, you might create a dataset with only A and B records in it shaped like this:

A  O1 O2 M3
B  M1 O2 O3

Once that was done, you could proceed to imputing Year 4 on A and B records and Year 2 on C records with a dataset shaped from B and C records as

A  O2 I3 M4
B  O2 O3 M4
C  M2 O3 O4

And so on. At the end of that, you would have 4 observed/imputed years per subject. There should then be a way to generalize to more than 4 per subject. Not very elegant, but it might work.
--Dave

________________________________
From: Impute -- Imputations in Data Analysis [[email protected]] on behalf of Paul von Hippel [[email protected]]
Sent: Monday, September 19, 2011 5:58 PM
To: [email protected]
Subject: Imputing panel data, constraining correlations at long lags

I have panel data where different students are tested for overlapping 2-year periods.

* Subject A is observed for years 1 & 2.
* Subject B is observed for years 2 & 3.
* Subject C is observed for years 3 & 4.
* etc. up to year 12 (of school)

For each observed year there are three separate test occasions (fall, winter, spring) and two subjects (reading, math).

It seems to me I can impute the missing test scores provided I am willing to assume something about lags that are 2 years or longer. For example, I could assume that the partial correlation at lags of 2 years or longer is zero. This is not an unreasonable assumption since the correlations at shorter lags are very strong (.8-.9).

Is there software that will allow me to do this conveniently? My usual strategy is to reshape the data from long to wide and then impute using a multivariate normal model. There are several packages that will permit this; however, I am not aware of software that will let me constrain the covariance matrix in the way I have described.

I have not used imputation software that is tailored for panel data -- such as Schafer et al's PAN package, recently ported from S-Plus to R. I could try that, provided there is a convenient way to restrict the long lags.

Thanks!
--
Best wishes,
Paul von Hippel
Assistant Professor
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
mobile, preferred (614) 282-8963
office (512) 232-3650
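
Editorial sketch of the three-wave version of the problem discussed in the thread (pattern A: O1 O2 M3, pattern B: M1 O2 O3). This is a minimal illustration only. It is not code from any poster, and it does not describe how PROC MI, IVEware, or WinBUGS work internally. It assumes trivariate normal data with known means, variances, and adjacent-wave covariances; the function names and the example numbers are hypothetical. Setting the partial correlation between waves 1 and 3 given wave 2 to zero fixes the one covariance the data cannot identify at sigma13 = sigma12 * sigma23 / sigma22, after which the usual conditional-normal draw of the missing wave goes through.

```python
import numpy as np


def ar1_constrained_cov(var1, var2, var3, cov12, cov23):
    """Build a 3x3 covariance whose (1,3) entry is implied by a zero
    partial correlation between waves 1 and 3 given wave 2."""
    cov13 = cov12 * cov23 / var2  # the only entry the data cannot identify
    return np.array([
        [var1, cov12, cov13],
        [cov12, var2, cov23],
        [cov13, cov23, var3],
    ])


def conditional_normal(mu, sigma, x):
    """Mean and covariance of the missing entries of x (marked NaN)
    given the observed entries, under N(mu, sigma)."""
    miss = np.isnan(x)
    obs = ~miss
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    coef = s_mo @ np.linalg.inv(s_oo)  # regression of missing on observed
    cond_mean = mu[miss] + coef @ (x[obs] - mu[obs])
    cond_cov = s_mm - coef @ s_mo.T
    return cond_mean, cond_cov


def impute_row(mu, sigma, x, rng):
    """Fill the NaN entries of x with one draw from their conditional
    distribution."""
    cond_mean, cond_cov = conditional_normal(mu, sigma, x)
    filled = x.copy()
    filled[np.isnan(x)] = rng.multivariate_normal(cond_mean, cond_cov)
    return filled


# Hypothetical values: zero means, unit variances, adjacent-wave
# correlation 0.8 (in the range mentioned in the thread).
mu = np.zeros(3)
sigma = ar1_constrained_cov(1.0, 1.0, 1.0, 0.8, 0.8)

# The (1,3) element of the precision matrix is zero exactly when the
# partial correlation between waves 1 and 3 given wave 2 is zero.
precision = np.linalg.inv(sigma)

rng = np.random.default_rng(0)
pattern_a = np.array([1.0, 0.5, np.nan])  # O1 O2 M3
pattern_b = np.array([np.nan, 0.5, 1.2])  # M1 O2 O3
completed_a = impute_row(mu, sigma, pattern_a, rng)
completed_b = impute_row(mu, sigma, pattern_b, rng)
```

With these numbers, the conditional mean of wave 3 for pattern A is 0.8 * 0.5 = 0.4 with variance 0.36. Wave 1 drops out of the regression entirely, which is the AR(1) assumption at work. In a real application the means, variances, and adjacent-wave covariances would themselves be estimated or drawn within the Gibbs cycle rather than fixed.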
