Hi Larry,

Thanks for your question. It is hard to give general advice here, but sometimes I find it helpful to ask myself, "What data would I like to have if it cost nothing?" If the variable is "Pregnancy status" in a sample of men, the answer is that I wouldn't want any data. In other situations you may want data that are missing. In those situations, imputation can be helpful.

Does this help? Thanks.
-Juned

-----Original Message-----
From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Hunsicker, Lawrence
Sent: Friday, March 29, 2013 10:10 AM
To: [email protected]
Subject: How to handle data missing before or after a date because the data collection instrument changed

Good morning, all:

Frank Harrell has suggested that I join this group to get your input on a question that I have about the best way to handle data that are missing simply because there was, at the time, no intent to capture them.

I am involved in the analysis of data collected by the Scientific Registry of Transplant Recipients (SRTR), which obtains data from the Organ Procurement and Transplantation Network (OPTN) on all US solid organ transplants. This data collection system has been in place since October 1987, and as you would expect, there has been evolution of the data elements that are collected. New items may be added, and other items may be deleted. Because of this, values for these added or deleted variables are missing for a large fraction of the cases over the total period. While the data are "missing completely at random" from the individual case point of view, there will of course be correlation of the missingness with date of transplant, and the missingness of various elements will be strongly correlated with one another.

I wonder if a case can be made that this situation is analogous to some extent to the situation where the value for a specific variable is "Not relevant," such as for questions about pregnancy for males. In the past, I have handled these variables by creating, for the categorical variables (the majority of them), a category "not collected." This distinguishes this sort of missingness from the situation where the data "should" have been collected but are missing for some other reason.
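As a minimal sketch of the "not collected" coding described above: the collection window dates, the function name `classify_value`, and the example records are all hypothetical and chosen only for illustration, not taken from the SRTR data. The idea is simply to label an absent value by whether the transplant date falls inside the period when the form asked for that item.

```python
from datetime import date

# Hypothetical collection window for one categorical variable
# (made-up dates; the real windows depend on the form history).
COLLECTED_FROM = date(1994, 4, 1)
COLLECTED_UNTIL = date(2004, 6, 29)

def classify_value(transplant_date, value):
    """Return the recorded value, or a label saying why it is absent."""
    if value is not None:
        return value
    if COLLECTED_FROM <= transplant_date <= COLLECTED_UNTIL:
        # The form asked for this item, but no value was recorded.
        return "missing"
    # The form did not ask for this item at the time of transplant.
    return "not collected"

# Hypothetical records: (transplant date, recorded value or None)
records = [
    (date(1990, 5, 2), None),
    (date(1998, 7, 15), "yes"),
    (date(1999, 1, 20), None),
    (date(2010, 3, 9), None),
]

labels = [classify_value(d, v) for d, v in records]
print(labels)
```

This only shows the labeling step. It says nothing about how the resulting category behaves in a model.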
But Frank has convinced me that this approach will likely bias both the covariance matrices and the estimates of precision of the estimated model variable coefficients. I can, of course, use multiple imputation. This is probably the most "correct" approach. But because of the size of the dataset (about 200,000 transplants), the computational expense is non-trivial.

This can't be a problem unique to my situation. What are your thoughts and recommendations? Thanks in advance for your thoughts on this matter.

Larry Hunsicker
Prof. Medicine
U. Iowa College of Medicine
