I've been out of the stats game since my undergrad days, but I think I've come across a need for some inferential stats. I'll explain my logic and see what you think.
I analyse admissions data for a few hundred hospitals. Last year (Y2) I ran their data through a piece of software that picks up problems in the diagnosis codes. When I found problems I sent the data back to the hospitals for assessment, and they had the option of correcting it. The error rate was calculated as the number of records that generated a particular error divided by the number of records that could potentially have generated that error. The previous year (Y1), the data was not run through the software. I want to show that applying the treatment in Y2 (sending the errors back) resulted in significantly fewer errors than in Y1.

Issue 1: The number of records that could potentially generate the errors (the denominator) at each hospital is not constant from one year to the next. For example:

  Hosp   Y1 Denom   Y1 Num   Y2 Denom   Y2 Num
  A      30017      198      31098      56
  B      378        3        420        0
  ...

To counter this I reasoned that if a denominator changed by a factor of 31098/30017, then with no treatment applied we should expect the numerator to change by the same factor. So I ended up with an actual Y2 numerator and an expected Y2 numerator, and I want to test whether the expected is significantly different from the actual.

Issue 2: The number of records hospitals report is not normally (Gaussian) distributed (kurtosis 4.7, skew 2.3 for the expected numerator): we have a lot of very small hospitals (few records), some medium-sized ones, and some extremely large ones. When I take the log of the denominators and numerators, the data fits a straight line very nicely (r-squared .85 for the denominators, .95 for the numerators), so a transformation seems to be in order. BUT a number of the hospitals have a zero value in the expected numerator, the actual numerator, or both, and you can't take the log of zero. Zero, however, is very valuable information (the hospital had no errors in the diagnosis codes in question), and excluding the zero values would compromise the measurement of improvement.
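To make the scaling logic concrete, here is a small sketch in Python using only the two example hospitals above (the variable names are mine, not from any real system):

```python
# Expected Y2 numerator: scale the Y1 numerator by the change in denominator.
rows = [
    # (hospital, y1_denom, y1_num, y2_denom, y2_num)
    ("A", 30017, 198, 31098, 56),
    ("B", 378, 3, 420, 0),
]

for hosp, y1_d, y1_n, y2_d, y2_n in rows:
    # With no treatment, the numerator should grow by the same factor
    # as the denominator did between years.
    expected = y1_n * y2_d / y1_d
    print(f"{hosp}: expected Y2 num = {expected:.2f}, actual Y2 num = {y2_n}")
```

So hospital A would be expected to show about 205 errors in Y2 but actually showed 56, while hospital B would be expected to show about 3.3 but showed none.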
Is it reasonable to add a constant (like 1) to every value in the actual and expected numerators to get around this? To me, a simple t-test on the transformed data would either support or discredit my hypothesis that sending the errors back to the hospitals, and giving them an opportunity to fix the data, contributed to a statistically significantly smaller number of errors reported at the end of the second year. BUT I'm not sure about getting around the log of zero.

Regards,
GPO

=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
http://jse.stat.ncsu.edu/
=================================================================
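P.S. A minimal sketch of the add-1-then-log idea and the paired t-test I have in mind, in Python. The expected/actual values here are made up purely for illustration, and the t statistic is computed by hand from the paired differences (the same thing scipy.stats.ttest_rel would do) so the snippet needs only the standard library:

```python
import math
import statistics

# Hypothetical expected vs actual Y2 numerators, including zeros.
expected = [205.13, 3.33, 12.0, 0.0, 47.5]
actual   = [56, 0, 5, 0, 30]

# log1p(x) = log(x + 1): defined at zero, so zero-error hospitals stay in.
diffs = [math.log1p(e) - math.log1p(a) for e, a in zip(expected, actual)]

# Paired t statistic: mean difference over its standard error.
n = len(diffs)
t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
print(f"paired t = {t:.3f} on {n - 1} df")
```

The hospitals with zeros contribute a difference of log(1) = 0 rather than being thrown away, which is the behaviour I want, but I don't know whether the arbitrary "+1" distorts the test, especially for the small hospitals where 1 is large relative to the counts.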
