On 08-Sep-05 John Sorkin wrote:
> I have a batch of data in which each line of data contains three
> values, calcium score, age, and sex. I would like to predict calcium
> scores as a function of age and sex, i.e. calcium=f(age,sex).
> Unfortunately the calcium scores have a very "ugly distribution".
> There are multiple zeros, and multiple values between 300 and 600.
> There are no values between zero and 300. Needless to say, the
> calcium scores are not normally distributed; however, the values
> between 300 and 600 have a distribution that is log normal. As you
> might imagine, the residuals from the regression are not normally
> distributed and thus violate the basic assumption of regression
> analyses. Does anyone have a suggestion for a method (or a
> transformation) that will allow me to predict calcium from age and
> sex without violating the assumptions of the model?
> Thanks,
> John

From your description (but only from your description) one might be
tempted to suggest (borrowing a term from Joe Schafer) a
"semi-continuous" model. This means that each observation either
takes a discrete value, or takes a value with a continuous
distribution. In your case this might be

  Score = 0  with probability p, which is a function of Age and Sex

  Score = X  with probability (1-p), where X has a log-normal
             distribution.

Whether using such a model, for data arising in the context you
refer to, is reasonable depends on whether "Calcium Score = 0" is a
reasonable description of a biological state of things. Even if it is
not a reasonable biological state, it may be a reasonable description
of the outcome of a measurement process (e.g. too small to measure),
in which case there may be a consequential issue -- what is the
likely distribution of calcium values which give rise to Score = 0?
(Though your data may be uninformative about this.) However, if your
aim is simply predicting calcium scores, then this may be irrelevant.

With such a model, you should be able to make progress by using a
regression model for the probability p (which may be adequately
addressed by simply using a logistic regression for the event
"Score = 0" or equivalently "Score != 0", though you may need to be
careful about how you represent Age as a covariate; Sex, being
binary, should not present problems). This then allows you to predict
the probability of zero score, and the complementary probability of
non-zero score.

Then you can consider the problem of estimating the relationship
between Score and (Age, Sex) conditional on Score != 0. This, in
turn, is no more (and no less!) complicated than estimating the
continuous distribution of non-zero scores from the subset of the
data which carries such scores. If the distribution of non-zero
scores were (as you suggest) a simple log-normal distribution, then
a regression of log(Score) on Age and Sex might do well.

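As a minimal sketch of these two fits in R (on simulated data, since
I have not seen yours; the data frame 'd', its columns 'score', 'age'
and 'sex', and every coefficient used in the simulation are invented
purely for illustration), something along the following lines might
serve:

```r
## Simulated stand-in data: all names and numbers here are hypothetical.
set.seed(1)
n   <- 500
age <- runif(n, 40, 80)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))

## Assumed mechanism: P(Score = 0) depends on age and sex (arbitrary values).
p0   <- plogis(4 - 0.07 * age - 0.5 * (sex == "M"))
zero <- rbinom(n, 1, p0) == 1

## Assumed non-zero scores: log(Score) normal, mean linear in age and sex.
logpos <- rnorm(n, mean = 5.6 + 0.005 * age + 0.1 * (sex == "M"), sd = 0.15)
d <- data.frame(score = ifelse(zero, 0, exp(logpos)), age = age, sex = sex)

## Discrete part: logistic regression for the event "Score = 0".
fit.zero <- glm(I(score == 0) ~ age + sex, family = binomial, data = d)

## Continuous part: regress log(Score) on Age and Sex,
## using only the subset of the data with non-zero scores.
pos     <- subset(d, score > 0)
fit.pos <- lm(log(score) ~ age + sex, data = pos)

## Prediction for one new (hypothetical) subject.
new    <- data.frame(age = 65, sex = factor("M", levels = levels(d$sex)))
p.zero <- predict(fit.zero, newdata = new, type = "response")
mu.log <- predict(fit.pos, newdata = new)
sd.log <- summary(fit.pos)$sigma
```

Here p.zero estimates P(Score = 0) for the new subject, while mu.log
and sd.log describe the fitted normal distribution of log(Score)
given Score != 0, so that exp(mu.log) is the corresponding median
non-zero score.
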
However, from your description, it may not be a simple log-normal.
The absence of scores between 0 and 300, and the confinement of
score values between 300 and 600, suggest a 3-parameter log-normal
in which, as well as the mean and SD for the normal distribution of
log(X), there is also a lower limit S0, so that it is log(X - S0)
which has the N(mean,SD^2) distribution. The distribution might be
more complicated than this.

So, in summary, provided a "semi-continuous" model is acceptable,
you can proceed by estimating its two aspects separately: the
discrete part by a logistic (or other suitable binary) regression,
using 'glm' in R; the continuous part by a suitable regression
(using e.g. 'lm' in R), perhaps after suitable transformation
(though this may need care). In each case, it is only the relevant
part of the data which will be needed: the proportions with
"Score = 0" and "Score != 0" on the one hand, the values of Score
where "Score != 0" on the other hand, in each case using the
corresponding (Age, Sex) as covariates.

Once you have these estimated models, they can be used
straightforwardly for prediction: given Age and Sex, the Score will
be zero with estimated probability p(Age,Sex) or, with probability
(1 - p(Age,Sex)), will have a distribution implied by your
regression. So the structure of the predicted values will be the
same as the structure of the observed values.

All very straightforward, provided this is a reasonable way to go.

However, there is a complication in that the above might well not
be a reasonable model (as hinted at above). As an example, consider
the following (purely hypothetical) assumptions.

1. The true distribution of Calcium Score is (say) simple log-normal
   such that log(Score) is normal with mean linearly dependent on
   Age and Sex, in all subjects.

2. In attempting to measure true Score (i.e. in obtaining observed
   Calcium Score data), there is a probability that "Score = 0" will
   be obtained, and this probability depends on the true Score (e.g.
   the smaller the true Score, the higher the probability of
   obtaining "Score = 0").

The resulting non-zero score data will then no longer have the
log-normal distribution assumed in (1), since the frequency of
occurrence of smaller values will be attenuated by a factor equal
to the probability that such a value will result in "Score != 0".
(I'm inclined to suspect, from your statement about "300-600", that
this might indeed be the case.)

If this is what is going on, then a different kind of approach is
needed. Each "Score = 0" would in fact correspond to an unobserved
non-zero value of Score, and the estimation of the distribution of
true Score would be straightforward if you knew what these values
were. Conditional on knowing the overall distribution, the
distribution of unobserved values conditional on "Score = 0" could
be obtained, and from this distribution could be derived the
information you would need to estimate the distribution of true
Score which you need for estimating the conditional distribution ...

In other words, we are in effect in an "EM-Algorithm" situation.

This can certainly be solved in R (though I can't at this moment
provide any pointers to R-implementations of a solution for your
specific problem). However, it would be quite feasible for people
to construct suggestions for solving your problem along these lines.

But before people get involved in the work needed, it would be very
helpful if you would respond to the comments above in terms of the
real situation you are dealing with, so that we know what sort of
thing we should be thinking about.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Sep-05                                       Time: 12:01:38
------------------------------ XFMail ------------------------------

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html