Re: [R] random truncation
PLEASE EXCUSE: This discussion has diverged from R into discussing the precise assumptions seemingly descriptive of an application that drove the initial post to this thread. A reply by Abby Spurdle seemed to me to raise questions, whose answers may not be intelligible without material snipped from Spurdle's reply. I wish to thank Spurdle from the reply and apologize to those who feel this is an abuse of this list. I trust that those in the latter category will please not bother to read further. For anyone still interested in this problem, below please find my earlier analysis with corrections that attempt to respond to Spurdle's most recent concerns. Thanks, Spencer Graves On 2019-07-12 22:31, Abby Spurdle wrote: > The distribution of the randomly truncated variable has thus four > parameters: a, b, mu and sigma. I was able to write down the likelihood > and attempted to maximise it I read the Wikipedia article more carefully. The formula is relatively simple, and is based on the application of Bayes Theorem. If one doesn't want to work out the integral, numerical methods can be used. However, the problem needs to be defined *precisely* first. Correct: In my case, I confess I hadn't thought this through completely before posting. I tried Rseek, as Bert Gunter suggested. That led me to the "truncreg" and "DTDA" packages, neither of which seemed to be what I wanted; thanks to Bert, Rolf, and Abby for your comments. I'm observing a random variable Y[i] = (x[i]'b+e[i]) given Y[i]>(z[i]'c+f[i]) where the tick mark (') denotes transpose of a vector, and e and f are normally distributed with mean 0 and standard deviations s and t, respectively, i = 1:n. Thus, Y[i] follows a truncated normal distribution with mean x[i]'b and standard deviation s, with the truncation condition being that Y[i]>(z[i]'c+f[i]). I want the total of all the Y's from the untruncated distribution, i.e., including those truncated (and not observed). I think the likelihood is the product of the density of Y[i] given x[i] and given that Y[i] is actually observed. By substituting Y[i] = (x[i]'b+e[i]) into the truncation condition Y[i]>(z[i]'c+f[i]), we get the following: (x[i]'b+e)>(z[i]c+f). This occurs if and only if: (x[i]'b-z[i]'c)>(f-e), Therefore, the probability that Y[i] is observed (and not truncated) is Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)) where Phi is the cdf of the standard normal. And then the likelihood for observation i can be written as follows: f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) / Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)). We may not be able to estimate "c" in this, because if one of the z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf. That makes the denominator 0 and the likelihood Inf. (If all the z[i]'s are zero, we still cannot estimate "c".) However, if "b" is estimable, ignoring the truncation, then we can estimate "b", "s" and "t" given "c". And then the desired total of all the Y's, observed and unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}. This likelihood is simple enough, it can be easily programmed in R and maximized over variations in "b", "s" and "t" given "c". I can get starting values for "b" and "s" from "lm", ignoring the truncation. And I can first fit the model assuming t = s, then test whether it's different using likelihood ratio. And I can try to estimate "c", but I should probably use values I can estimate from other sources until I'm comfortable with the estimates I get for "b", "s" and "t" given an assumed value for "c". Comments? Thanks so much. Spencer Graves __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
Note my last response is probably off-topic. I just wanted to highlight the need for defining problems in a way that computers (and R programmers) can understand. On Sun, Jul 14, 2019 at 1:13 PM Abby Spurdle wrote: > > Firstly, we don't really need all your working. > Just the problem you want solve. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
Firstly, we don't really need all your working. Just the problem you want solve. However, I'm still having difficulty understanding this. > I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and > f are normally distributed with standard deviations s and t, > respectively, i = 1:n. I want the total of all the Y's, including those > truncated (and not observed). > (x[i]'b+e)>(z[i]c+f) > (x[i]'b-z[i]'c)>(f-e), (1) What does the tick (or single quote) mean? (2) What do you mean by "X[i]" and "observing Y[i]"? (3) Are e and f random variables, or vectors of (observed) errors? My intuitive understanding of (2), in the given context, would be that X and Y are vectors of observations. However, if e and f are random variables, wouldn't that would make Y[i] a random variable too? In which case I'm not sure what you mean by "observing". Conversely, if e and f are vectors of (observed) errors, shouldn't they be indexed (for consistency)? And in which case, there's no random component in your expressions. (4) Is lower case "x[i]" the same as upper case "X[i]", and what's "z[i]"? (5) Is the mean of e and f, zero? (6) So, you want to estimate b and c? And wouldn't that make this a parameter estimation problem? (In which case, you don't necessary need to model any distributions). > Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)) > where Phi is the cdf of the standard normal. This implies that "Y[i]" is a is binary (true or false) random variable. Do you mean Pr (Y <= y) where Y is a random variable and y is a possible value (from Y's sample space), that Y can be less than or equal to? You can replace y with y[i] if you want, but the principle is the same. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
> What integral? What do you mean "What integral?"... The integral on the Wikipedia page. (The same page referenced in the earlier posts). https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation https://wikimedia.org/api/rest_v1/media/math/render/svg/93717ffcd3bfa2a60d825bd71b5375ad888ceb97 Seems quite obvious to me... Did you read the Wikipedia page? > Maximising > it (numerically *of course*) turned out to be problematic. Back then. > Modern optimisers might help. Numerical methods have been around for a while... > > However, the problem needs to be defined *precisely* first. > Again I have no idea of what you are driving at here. The concept of > "random truncation" is quite precisely defined. I wasn't referring the problem of "random truncation". I was referring to the original post. In particular, how Y, S[i] and d[i] are defined. Maybe, I should be more precise when suggesting others be more precise. (My bad). I note that Spencer has posted again. I will read the new post, later, when I get some more time. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
On 2019-07-12 22:31, Abby Spurdle wrote: > The distribution of the randomly truncated variable has thus four > parameters: a, b, mu and sigma. I was able to write down the likelihood > and attempted to maximise it I read the Wikipedia article more carefully. The formula is relatively simple, and is based on the application of Bayes Theorem. If one doesn't want to work out the integral, numerical methods can be used. However, the problem needs to be defined *precisely* first. Correct: In my case, I confess I hadn't thought this through completely before posting. I tried Rseek, as Bert Gunter suggested. That led me to the "truncreg" and "DTDA" packages, neither of which seemed to be what I wanted; thanks to Bert, Rolf, and Abby for your comments. I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and f are normally distributed with standard deviations s and t, respectively, i = 1:n. I want the total of all the Y's, including those truncated (and not observed). I think the likelihood is the product of the density of Y[i] given x[i] and given that Y[i] is actually observed. The latter can be further written as follows: (x[i]'b+e)>(z[i]c+f) iff (x[i]'b-z[i]'c)>(f-e), Therefore, the probability that Y[i] is observed is Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)) where Phi is the cdf of the standard normal. And then the likelihood for observation i can be written as follows: f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) / Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2). We may not be able to estimate "c" in this, because if one of the z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf. That makes the denominator 0 and the likelihood Inf. (If all the z[i]'s are zero, we still cannot estimate "c".) However, if "b" is estimable, ignoring the truncation, then we can estimate "b", "s" and "t" given "c". And then the desired total of all the Y's, observed and unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}. This likelihood is simple enough, it can be easily programmed in R and maximized over variations in "b", "s" and "t" given "c". I can get starting values for "b" and "s" from "lm", ignoring the truncation. And I can first fit the model assuming t = s, then test whether it's different using likelihood ratio. And I can try to estimate "c", but I should probably use values I can estimate from other sources until I'm comfortable with the estimates I get for "b", "s" and "t" given an assumed value for "c". Comments? Thanks so much. Spencer Graves __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
On 13/07/19 3:31 PM, Abby Spurdle wrote: > The distribution of the randomly truncated variable has thus four > parameters: a, b, mu and sigma. I was able to write down the likelihood > and attempted to maximise it I read the Wikipedia article more carefully. The formula is relatively simple, and is based on the application of Bayes Theorem. If one doesn't want to work out the integral, numerical methods can be used. I am mystified as to what you are saying here. What integral? As I said, I could write down the likelihood. Explicitly. Maximising it (numerically *of course*) turned out to be problematic. Back then. Modern optimisers might help. Presumably the DTDA package has a reasonable procedure. (I haven't looked at it.) However, the problem needs to be defined *precisely* first. Again I have no idea of what you are driving at here. The concept of "random truncation" is quite precisely defined. cheers, Rolf -- Honorary Research Fellow Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
> The distribution of the randomly truncated variable has thus four > parameters: a, b, mu and sigma. I was able to write down the likelihood > and attempted to maximise it I read the Wikipedia article more carefully. The formula is relatively simple, and is based on the application of Bayes Theorem. If one doesn't want to work out the integral, numerical methods can be used. However, the problem needs to be defined *precisely* first. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
On 13/07/19 10:54 AM, Spencer Graves wrote: Hello: What do you suggest I do about modeling random truncation? Good question! Probably the best answer is "Give up and go to the pub!" :-) But seriously, there is a package DTDA on CRAN which purports to analyse randomly truncated data. I have data on a variable Y in strata S[0], S[1], ..., S[n], where Y is always observed in S[0] but is less often observed in the other strata. I assume that the probability of observing Y is a monotonically increasing function of Y and a monotonically decreasing function of d[i] = the distance from S[0] to S[i]. There is a section on "random truncation" in the Wikipedia article on "Truncated distribution".[1] It would be nice if I had an R package that would make it relatively easy to model the truncation as a function of "d" and / or publication that described someone doing it in R. (I also have a couple of other variables that influence the distribution of Y.) Thanks, Spencer Graves [1] https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation I'd just like to add that many many years ago, probably back before you were born, I struggled mightily for a while with random truncation, modelling the truncation variable to have a density that was uniform on some interval [a,b]. I took the underlying ("untruncated") variable to have a Gaussian distribution N(mu,sigma^2). The distribution of the randomly truncated variable has thus four parameters: a, b, mu and sigma. I was able to write down the likelihood and attempted to maximise it, but this was a nightmare since the partial derivatives of the log likelihood are undefined for b=x_i where x_1, ..., x_n are the i.i.d. observations from the randomly truncated distribution. I fought with this for a long while, could not get estimates based on simulated data to agree at all well with the true values, and eventually chucked it all in. This all happened back before there was R, or even S!!! Let us hope that the authors of DTDA are far clever than I. (Not at all unlikely!) cheers, Rolf -- Honorary Research Fellow Department of Statistics University of Auckland Phone: +64-9-373-7599 ext. 88276 __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
Did you search on e.g. "model truncation" at rseek.org? Several packages came up that appear to deal with truncated data, though I have no clue whether in the way you specify. -- Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Fri, Jul 12, 2019 at 4:26 PM Abby Spurdle wrote: > > It would be nice if I had an R > > package that would make it relatively easy to model the truncation as a > > function of "d" > > I suspect that R has everything you need, already. > However, I suspect you may need to reformulate your question to find what > you need. > > > I assume that the probability of observing Y is a > > monotonically increasing function of Y > > That's like flipping a coin, and saying the probability of observing heads > is a monotonically increasing function of heads. > > [[alternative HTML version deleted]] > > __ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random truncation
> It would be nice if I had an R > package that would make it relatively easy to model the truncation as a > function of "d" I suspect that R has everything you need, already. However, I suspect you may need to reformulate your question to find what you need. > I assume that the probability of observing Y is a > monotonically increasing function of Y That's like flipping a coin, and saying the probability of observing heads is a monotonically increasing function of heads. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.