Re: [R] random truncation

2019-07-13 Thread Spencer Graves
PLEASE EXCUSE:  This discussion has diverged from R into discussing the 
precise assumptions seemingly descriptive of an application that drove 
the initial post to this thread.  A reply by Abby Spurdle seemed to  me 
to raise questions, whose answers may not be intelligible without 
material snipped from Spurdle's reply.  I wish to thank Spurdle from the 
reply and apologize to those who feel this is an abuse of this list.  I 
trust that those in the latter category will please not bother to read 
further.  For anyone still interested in this problem, below please find 
my earlier analysis with corrections that attempt to respond to 
Spurdle's most recent concerns.  Thanks, Spencer Graves



On 2019-07-12 22:31, Abby Spurdle wrote:

> The distribution of the randomly truncated variable has thus four
> parameters: a, b, mu and sigma.  I was able to write down the likelihood
> and attempted to maximise it

I read the Wikipedia article more carefully.
The formula is relatively simple, and is based on the application of 
Bayes Theorem.
If one doesn't want to work out the integral, numerical methods can be 
used.


However, the problem needs to be defined *precisely* first.



Correct:  In my case, I confess I hadn't thought this through completely 
before posting.  I tried Rseek, as Bert Gunter suggested.  That led me 
to the "truncreg" and "DTDA" packages, neither of which seemed to be 
what I wanted;  thanks to Bert, Rolf, and Abby for your comments.



  I'm observing a random variable Y[i] = (x[i]'b+e[i]) given 
Y[i]>(z[i]'c+f[i]) where the tick mark (') denotes transpose of a 
vector, and e and f are normally distributed with mean 0 and standard 
deviations s and t, respectively, i = 1:n.  Thus, Y[i] follows a 
truncated normal distribution with mean x[i]'b and standard deviation s, 
with the truncation condition being that Y[i]>(z[i]'c+f[i]).



  I want the total of all the Y's from the untruncated 
distribution, i.e., including those truncated (and not observed).



  I think the likelihood is the product of the density of Y[i] 
given x[i] and given that Y[i] is actually observed.  By substituting 
Y[i] = (x[i]'b+e[i]) into the truncation condition Y[i]>(z[i]'c+f[i]), 
we get the following:



        (x[i]'b+e)>(z[i]c+f).


This occurs if and only if:


        (x[i]'b-z[i]'c)>(f-e),


Therefore, the probability that Y[i] is observed (and not truncated) is


        Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))


where Phi is the cdf of the standard normal.


  And then the likelihood for observation i can be written as follows:


        f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) / 
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2)).



  We may not be able to estimate "c" in this, because if one of the 
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf.  That makes the 
denominator 0 and the likelihood Inf.  (If all the z[i]'s are zero, we 
still cannot estimate "c".)  However, if "b" is estimable, ignoring the 
truncation, then we can estimate "b", "s" and "t" given "c".



  And then the desired total of all the Y's, observed and 
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.



  This likelihood is simple enough, it can be easily programmed in 
R and maximized over variations in "b", "s" and "t" given "c".  I can 
get starting values for "b" and "s" from "lm", ignoring the truncation.  
And I can first fit the model assuming t = s, then test whether it's 
different using likelihood ratio.  And I can try to estimate "c", but I 
should probably use values I can estimate from other sources until I'm 
comfortable with the estimates I get for "b", "s" and "t" given an 
assumed value for "c".



  Comments?
  Thanks so much.
  Spencer Graves

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-13 Thread Abby Spurdle
Note my last response is probably off-topic.
I just wanted to highlight the need for defining problems in a way that
computers (and R programmers) can understand.

On Sun, Jul 14, 2019 at 1:13 PM Abby Spurdle  wrote:
>
> Firstly, we don't really need all your working.
> Just the problem you want solve.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-13 Thread Abby Spurdle
Firstly, we don't really need all your working.
Just the problem you want solve.

However, I'm still having difficulty understanding this.

> I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and
> f are normally distributed with standard deviations s and t,
> respectively, i = 1:n.  I want the total of all the Y's, including those
> truncated (and not observed).

> (x[i]'b+e)>(z[i]c+f)
> (x[i]'b-z[i]'c)>(f-e),

(1) What does the tick (or single quote) mean?
(2) What do you mean by "X[i]" and "observing Y[i]"?
(3) Are e and f random variables, or vectors of (observed) errors?
My intuitive understanding of (2), in the given context, would be that X
and Y are vectors of observations.
However, if e and f are random variables, wouldn't that would make Y[i] a
random variable too? In which case I'm not sure what you mean by
"observing".
Conversely, if e and f are vectors of (observed) errors, shouldn't they be
indexed (for consistency)? And in which case, there's no random component
in your expressions.
(4) Is lower case "x[i]" the same as upper case "X[i]", and what's "z[i]"?
(5) Is the mean of e and f, zero?
(6) So, you want to estimate b and c?
And wouldn't that make this a parameter estimation problem?
(In which case, you don't necessary need to model any distributions).

> Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))
> where Phi is the cdf of the standard normal.

This implies that "Y[i]" is a is binary (true or false) random variable.

Do you mean Pr (Y <= y) where Y is a random variable and y is a possible
value (from Y's sample space), that Y can be less than or equal to?
You can replace y with y[i] if you want, but the principle is the same.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-13 Thread Abby Spurdle
> What integral?

What do you mean "What integral?"...
The integral on the Wikipedia page.
(The same page referenced in the earlier posts).

https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation
https://wikimedia.org/api/rest_v1/media/math/render/svg/93717ffcd3bfa2a60d825bd71b5375ad888ceb97

Seems quite obvious to me...
Did you read the Wikipedia page?

> Maximising
> it (numerically *of course*) turned out to be problematic.  Back then.
> Modern optimisers might help.

Numerical methods have been around for a while...

> > However, the problem needs to be defined *precisely* first.

> Again I have no idea of what you are driving at here.  The concept of
> "random truncation" is quite precisely defined.

I wasn't referring the problem of "random truncation".
I was referring to the original post.
In particular, how Y, S[i] and d[i] are defined.

Maybe, I should be more precise when suggesting others be more precise.
(My bad).

I note that Spencer has posted again.
I will read the new post, later, when I get some more time.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-13 Thread Spencer Graves




On 2019-07-12 22:31, Abby Spurdle wrote:

> The distribution of the randomly truncated variable has thus four
> parameters: a, b, mu and sigma.  I was able to write down the likelihood
> and attempted to maximise it

I read the Wikipedia article more carefully.
The formula is relatively simple, and is based on the application of 
Bayes Theorem.
If one doesn't want to work out the integral, numerical methods can be 
used.


However, the problem needs to be defined *precisely* first.



Correct:  In my case, I confess I hadn't thought this through completely 
before posting.  I tried Rseek, as Bert Gunter suggested.  That led me 
to the "truncreg" and "DTDA" packages, neither of which seemed to be 
what I wanted;  thanks to Bert, Rolf, and Abby for your comments.



  I'm observing Y[i] = (X[i]'b+e) given Y[i]>(z[i]'c+f) where e and 
f are normally distributed with standard deviations s and t, 
respectively, i = 1:n.  I want the total of all the Y's, including those 
truncated (and not observed).



  I think the likelihood is the product of the density of Y[i] 
given x[i] and given that Y[i] is actually observed.  The latter can be 
further written as follows:



        (x[i]'b+e)>(z[i]c+f)
iff
        (x[i]'b-z[i]'c)>(f-e),


Therefore, the probability that Y[i] is observed is


        Pr{Y[i] observed} = Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2))


where Phi is the cdf of the standard normal.


  And then the likelihood for observation i can be written as follows:


        f(y[i]|x[i], z[i], b, c, s, t) = phi((y[i]-x[i]'b)/s) / 
Phi((x[i]'b-z[i]'c)/sqrt(s^2+t^2).



  We may not be able to estimate "c" in this, because if one of the 
z[i]'s is nonzero, we can pick "c" so z[i]'c is Inf.  That makes the 
denominator 0 and the likelihood Inf.  (If all the z[i]'s are zero, we 
still cannot estimate "c".)  However, if "b" is estimable, ignoring the 
truncation, then we can estimate "b", "s" and "t" given "c".



  And then the desired total of all the Y's, observed and 
unobserved, would be the sum of y[i] divided by Pr{Y[i] observed}.



  This likelihood is simple enough, it can be easily programmed in 
R and maximized over variations in "b", "s" and "t" given "c".  I can 
get starting values for "b" and "s" from "lm", ignoring the truncation.  
And I can first fit the model assuming t = s, then test whether it's 
different using likelihood ratio.  And I can try to estimate "c", but I 
should probably use values I can estimate from other sources until I'm 
comfortable with the estimates I get for "b", "s" and "t" given an 
assumed value for "c".



  Comments?
  Thanks so much.
  Spencer Graves

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-12 Thread Rolf Turner



On 13/07/19 3:31 PM, Abby Spurdle wrote:


 > The distribution of the randomly truncated variable has thus four
 > parameters: a, b, mu and sigma.  I was able to write down the likelihood
 > and attempted to maximise it

I read the Wikipedia article more carefully.
The formula is relatively simple, and is based on the application of 
Bayes Theorem.

If one doesn't want to work out the integral, numerical methods can be used.


I am mystified as to what you are saying here.  What integral?
As I said, I could write down the likelihood.  Explicitly.  Maximising 
it (numerically *of course*) turned out to be problematic.  Back then.
Modern optimisers might help.  Presumably the DTDA package has a 
reasonable procedure.  (I haven't looked at it.)



However, the problem needs to be defined *precisely* first.


Again I have no idea of what you are driving at here.  The concept of 
"random truncation" is quite precisely defined.


cheers,

Rolf

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-12 Thread Abby Spurdle
> The distribution of the randomly truncated variable has thus four
> parameters: a, b, mu and sigma.  I was able to write down the likelihood
> and attempted to maximise it

I read the Wikipedia article more carefully.
The formula is relatively simple, and is based on the application of Bayes
Theorem.
If one doesn't want to work out the integral, numerical methods can be used.

However, the problem needs to be defined *precisely* first.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-12 Thread Rolf Turner



On 13/07/19 10:54 AM, Spencer Graves wrote:


Hello:


   What do you suggest I do about modeling random truncation?


Good question!  Probably the best answer is "Give up and go to the pub!" :-)

But seriously, there is a package DTDA on CRAN which purports to analyse 
randomly truncated data.




    I have data on a variable Y in strata S[0], S[1], ..., S[n], 
where Y is always observed in S[0] but is less often observed in the 
other strata.  I assume that the probability of observing Y is a 
monotonically increasing function of Y and a monotonically decreasing 
function of d[i] = the distance from S[0] to S[i].



   There is a section on "random truncation" in the Wikipedia 
article on "Truncated distribution".[1]  It would be nice if I had an R 
package that would make it relatively easy to model the truncation as a 
function of "d" and / or publication that described someone doing it in 
R.  (I also have a couple of other variables that influence the 
distribution of Y.)



   Thanks,
   Spencer Graves


[1] https://en.wikipedia.org/wiki/Truncated_distribution#Random_truncation


I'd just like to add that many many years ago, probably back before you 
were born, I struggled mightily for a while with random truncation, 
modelling the truncation variable to have a density that was uniform on 
some interval [a,b].


I took the underlying ("untruncated") variable to have a Gaussian 
distribution N(mu,sigma^2).


The distribution of the randomly truncated variable has thus four 
parameters: a, b, mu and sigma.  I was able to write down the likelihood 
and attempted to maximise it, but this was a nightmare since the partial 
derivatives of the log likelihood are undefined for b=x_i where x_1, 
..., x_n are the i.i.d. observations from the randomly truncated 
distribution.  I fought with this for a long while, could not get 
estimates based on simulated data to agree at all well with the true 
values, and eventually chucked it all in.


This all happened back before there was R, or even S!!!

Let us hope that the authors of DTDA are far clever than I.  (Not at all 
unlikely!)


cheers,

Rolf

--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-12 Thread Bert Gunter
Did you search on e.g. "model truncation" at rseek.org? Several packages
came up that appear to deal with truncated data, though I have no clue
whether in the way you specify.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Jul 12, 2019 at 4:26 PM Abby Spurdle  wrote:

> > It would be nice if I had an R
> > package that would make it relatively easy to model the truncation as a
> > function of "d"
>
> I suspect that R has everything you need, already.
> However, I suspect you may need to reformulate your question to find what
> you need.
>
> > I assume that the probability of observing Y is a
> > monotonically increasing function of Y
>
> That's like flipping a coin, and saying the probability of observing heads
> is a monotonically increasing function of heads.
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] random truncation

2019-07-12 Thread Abby Spurdle
> It would be nice if I had an R
> package that would make it relatively easy to model the truncation as a
> function of "d"

I suspect that R has everything you need, already.
However, I suspect you may need to reformulate your question to find what
you need.

> I assume that the probability of observing Y is a
> monotonically increasing function of Y

That's like flipping a coin, and saying the probability of observing heads
is a monotonically increasing function of heads.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.