Re: [Statistics] Convention when outside support?

Alex Herbert Fri, 29 Nov 2019 12:46:37 -0800


> On 29 Nov 2019, at 18:24, Gilles Sadowski <gillese...@gmail.com> wrote:
> 
> Hi.
> 
> On Fri, 29 Nov 2019 at 18:41, Alex Herbert <alex.d.herb...@gmail.com> wrote:
>> 
>> On 29/11/2019 16:48, Gilles Sadowski wrote:
>>> Hello.
>>> 
>>> For all implemented distributions, what convention should be adopted
>>> when methods
>>>  * density(x)
>>>  * logDensity(x)
>>>  * cumulativeProbability(x)
>>> are called with "x" out of the "support" bounds?
>>> 
>>> Currently some (but not all[1]) are documented to return "NaN".
>>> An alternative could be to throw an exception.
>> 
>> The convention in the java.lang.Math class is to return NaN for things
>> that do not make sense, e.g.
>> 
>> Math.log(-1)
>> Math.asin(4)
> 
> But are we in the same kind of (wrong) usage when considering
> the argument to the above methods?
> I mean: If we ask the question of "What is the density at x?", is
> it really an error to reply "0" when outside the domain?


For probabilities, returning 0 does not seem wrong.

It would be akin to using the instanceof operator, where you do something based 
on whether the object is of the correct type. Here you want a probability for a 
value. If the value is outside the support then it has no probability, so you 
return zero and the caller can do any computation they want on that basis.
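
As a rough illustration of the return-zero convention, here is a minimal Java
sketch for a uniform distribution. The class name UniformSketch and its layout
are hypothetical and are not the Commons Statistics API; only the method names
density, logDensity and cumulativeProbability come from this thread. Passing
NaN through as NaN is also an assumption of the sketch.

```java
/** Hypothetical sketch of "return 0 outside the support"; not the library API. */
final class UniformSketch {
    private final double lower;
    private final double upper;

    UniformSketch(double lower, double upper) {
        // Bad parameters still fail fast at construction time.
        if (!(lower < upper)) {
            throw new IllegalArgumentException("lower must be < upper");
        }
        this.lower = lower;
        this.upper = upper;
    }

    double density(double x) {
        if (x >= lower && x <= upper) {
            return 1.0 / (upper - lower);
        }
        // Outside the support the density is 0; NaN input stays NaN (assumption).
        return Double.isNaN(x) ? Double.NaN : 0.0;
    }

    double logDensity(double x) {
        // Math.log(0.0) is -Infinity, so no special case is needed.
        return Math.log(density(x));
    }

    double cumulativeProbability(double x) {
        if (x <= lower) {
            return 0.0;
        }
        if (x >= upper) {
            return 1.0;
        }
        return (x - lower) / (upper - lower);
    }
}
```

With this convention the caller never sees NaN for a finite argument, and the
log density falls out as -Infinity in the same way as R and scipy report it.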

As I mentioned, popular R and Python implementations return zero for 
out-of-domain cases, so this behaviour would not be unprecedented.

I previously checked the gamma distribution. The same is true for others I’ve 
just checked, e.g. a Binomial in R:

> dbinom(-1, size=12, prob=0.2) 
[1] 0
> dbinom(44, size=12, prob=0.2) 
[1] 0

Or scipy:

>>> from scipy.stats import binom
>>> n, p = 12, 0.2
>>> binom.pmf(-1, n, p)
0.0
>>> binom.pmf(44, n, p)
0.0

> 
>> This leaves it as the responsibility of the caller to know when it may
>> be possible to pass in a bad value and so check the results.
>> 
>> It unfortunately leaves open the issue that not everyone will do that
>> and so their program can be brought to a stop by presence of NaN values
>> that may have appeared some way further back in the computation.
>> 
>> Throwing an exception seems to be the only way to preserve the stack
>> trace of where the computation went wrong.
>> 
>> So either case has merit.
>> 
>> What do other languages do? A few seem to return 0 for out of support.
>> 
>> I had a look at Python. Here there is not much consistency using scipy:
>> 
>>>>> import math
>>>>> from scipy.stats import gamma
>>>>> gamma.pdf(0.5, 1.99)
>> 0.3066586069413397
>>>>> gamma.pdf(-0.5, 1.99)
>> 0.0
>>>>> gamma.logpdf(-0.5, 1.99)
>> -inf
>>>>> math.log(0)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> ValueError: math domain error
>> 
>> So scipy returns 0 for the density function when outside support. It
>> returns -inf for the log density there, but Python's math.log raises an
>> exception for the log of zero.
>> 
>> In R the behaviour is the same as python with the exception that the log
>> of zero is -Inf.
>> 
>>> dgamma(0, 2)
>> [1] 0
>>> dgamma(-1, 2)
>> [1] 0
>>> dgamma(-1, 2, log=TRUE)
>> [1] -Inf
>>> log(0)
>> [1] -Inf
>> 
>> So returning 0 is another option. However this cannot distinguish a
>> valid return of 0 from an error.
>> 
>> Note that if we did not have double as a return value then throwing an
>> exception would be the primary choice for signalling error as there is
>> no NaN for other numbers. However there are documented cases for
>> computations in the JDK which do not make sense that avoid throwing
>> exceptions as in Math.abs(int) for Integer.MIN_VALUE which still returns
>> a negative.
>> 
>> I'm not a fan of static properties to configure the behaviour either
>> way. I don't think using zero is a good idea as it cannot signal
>> something is wrong.
>> 
>> I would favour one of the following:
>> 
>> - Provide alternative methods to return NaN or throw
>> - Always return NaN (which seems more Java conventional) and provide a
>> wrapper distribution that can wrap calls to density, logDensity and
>> cumulativeProbability and throw an exception if the underlying
>> distribution returns NaN.
>> - Always throw (which forces users to safe usage) and provide a wrapper
>> distribution that can wrap calls to density, logDensity and
>> cumulativeProbability and return NaN or zero if the underlying
>> distribution throws.
>> 
>> Creating a distribution with a bad value throws an exception, yet using
>> a distribution with a bad value returns NaN. Given that inconsistency,
>> it seems to me that throwing an exception may be the more sensible
>> approach. A wrapper to guard exceptions can be user configurable to
>> return NaN or zero.
> 
> Instantiating and raising an exception is (relatively) costly.
> So if the "return NaN" feature is used in a use-case where performance
> matters, the wrapper would spoil the intended purpose.

Yes. On reflection, the default should be to return a standard answer for an 
invalid argument, with a wrapper provided to throw if the argument is out of 
bounds. Providing a wrapper at least acknowledges that this is something people 
should consider when using the distribution classes: do they want a zero for 
out-of-domain, or do they want an exception?
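
To make the wrapper idea concrete, here is a minimal Java sketch. The Dist
interface and the StrictDist class are hypothetical names invented for this
example and are not the Commons Statistics API. The sketch assumes the wrapped
distribution exposes its support bounds and returns a plain value (zero, say)
for out-of-support arguments.

```java
/** Hypothetical minimal interface for this sketch; not the library's. */
interface Dist {
    double density(double x);
    double cumulativeProbability(double x);
    double supportLowerBound();
    double supportUpperBound();
}

/** Wrapper that throws instead of returning the default out-of-support value. */
final class StrictDist implements Dist {
    private final Dist delegate;

    StrictDist(Dist delegate) {
        this.delegate = delegate;
    }

    private void check(double x) {
        // Negated comparison so that NaN is rejected as well.
        if (!(x >= delegate.supportLowerBound() && x <= delegate.supportUpperBound())) {
            throw new IllegalArgumentException("Argument outside support: " + x);
        }
    }

    @Override
    public double density(double x) {
        check(x);
        return delegate.density(x);
    }

    @Override
    public double cumulativeProbability(double x) {
        check(x);
        return delegate.cumulativeProbability(x);
    }

    @Override
    public double supportLowerBound() {
        return delegate.supportLowerBound();
    }

    @Override
    public double supportUpperBound() {
        return delegate.supportUpperBound();
    }
}
```

Callers who care about performance use the distribution directly and never pay
for an exception; callers who want a stack trace at the point of misuse wrap it,
e.g. new StrictDist(dist).density(x).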

> 
> Gilles
> 
>> 
>> Alex
>>> Regards,
>>> Gilles
>>> 
>>> [1] https://issues.apache.org/jira/projects/MATH/issues/MATH-1503
>>> 
