Re: [Statistics] Convention when outside support?

2019-11-29 Thread Alex Herbert


> On 29 Nov 2019, at 18:24, Gilles Sadowski  wrote:
> 
> Hi.
> 
> On Fri, 29 Nov 2019 at 18:41, Alex Herbert wrote:
>> 
>> On 29/11/2019 16:48, Gilles Sadowski wrote:
>>> Hello.
>>> 
>>> For all implemented distributions, what convention should be adopted
>>> when methods
>>>  * density(x)
>>>  * logDensity(x)
>>>  * cumulativeProbability(x)
>>> are called with "x" out of the "support" bounds?
>>> 
>>> Currently some (but not all[1]) are documented to return "NaN".
>>> An alternative could be to throw an exception.
>> 
>> The convention in the java.lang.Math class is to return NaN for things
>> that do not make sense, e.g.
>> 
>> Math.log(-1)
>> Math.asin(4)
> 
> But are we in the same kind of (wrong) usage when considering
> the argument to the above methods?
> I mean: If we ask the question of "What is the density at x?", is
> it really an error to reply "0" when outside the domain?

In the case of probabilities then returning 0 does not seem wrong.

It would be akin to the use of the instanceof operator, where you wish to do
something based on whether the object is of the correct type. Here you wish to
have a probability for a value. If the value is outside the support then it has
no probability: you return zero and the caller can do any computation they want
based on it having no probability.

As I mentioned, popular R and Python implementations return zero for
out-of-domain cases. So this behaviour would not be unprecedented.

I previously checked the gamma distribution. The same is true for others I’ve 
just checked, e.g. a Binomial in R:

> dbinom(-1, size=12, prob=0.2) 
[1] 0
> dbinom(44, size=12, prob=0.2) 
[1] 0

Or scipy:

>>> from scipy.stats import binom
>>> n, p = 12, 0.2
>>> binom.pmf(-1, n, p)
0.0
>>> binom.pmf(44, n, p)
0.0
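
For comparison, here is a minimal Java sketch of the same "return zero outside the support" convention. The class and method names are hypothetical and are not the Commons Statistics API; the binomial coefficient is computed with a simple loop that is only meant for small n.

```java
final class BinomialPmfSketch {
    /** Hypothetical helper: P(X = k) for X ~ Binomial(n, p), with 0 outside [0, n]. */
    static double probability(int k, int n, double p) {
        if (k < 0 || k > n) {
            return 0.0; // outside the support: the value has no probability
        }
        // Binomial coefficient C(n, k) built up as a running product.
        double coeff = 1.0;
        for (int i = 1; i <= k; i++) {
            coeff = coeff * (n - k + i) / i;
        }
        return coeff * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }

    public static void main(String[] args) {
        System.out.println(probability(-1, 12, 0.2)); // 0.0
        System.out.println(probability(44, 12, 0.2)); // 0.0
        System.out.println(probability(3, 12, 0.2));  // a regular in-support value
    }
}
```

The caller can sum or multiply these values without any special handling of out-of-support arguments.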

> 
>> This leaves it as the responsibility of the caller to know when it may
>> be possible to pass in a bad value and so check the results.
>> 
>> It unfortunately leaves open the issue that not everyone will do that
>> and so their program can be brought to a stop by presence of NaN values
>> that may have appeared some way further back in the computation.
>> 
>> Throwing an exception seems to be the only way to preserve the stack
>> trace of where the computation went wrong.
>> 
>> So either case has merit.
>> 
>> What do other languages do? A few seem to return 0 for out of support.
>> 
>> I had a look at Python. Here there is not much consistency using scipy:
>> 
>> >>> import math
>> >>> from scipy.stats import gamma
>> >>> gamma.pdf(0.5, 1.99)
>> 0.3066586069413397
>> >>> gamma.pdf(-0.5, 1.99)
>> 0.0
>> >>> gamma.logpdf(-0.5, 1.99)
>> -inf
>> >>> math.log(0)
>> Traceback (most recent call last):
>>   File "", line 1, in 
>> ValueError: math domain error
>> 
>> So scipy returns 0 for the density function when outside support. It
>> returns -inf for the log of zero but Python's math function raises an
>> exception for the log of zero.
>> 
>> In R the behaviour is the same as python with the exception that the log
>> of zero is -Inf.
>> 
>> > dgamma(0, 2)
>> [1] 0
>> > dgamma(-1, 2)
>> [1] 0
>> > dgamma(-1, 2, log=TRUE)
>> [1] -Inf
>> > log(0)
>> [1] -Inf
>> 
>> So returning 0 is another option. However this cannot distinguish a
>> valid return of 0 from an error.
>> 
>> Note that if we did not have double as a return value then throwing an
>> exception would be the primary choice for signalling error as there is
>> no NaN for other numbers. However there are documented cases for
>> computations in the JDK which do not make sense that avoid throwing
>> exceptions as in Math.abs(int) for Integer.MIN_VALUE which still returns
>> a negative.
>> 
>> I'm not a fan of static properties to configure the behaviour either
>> way. I don't think using zero is a good idea as it cannot signal
>> something is wrong.
>> 
>> I would favour one of the following:
>> 
>> - Provide alternative methods to return NaN or throw
>> - Always return NaN (which seems more Java conventional) and provide a
>> wrapper distribution that can wrap calls to density, logDensity and
>> cumulativeProbability and throw an exception if the underlying
>> distribution returns NaN.
>> - Always throw (which forces users to safe usage) and provide a wrapper
>> distribution that can wrap calls to density, logDensity and
>> cumulativeProbability and return NaN or zero if the underlying
>> distribution throws.
>> 
>> When considering the situation where you can create a distribution with
>> a bad value and you get an exception, but you can use a distribution
>> with a bad value and you get NaN it seems to me that throwing an
>> exception may be the more sensible approach. A wrapper to guard
>> exceptions can be user configurable to return NaN or zero.
> 
> Instantiating and raising an exception is (relatively) costly.
> So if the "return NaN" feature is used in a use-case where performance
> matters, the wrapper would spoil the intended purpose.

Yes. On more reflection it would be 

Re: [Statistics] Convention when outside support?

2019-11-29 Thread Gilles Sadowski
Hi.

On Fri, 29 Nov 2019 at 18:41, Alex Herbert wrote:
>
> On 29/11/2019 16:48, Gilles Sadowski wrote:
> > Hello.
> >
> > For all implemented distributions, what convention should be adopted
> > when methods
> >   * density(x)
> >   * logDensity(x)
> >   * cumulativeProbability(x)
> > are called with "x" out of the "support" bounds?
> >
> > Currently some (but not all[1]) are documented to return "NaN".
> > An alternative could be to throw an exception.
>
> The convention in the java.lang.Math class is to return NaN for things
> that do not make sense, e.g.
>
> Math.log(-1)
> Math.asin(4)

But are we in the same kind of (wrong) usage when considering
the argument to the above methods?
I mean: If we ask the question of "What is the density at x?", is
it really an error to reply "0" when outside the domain?

> This leaves it as the responsibility of the caller to know when it may
> be possible to pass in a bad value and so check the results.
>
> It unfortunately leaves open the issue that not everyone will do that
> and so their program can be brought to a stop by presence of NaN values
> that may have appeared some way further back in the computation.
>
> Throwing an exception seems to be the only way to preserve the stack
> trace of where the computation went wrong.
>
> So either case has merit.
>
> What do other languages do? A few seem to return 0 for out of support.
>
> I had a look at Python. Here there is not much consistency using scipy:
>
>  >>> import math
>  >>> from scipy.stats import gamma
>  >>> gamma.pdf(0.5, 1.99)
> 0.3066586069413397
>  >>> gamma.pdf(-0.5, 1.99)
> 0.0
>  >>> gamma.logpdf(-0.5, 1.99)
> -inf
>  >>> math.log(0)
> Traceback (most recent call last):
>   File "", line 1, in
> ValueError: math domain error
>
> So scipy returns 0 for the density function when outside support. It
> returns -inf for the log of zero but Python's math function raises an
> exception for the log of zero.
>
> In R the behaviour is the same as python with the exception that the log
> of zero is -Inf.
>
>  > dgamma(0, 2)
> [1] 0
>  > dgamma(-1, 2)
> [1] 0
>  > dgamma(-1, 2, log=TRUE)
> [1] -Inf
>  > log(0)
> [1] -Inf
>
> So returning 0 is another option. However this cannot distinguish a
> valid return of 0 from an error.
>
> Note that if we did not have double as a return value then throwing an
> exception would be the primary choice for signalling error as there is
> no NaN for other numbers. However there are documented cases for
> computations in the JDK which do not make sense that avoid throwing
> exceptions as in Math.abs(int) for Integer.MIN_VALUE which still returns
> a negative.
>
> I'm not a fan of static properties to configure the behaviour either
> way. I don't think using zero is a good idea as it cannot signal
> something is wrong.
>
> I would favour one of the following:
>
> - Provide alternative methods to return NaN or throw
> - Always return NaN (which seems more Java conventional) and provide a
> wrapper distribution that can wrap calls to density, logDensity and
> cumulativeProbability and throw an exception if the underlying
> distribution returns NaN.
> - Always throw (which forces users to safe usage) and provide a wrapper
> distribution that can wrap calls to density, logDensity and
> cumulativeProbability and return NaN or zero if the underlying
> distribution throws.
>
> When considering the situation where you can create a distribution with
> a bad value and you get an exception, but you can use a distribution
> with a bad value and you get NaN it seems to me that throwing an
> exception may be the more sensible approach. A wrapper to guard
> exceptions can be user configurable to return NaN or zero.

Instantiating and raising an exception is (relatively) costly.
So if the "return NaN" feature is used in a use-case where performance
matters, the wrapper would spoil the intended purpose.
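
As a rough illustration of that cost, the sketch below counts how many exception objects a NaN-returning wrapper creates around a throwing density. All names are hypothetical; the exponential density is just a stand-in. Each out-of-support call instantiates one exception (and fills in its stack trace) only for it to be discarded immediately.

```java
final class ThrowingDensityCost {
    /** Hypothetical exception type that counts how many instances were created. */
    static final class OutOfSupportException extends IllegalArgumentException {
        private static final long serialVersionUID = 1L;
        static int created;

        OutOfSupportException(double x) {
            super("x out of support: " + x);
            created++;
        }
    }

    /** Exponential(1) density that throws outside [0, +inf). */
    static double throwingDensity(double x) {
        if (x < 0) {
            throw new OutOfSupportException(x);
        }
        return Math.exp(-x);
    }

    /** Wrapper that converts the exception back into NaN. */
    static double nanDensity(double x) {
        try {
            return throwingDensity(x);
        } catch (OutOfSupportException e) {
            return Double.NaN;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            nanDensity(-1.0 - i); // every call is outside the support
        }
        System.out.println("exceptions created: " + OutOfSupportException.created);
    }
}
```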

Gilles

>
> Alex
> > Regards,
> > Gilles
> >
> > [1] https://issues.apache.org/jira/projects/MATH/issues/MATH-1503
> >

-
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org



Re: [Statistics] Convention when outside support?

2019-11-29 Thread Alex Herbert

On 29/11/2019 16:48, Gilles Sadowski wrote:

> Hello.
>
> For all implemented distributions, what convention should be adopted
> when methods
>   * density(x)
>   * logDensity(x)
>   * cumulativeProbability(x)
> are called with "x" out of the "support" bounds?
>
> Currently some (but not all[1]) are documented to return "NaN".
> An alternative could be to throw an exception.


The convention in the java.lang.Math class is to return NaN for things 
that do not make sense, e.g.


Math.log(-1)
Math.asin(4)

This leaves it as the responsibility of the caller to know when it may 
be possible to pass in a bad value and so check the results.


It unfortunately leaves open the issue that not everyone will do that,
and so their program can be brought to a stop by the presence of NaN values
that may have appeared some way further back in the computation.
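
A small Java sketch of how this happens: one bad argument silently turns the whole result into NaN, with nothing to indicate where it came from. The meanLog helper is hypothetical and only there for illustration.

```java
final class NaNPropagationSketch {
    /** Mean of log(x) over the inputs; a single negative input poisons the result. */
    static double meanLog(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += Math.log(v); // Math.log(-1) is NaN, and NaN + anything is NaN
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        double result = meanLog(1.0, Math.E, -1.0, 10.0);
        System.out.println(result);                    // NaN
        System.out.println(result > 0 || result <= 0); // false: comparisons with NaN are false
        System.out.println(Double.isNaN(result));      // true: the explicit check the caller must remember
    }
}
```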


Throwing an exception seems to be the only way to preserve the stack 
trace of where the computation went wrong.


So either case has merit.

What do other languages do? A few seem to return 0 for out of support.

I had a look at Python. Here there is not much consistency using scipy:

>>> import math
>>> from scipy.stats import gamma
>>> gamma.pdf(0.5, 1.99)
0.3066586069413397
>>> gamma.pdf(-0.5, 1.99)
0.0
>>> gamma.logpdf(-0.5, 1.99)
-inf
>>> math.log(0)
Traceback (most recent call last):
  File "", line 1, in 
ValueError: math domain error

So scipy returns 0 for the density function when outside support. It
returns -inf for the log of zero but Python's math function raises an
exception for the log of zero.


In R the behaviour is the same as python with the exception that the log 
of zero is -Inf.


> dgamma(0, 2)
[1] 0
> dgamma(-1, 2)
[1] 0
> dgamma(-1, 2, log=TRUE)
[1] -Inf
> log(0)
[1] -Inf

So returning 0 is another option. However this cannot distinguish a 
valid return of 0 from an error.
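
To make the ambiguity concrete, here is a Java sketch mirroring the two dgamma calls above, assuming shape 2 and rate 1 so that the density is x * exp(-x). The class is hypothetical. The value at the boundary and the value outside the support are indistinguishable to the caller.

```java
final class ZeroAmbiguitySketch {
    /** Gamma(shape = 2, rate = 1) density, x * exp(-x), returning 0 outside [0, +inf). */
    static double density(double x) {
        if (x < 0) {
            return 0.0; // out of support
        }
        return x * Math.exp(-x);
    }

    public static void main(String[] args) {
        System.out.println(density(0.0));  // 0.0, a valid density inside the support
        System.out.println(density(-1.0)); // 0.0, but x was outside the support
    }
}
```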


Note that if we did not have double as a return value then throwing an 
exception would be the primary choice for signalling error as there is 
no NaN for other numbers. However there are documented cases for 
computations in the JDK which do not make sense that avoid throwing 
exceptions as in Math.abs(int) for Integer.MIN_VALUE which still returns 
a negative.
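
The Math.abs case can be checked directly; the helper name below is only for illustration.

```java
final class AbsMinValueSketch {
    /** True when Math.abs fails to produce a non-negative result. */
    static boolean absIsNegative(int value) {
        return Math.abs(value) < 0;
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(Integer.MIN_VALUE));      // -2147483648
        System.out.println(absIsNegative(Integer.MIN_VALUE)); // true
        System.out.println(absIsNegative(-5));                // false
    }
}
```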


I'm not a fan of static properties to configure the behaviour either 
way. I don't think using zero is a good idea as it cannot signal 
something is wrong.


I would favour one of the following:

- Provide alternative methods to return NaN or throw
- Always return NaN (which seems more Java conventional) and provide a 
wrapper distribution that can wrap calls to density, logDensity and 
cumulativeProbability and throw an exception if the underlying 
distribution returns NaN.
- Always throw (which forces users to safe usage) and provide a wrapper 
distribution that can wrap calls to density, logDensity and 
cumulativeProbability and return NaN or zero if the underlying 
distribution throws.
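
A minimal sketch of the second option (always return NaN, plus a throwing wrapper). The Distribution interface here is a cut-down hypothetical stand-in, not the actual Commons Statistics interface, and the uniform distribution is only an example.

```java
final class NaNCheckingWrapperSketch {
    /** Hypothetical minimal view of a distribution. */
    interface Distribution {
        double density(double x);
        double logDensity(double x);
        double cumulativeProbability(double x);
    }

    /** Uniform(0, 1) that returns NaN outside its support. */
    static final class NaNUniform implements Distribution {
        private static boolean inSupport(double x) {
            return x >= 0 && x <= 1; // false for NaN as well
        }
        @Override public double density(double x) { return inSupport(x) ? 1.0 : Double.NaN; }
        @Override public double logDensity(double x) { return inSupport(x) ? 0.0 : Double.NaN; }
        @Override public double cumulativeProbability(double x) { return inSupport(x) ? x : Double.NaN; }
    }

    /** Wrapper that turns a NaN result into an exception naming the offending argument. */
    static final class Checked implements Distribution {
        private final Distribution delegate;

        Checked(Distribution delegate) {
            this.delegate = delegate;
        }

        private static double check(double result, double x) {
            if (Double.isNaN(result)) {
                throw new IllegalArgumentException("x out of support: " + x);
            }
            return result;
        }
        @Override public double density(double x) { return check(delegate.density(x), x); }
        @Override public double logDensity(double x) { return check(delegate.logDensity(x), x); }
        @Override public double cumulativeProbability(double x) {
            return check(delegate.cumulativeProbability(x), x);
        }
    }

    public static void main(String[] args) {
        Distribution strict = new Checked(new NaNUniform());
        System.out.println(strict.density(0.5)); // 1.0
        try {
            strict.cumulativeProbability(2.0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // x out of support: 2.0
        }
    }
}
```

The wrapper only costs a Double.isNaN check on the normal path, and the stack trace points at the call that passed the bad value.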


When considering the situation where you can create a distribution with 
a bad value and you get an exception, but you can use a distribution 
with a bad value and you get NaN it seems to me that throwing an 
exception may be the more sensible approach. A wrapper to guard 
exceptions can be user configurable to return NaN or zero.
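
A sketch of such a guard, using DoubleUnaryOperator from the JDK to stand in for a single distribution method. The class and method names are hypothetical, and the exponential density is only an example of a function that throws outside its support.

```java
import java.util.function.DoubleUnaryOperator;

final class FallbackGuardSketch {
    /** Exponential(1) density that throws outside [0, +inf), as in the "always throw" option. */
    static double strictDensity(double x) {
        if (!(x >= 0)) { // also rejects NaN
            throw new IllegalArgumentException("x out of support: " + x);
        }
        return Math.exp(-x);
    }

    /** Wraps a throwing function so that out-of-support calls return the chosen fallback. */
    static DoubleUnaryOperator guard(DoubleUnaryOperator f, double fallback) {
        return x -> {
            try {
                return f.applyAsDouble(x);
            } catch (IllegalArgumentException e) {
                return fallback;
            }
        };
    }

    public static void main(String[] args) {
        DoubleUnaryOperator nanGuard = guard(FallbackGuardSketch::strictDensity, Double.NaN);
        DoubleUnaryOperator zeroGuard = guard(FallbackGuardSketch::strictDensity, 0.0);
        System.out.println(nanGuard.applyAsDouble(-1.0));  // NaN
        System.out.println(zeroGuard.applyAsDouble(-1.0)); // 0.0
        System.out.println(zeroGuard.applyAsDouble(0.0));  // 1.0
    }
}
```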


Alex

> Regards,
> Gilles
>
> [1] https://issues.apache.org/jira/projects/MATH/issues/MATH-1503




Re: [Statistics] Convention when outside support?

2019-11-29 Thread Fran Lattanzio
Hi,

I was involved in a similar debate on a different project, and we came to the 
conclusion that (double -> double) methods in Java should return NaN in the 
case of invalid arguments, rather than throw Exceptions. 

Our reasoning was by analogy with how IEEE 754 floating-point exceptions are 
handled by Java. Obviously, the definition of a floating-point exception is 
quite different from a Java exception. But anyway, our question was, how should 
raising an exception in the floating-point world map to throwing an exception 
in Java? The core Java libraries effectively behave as if all floating-point 
traps are disabled*: Overflow results in an infinity, underflow in 
subnormal/zero, square root of negative returns NaN, etc.

Based on this, we decided that returning NaN is the “best” behavior, since this
is what the IEEE spec says to do when the invalid-operation trap is disabled.
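
The default behaviors listed above are easy to confirm in plain Java; the class name is just for illustration.

```java
final class FloatingPointDefaultsSketch {
    public static void main(String[] args) {
        System.out.println(Double.MAX_VALUE * 2);  // Infinity: overflow
        System.out.println(Double.MIN_VALUE / 2);  // 0.0: underflow to zero
        System.out.println(Double.MIN_NORMAL / 2); // a subnormal value, still greater than zero
        System.out.println(Math.sqrt(-1));         // NaN: invalid operation
        System.out.println(1.0 / 0.0);             // Infinity: division by zero
    }
}
```

None of these operations throws a Java exception; the caller only sees the default result.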

Fran.

* = We did discuss having a kind of floating-point signal policy that would 
change the behavior from returning a default value to throwing a (Java) 
exception when these floating-point exceptions were detected. But this would be 
a complex implementation problem, not least because incorporating this into 
existing numerical libraries would be difficult to impossible.



> On Nov 29, 2019, at 11:48 AM, Gilles Sadowski  wrote:
> 
> Hello.
> 
> For all implemented distributions, what convention should be adopted
> when methods
> * density(x)
> * logDensity(x)
> * cumulativeProbability(x)
> are called with "x" out of the "support" bounds?
> 
> Currently some (but not all[1]) are documented to return "NaN".
> An alternative could be to throw an exception.
> 
> Regards,
> Gilles
> 
> [1] https://issues.apache.org/jira/projects/MATH/issues/MATH-1503