Re: [OctDev] Handling NaN and NA - no need for NA

Alois Schlögl Wed, 29 Feb 2012 14:18:13 -0800

On 02/29/2012 03:57 PM, Jordi Gutiérrez Hermoso wrote:
> 2012/2/29 Alois Schloegl<alois.schlo...@ist.ac.at>:
>> if you believe, that I'm doing the NaN-tb because of a petty war you are
>> grossly mistaken. The NaN-toolbox tries to solve a real issue - and it
>> does it very well, I think. I also do not understand your issue - You do
>> not need to use the NaN-toolbox if you do not like it. So what is your
>> issue?
> The problem is that people continuously are installing all of
> Octave-Forge without paying any attention to what they are installing
> or why. This is particularly true of Octave installations for Windows
> and McOS 10. Thus a lot of users are shadowing core functions without
> really understanding the issue behind it.


For how long have Windows and Mac users been installing the NaN toolbox
accidentally? And how many problems caused by the NaN-tb have been
reported to you? So far none have been reported to me, and I should
know. So I guess this is really a non-issue for the users of Octave.

The only issue is that people get the warning about shadowed functions,
and they feel insecure about it. But I also see that these warnings are
necessary, and there is no way around them, because shadowing can be a
bad thing. However, shadowing by the NaN-tb happens only for statistical
functions where the skipping of NaNs is well justified (you might
disagree; I come back to this below). The shadowing functions affect
only the NaN-handling behavior, so the differences are minor and only a
concern if someone relies on NaN-propagation in a few statistical
functions. It hardly causes any problems if some users use the
NaN-toolbox accidentally. The only problem might arise if someone uses a
different exception handling strategy based on NaN; however, these users
should know how to deal with NaN and whether their approach is
compatible with the NaN-tb or not.

For these reasons we've not seen any problems related to the shadowing 
by the NaN-tb. So I think this is a non-issue and does not require any 
action.


> The other problems are that you seem to be unhappy that Octave now
> warns when core functions are shadowed, and you also repeatedly insist
> that Octave core functions are wrong and are in need of being fixed by
> you.

If you feel offended by the language, let me know and make suggestions
on how to improve it. However, please take into account that I am trying
to demonstrate with the NaN-toolbox an alternative concept that I think
is beneficial and an improvement over the standard solution.


> Shadowing core functions is also inconvenient from the user point of
> view because to enable and disable the NaN-skipping behaviour, you
> have to load/unload a whole package, instead of a simple runtime flag
> to do this or not.
>

There is a flag, flag_implicit_skip_nan(), that can switch the
nan-skipping behavior into a nan-propagating behavior. I do not
advertise or endorse its use, because the user does not need it, and it
could be abused to make the code unreadable because it relies on
side effects. The flag is there only for testing, but it's there if you
really need it.


>> Concerning your question: NA-skipping instead of NaN-tb is not a
>> solution, at least not for the NaN-toolbox for the following reason:
>>
>> o) When you compute in statistics some expectation value, it does not
>> matter whether there is a NA or a NAN, both should be skipped.
> This does not make sense to me. Why should NaN be skipped if it arose
> from an incorrect computation? It only makes sense to me to skip them
> if they are representing missing data, not if they are representing an
> incorrect computation.

With "incorrect computation", I assume you mean an operation resulting
in an undefined value (like 0/0 or inf-inf). Yielding NaN in such cases
is not an "incorrect" but a correct computation, and in agreement with
IEEE 754. The meaning of NaN is that of an "undefined value". You
might want to use this to signal an exception, but in statistics you
will ignore it and compute the statistic from the other available samples.
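
As a small illustration of the two operations mentioned above (a sketch
in plain Octave, using only core functions; the variable names are
chosen here for illustration):

```octave
% Undefined operations return NaN under IEEE 754 instead of raising an error.
a = 0/0;          % zero divided by zero
b = Inf - Inf;    % infinity minus infinity

% NaN compares unequal to everything, including itself,
% so isnan() is the way to detect it.
is_undefined = [isnan(a), isnan(b)];
self_equal   = (a == a);
```

Both entries of is_undefined are true, while self_equal is false.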

Let's look at an example: you have two large vectors x and y and want to
compute the average ratio x(k)/y(k). There might be cases where, for
some k, both x(k) and y(k) are zero, resulting in NaN. It is reasonable
to compute the average (i.e. the statistical mean) from the remaining
samples.

The standard solution would be
(1)   m = nanmean(x./y)

or

(2)   z = x./y;
      z(isnan(z)) = [];
      m = mean(z)

With the NaN-toolbox, you just need:

(3)   m = mean(x./y)

In general the story ends here: the average ratio is computed, and you do
not need anything else (that's the use case I generally observe).
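
To make the example concrete, here is a minimal sketch with made-up
data (the values of x and y are assumptions for illustration only). It
computes the mean with sum() and nnz() rather than mean(), so the result
does not depend on whether the NaN-tb is shadowing mean():

```octave
% Illustrative data: at two positions both x(k) and y(k) are zero,
% so the element-wise ratio is NaN there.
x = [1 0 4 0 9];
y = [2 0 8 0 3];

z = x ./ y;                       % [0.5 NaN 0.5 NaN 3]
valid = ~isnan(z);                % logical mask of the defined ratios
m = sum(z(valid)) / nnz(valid);   % mean over the remaining samples
```

Three of the five ratios are defined here, so m is (0.5 + 0.5 + 3)/3.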


Now let's see what we would gain with NA's and an NA-skipping mean():
(4)   z = x./y;
      z(isnan(z)) = NA;
      m = mean(z)
I do not see any advantage in this.


Assume that, in some rare case, you want to do some exception
handling that relies on NaN-propagation. In general, this is not the
case for the shadowed statistical functions.

The standard solution (works only w/o the NaN-tb):

(5)   m = mean(x./y)
       if isnan(m), do_exception_handling(); end;

The following solution will always work, regardless of whether the
NaN-tb is installed or not:

(6)   z = x./y;
      if any(isnan(z)), do_exception_handling(); end;
      m = mean(z)
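
A runnable variant of this pattern might look like the sketch below.
The data are made up, and a printf() call stands in for
do_exception_handling(), which is only a placeholder name in this
message. Unlike (6), the sketch also skips the NaN's explicitly before
calling mean(), so m holds the same estimate with either version of
mean():

```octave
% Illustrative data: the second ratio is 0/0 and therefore NaN.
x = [3 0 5];
y = [6 0 10];

z = x ./ y;                     % [0.5 NaN 0.5]
nan_found = any(isnan(z));      % check the data itself, before any mean() call
if nan_found
  % stand-in for do_exception_handling()
  printf("input contains %d NaN value(s)\n", nnz(isnan(z)));
end
m = mean(z(~isnan(z)));         % explicit skipping, independent of the NaN-tb
```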

However, the NaN-toolbox also provides the following functionality:

(7)   m = mean(x./y)
       if flag_nans_occured(), do_exception_handling(); end;

The function flag_nans_occured() tells you whether the input data
contained any NaNs. Note that this solution is as short as (4), with
the added benefit that m contains an estimate even if the input
contains NaN.


In any case, the point is that it's legitimate to skip NaN's that are
caused by a computational operation. And if we used NA's, what would we
gain? Nothing.



>> - NA do not make things simpler but more complicated. There are no clear
>> rules when NA and when NAN's should be used.
> They are very clear: everything is a NaN unless the user specifically
> requests a NA.

I've never felt the need to use NA, and I've worked a lot with data
containing missing values and NaN. NaN's were always good enough.

>> - NA can cause a significant performance penalty. ISNAN() is supported
>> by hardware, but ISNA() needs to analyze the payload of NaN which is
>> much more complicated.
> This is a legitimate concern. Checking for NA is indeed slower by
> about a factor of ten.
>

So, why should anyone want to use NA's ?
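
For reference, the distinction under discussion can be seen with core
Octave functions alone. This is a minimal sketch; the vector v is made
up for illustration. Since NA is a NaN with a particular payload,
isnan() is true for both, and only isna() tells them apart:

```octave
v = [1 NaN NA];

nan_mask = isnan(v);   % true for NaN and for NA
na_mask  = isna(v);    % true for NA only
```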

>> Some final remarks on NA. Nobody is using it, and I really do not see
>> any advantage of NA. If NA's would provide a solution, why do the
>> statistical core functions of Octave not use it?
> They use it in R. The reason Octave has it was because there was a
> desire to have symmetrical data exchange between R and Octave. The
> reason that NA is not really used in Octave is because nobody really
> found a need to implement this behaviour until now. R has several
> functions that accept a predicate that skips NA or maybe NaN, but they
> don't skip NaN by default. If R, which is specifically tailored for
> statistics, doesn't skip NaN by default, why do you think Octave
> should?

I do not know why R needs NA's. I know that I do not see a need to
distinguish between NaN's and NA's when I handle data with missing
values in Octave. And so far, the users of Octave have done well without
NA's.


Alois

> - Jordi G. H.



_______________________________________________
Octave-dev mailing list
Octave-dev@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/octave-dev
