Re: [Numpy-discussion] min() of array containing NaN
> > Availability of the NaN functionality in a method of ndarray > > The last point is key. The NaN behavior is central to analyzing real > data containing unavoidable bad values, which is the bread and butter > of a substantial fraction of the user base. In the languages they're > switching from, handling NaNs is just part of doing business, and is > an option of every relevant routine; there's no need for redundant > sets of routines. In contrast, numpy appears to consider data > analysis to be secondary, somehow, to pure math, and takes the NaN > functionality out of routines like min() and std(). This means it's > not possible to use many ndarray methods. If we're ready to handle a > NaN by returning it, why not enable the more useful behavior of > ignoring it, at user discretion? > Maybe I missed this somewhere, but this seems like a better use for masked arrays, not NaN's. Masked arrays were specifically designed to add functions that work well with masked/invalid data points. Why reinvent the wheel here? Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
> If you're willing to do arithmetic you might even be able to
> pull it off, since NaNs tend to propagate:
> if (new
> Whether the speed of this is worth its impenetrability I couldn't say.

Code comments cure impenetrability, and have no cost in speed. One
could write a paragraph explaining it (if it really needed that much).
The comments could even reference the current discussion.

--jh--
Re: [Numpy-discussion] min() of array containing NaN
Anne Archibald:
> Sadly, it's not possible without extra overhead. Specifically: the
> NaN-ignorant implementation does a single comparison between each
> array element and a placeholder, and decides based on the result which
> to keep.

Did my example code go through? The test for NaN only needs to be
done when a new min value is found, which will occur something like
O(log(n)) times in a randomly distributed array.

(Here's the hand-waving. The first element requires a NaN check. The
second has a 1/2 chance of being the new minimum. The third has a 1/3
chance, etc. The sum of the harmonic series goes as O(ln(n)).)

This depends on a double inversion, so that the test for a new min value
and the test for NaN occur at the same time. Here's pseudocode:

    best = array[0]
    if isnan(best):
        return best
    for item in array[1:]:
        # True when item is a new minimum OR item is a NaN
        if not (best <= item):
            best = item
            if isnan(best):
                return best
    return best

> If you're willing to do two tests, sure, but that's overhead (and
> probably comparable to an isnan).

In Python the extra inversion costs an extra PVM instruction. In C,
by comparison, the resulting assembly code for "best > item" and
"!(best <= item)" has identical length, with no real performance
difference.

There's no extra cost for doing the extra inversion in the common
case, and for large arrays the ratio of (NaN check) / (no check) -> 1.0.

> What do compilers' min builtins do with NaNs? This might well be
> faster than an if statement even in the absence of NaNs...

This comes from a g++ implementation of min:

    /**
     *  @brief This does what you think it does.
     *  @param  a  A thing of arbitrary type.
     *  @param  b  Another thing of arbitrary type.
     *  @return   The lesser of the parameters.
     *
     *  This is the simple classic generic implementation.  It will work on
     *  temporary expressions, since they are only evaluated once, unlike a
     *  preprocessor macro.
     */
    template<typename _Tp>
      inline const _Tp&
      min(const _Tp& __a, const _Tp& __b)
      {
        // concept requirements
        __glibcxx_function_requires(_LessThanComparableConcept<_Tp>)
        //return __b < __a ? __b : __a;
        if (__b < __a)
          return __b;
        return __a;
      }

The isnan function in another version of gcc goes through a bunch of
#defines, leading to

    static __inline__ int __inline_isnanf( float __x )
        { return __x != __x; }
    static __inline__ int __inline_isnand( double __x )
        { return __x != __x; }
    static __inline__ int __inline_isnan( long double __x )
        { return __x != __x; }

Andrew
[EMAIL PROTECTED]
Re: [Numpy-discussion] min() of array containing NaN
On 2008-08-14, Joe Harrington <[EMAIL PROTECTED]> wrote:
>> I'm doing nothing. Someone else must volunteer.
>
> Fair enough. Would the code be accepted if contributed?

Like I said, I would be amenable to such a change. The other
developers haven't weighed in on this particular proposal, but I
suspect they will agree with me.

>> There is a
>> reasonable design rule that if you have a boolean argument which you
>> expect to only be passed literal Trues and Falses, you should instead
>> just have two different functions.
>
> Robert, can you list some reasons to favor this design rule?

    nanmin(x)   vs.   min(x, nan=True)

A boolean argument that will almost always take literal Trues and
Falses basically is just a switch between different functionality. The
usual mechanism for the programmer to pick between different
functionality is to use the appropriate function.

The =True is extraneous, and puts important semantic information last
rather than at the front.

> Here are some reasons to favor richly functional routines:
>
> User's code is more readable because subtle differences affect args,
>    not functions

This isn't subtle.

> Easier learning for new users

You have no evidence of this.

> Much briefer and more readable docs

Briefer is possible. More readable is debatable. "Much" is overstating
the case.

> Similar behavior across languages

This is not, has never been, and never will be a goal. Similar
behavior happens because of convergent design constraints and
occasionally laziness, never for its own sake.

> Smaller number of functions in the core package (a recent list topic)

In general, this is a reasonable concern that must be traded off with
the other concerns. In this particular case, it has no weight.
nanmin() and nanmax() already exist.

> Many fewer routines to maintain, particularly if multiple switches exist

Again, in this case, neither of these is relevant. Yes, if there are
multiple boolean switches, it might make sense to keep them all in
the same function. Typically, these switches will also be affecting
the semantics only in minor details, too.

> Availability of the NaN functionality in a method of ndarray

Point, but see below.

> The last point is key. The NaN behavior is central to analyzing real
> data containing unavoidable bad values, which is the bread and butter
> of a substantial fraction of the user base. In the languages they're
> switching from, handling NaNs is just part of doing business, and is
> an option of every relevant routine; there's no need for redundant
> sets of routines. In contrast, numpy appears to consider data
> analysis to be secondary, somehow, to pure math, and takes the NaN
> functionality out of routines like min() and std(). This means it's
> not possible to use many ndarray methods. If we're ready to handle a
> NaN by returning it, why not enable the more useful behavior of
> ignoring it, at user discretion?

Let's get something straight. numpy has no opinion on the primacy of
data analysis tasks versus "pure math", however you want to define
those. Now, the numpy developers *do* tend to have an opinion on how
NaNs are used. NaNs were invented to handle invalid results of
*computations*. They were not invented as place markers for missing
data. They can frequently be used as such because the IEEE-754
semantics of NaNs sometimes work for missing data (e.g. in z=x+y, z
will have a NaN wherever either x or y has a NaN). But at least as
frequently, they don't, and other semantics need to be specifically
placed on top of them (e.g. nanmin()).

numpy is a general purpose computational tool that needs to apply to
many different fields and use cases. Consequently, when presented with
a choice like this, we tend to go for the path that makes the minimum
of assumptions and overlaid semantics.

Now to address the idea that all of the relevant ndarray methods
should take nan=True arguments. I am sympathetic to the idea that we
should have the functionality somewhere. I do doubt that the users you
are thinking about will be happy adding nan=True to a substantial
fraction of their calls. My experience with such APIs is that it gets
tedious real fast. Instead, I would suggest that if you want a wide
range of nan-skipping versions of the functions that we have, let's put
them all as functions into a module. This gives the programmer the
possibility of using relatively clean calls.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
Re: [Numpy-discussion] min() of array containing NaN
2008/8/14 Norbert Nemec <[EMAIL PROTECTED]>:
> Travis E. Oliphant wrote:
>> NAN's don't play well with comparisons because comparison with them is
>> undefined. See numpy.nanmin
>>
> This is not true! Each single comparison with a NaN has a well defined
> outcome. The difficulty is only that certain logical assumptions do not
> hold any more when NaNs are involved (e.g. [A<B] <=> [not(A>=B)]).
> Assuming an IEEE compliant processor and C compiler, it
> should be possible to code a NaN safe min routine without additional
> overhead.

Sadly, it's not possible without extra overhead. Specifically: the
NaN-ignorant implementation does a single comparison between each
array element and a placeholder, and decides based on the result which
to keep. If you try to rewrite the comparison to do the right thing
when a NaN is involved, you get stuck: any comparison with a NaN on
either side always returns False, so you cannot distinguish between
the temporary being a NaN and the new element being a non-NaN (keep
the temporary) and the temporary being a non-NaN and the new element
being a NaN (replace the temporary). If you're willing to do two
tests, sure, but that's overhead (and probably comparable to an
isnan). If you're willing to do arithmetic you might even be able to
pull it off, since NaNs tend to propagate:

if (new
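The order dependence described in this message can be reproduced in pure Python. This is only a minimal sketch under stated assumptions: `naive_min` is a hypothetical helper written for illustration, not NumPy's actual inner loop, and it relies solely on the IEEE-754 rule that ordering comparisons involving a NaN evaluate to False.

```python
import math

def naive_min(values):
    """Running minimum with a single comparison and no NaN handling.

    Hypothetical helper for illustration only.
    """
    best = values[0]
    for item in values[1:]:
        # An ordering comparison involving NaN is always False, so a NaN
        # in `item` is skipped, while a NaN held in `best` is never replaced.
        if item < best:
            best = item
    return best

nan = float("nan")
result_nan_inside = naive_min([0.0, 1.0, nan, 4.0])  # the NaN is silently skipped
result_nan_first = naive_min([nan, 0.0, 1.0, 4.0])   # the NaN sticks as the answer
```

The same data therefore yields either an ordinary number or a NaN depending only on where the NaN sits, which is the kind of surprise reported at the start of the thread.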
Re: [Numpy-discussion] min() of array containing NaN
Travis E. Oliphant wrote:
> Thomas J. Duck wrote:
>
>> Determining the minimum value of an array that contains NaN produces
>> a surprising result:
>>
>> >>> x = numpy.array([0,1,2,numpy.nan,4,5,6])
>> >>> x.min()
>> 4.0
>>
>> I expected 0.0. Is this the intended behaviour or a bug? I am using
>> numpy 1.1.1.
>>
> NAN's don't play well with comparisons because comparison with them is
> undefined. See numpy.nanmin
>
This is not true! Each single comparison with a NaN has a well defined
outcome. The difficulty is only that certain logical assumptions do not
hold any more when NaNs are involved (e.g. [A<B] <=> [not(A>=B)]).
Assuming an IEEE compliant processor and C compiler, it should be
possible to code a NaN safe min routine without additional overhead.
Re: [Numpy-discussion] min() of array containing NaN
> I'm doing nothing. Someone else must volunteer.

Fair enough. Would the code be accepted if contributed?

> There is a
> reasonable design rule that if you have a boolean argument which you
> expect to only be passed literal Trues and Falses, you should instead
> just have two different functions.

Robert, can you list some reasons to favor this design rule?

Here are some reasons to favor richly functional routines:

User's code is more readable because subtle differences affect args,
   not functions
Easier learning for new users
Much briefer and more readable docs
Similar behavior across languages
Smaller number of functions in the core package (a recent list topic)
Many fewer routines to maintain, particularly if multiple switches exist
Availability of the NaN functionality in a method of ndarray

The last point is key. The NaN behavior is central to analyzing real
data containing unavoidable bad values, which is the bread and butter
of a substantial fraction of the user base. In the languages they're
switching from, handling NaNs is just part of doing business, and is
an option of every relevant routine; there's no need for redundant
sets of routines. In contrast, numpy appears to consider data
analysis to be secondary, somehow, to pure math, and takes the NaN
functionality out of routines like min() and std(). This means it's
not possible to use many ndarray methods. If we're ready to handle a
NaN by returning it, why not enable the more useful behavior of
ignoring it, at user discretion?

--jh--
Re: [Numpy-discussion] min() of array containing NaN
On Wed, Aug 13, 2008 at 4:01 PM, Robert Kern <[EMAIL PROTECTED]> wrote: > On Wed, Aug 13, 2008 at 14:37, Joe Harrington <[EMAIL PROTECTED]> wrote: > >>On Tue, Aug 12, 2008 at 19:28, Charles R Harris > >><[EMAIL PROTECTED]> wrote: > >>> > >>> > >>> On Tue, Aug 12, 2008 at 5:13 PM, Andrew Dalke < > [EMAIL PROTECTED]> > >>> wrote: > > On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote: > > Er, is this actually a bug? I would instead consider the fact that > > np.min([]) raises an exception a bug of sorts - the identity of min > is > > inf. > >>> > >>> > >>> > > Personally, I expect that if my array 'x' has a NaN then > min(x) must be a NaN. > >>> > >>> I suppose you could use > >>> > >>> min(a,b) = (abs(a - b) + a + b)/2 > >>> > >>> which would have that effect. > > > >>Or we could implement the inner loop of the minimum ufunc to return > >>NaN if there is a NaN. Currently it just compares the two values > >>(which causes the unpredictable results since having a NaN on either > >>side of the < is always False). I would be amenable to that provided > >>that the C isnan() call does not cause too much slowdown in the normal > >>case. > > > > While you're doing that, can you do it so that if keyword nan=False it > > returns NaN if NaNs exist, and if keyword nan=True it ignores NaNs? > > I'm doing nothing. Someone else must volunteer. > I've volunteered to implement this functionality and will have some time over the weekend to prepare and post a patch for further discussion. -Kevin ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
Robert Kern wrote:
> Or we could implement the inner loop of the minimum ufunc to return
> NaN if there is a NaN. Currently it just compares the two values
> (which causes the unpredictable results since having a NaN on either
> side of the < is always False). I would be amenable to that provided
> that the C isnan() call does not cause too much slowdown in the normal
> case.

Reading this again, I realize that I don't know how ufuncs work, so
this suggestion might not be feasible.

It doesn't need to be unpredictable. Make sure the first value is
not a NaN (if it is, quit). The test against NaN always returns
false, so by inverting the comparison and then inverting the result you
end up with a test for "is a new minimum OR is NaN". (I checked the
assembly output. There's no effective difference in code length between
the normal and the inverted forms. I didn't test performance.)

For random values in the array the test should pass less and less
often, so sticking the isnan test in there has something like O(log(N))
cost instead of O(N) cost. That's handwaving, btw, but it's probably
a log because the effect is scale invariant.

Here's example code

    #include
    #include

    double nan_min(int n, double *data) {
      int i;
      double best = data[0];
      if (isnan(best)) {
        return best;
      }
      for (i=1; i
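The C listing in this message is cut off in the archive. The Python sketch below follows only the prose description (check the first element, then use an inverted comparison so that "new minimum" and "is NaN" are caught by the same test). It is an illustration under those assumptions, not a reconstruction of the author's C code, and the name `nan_propagating_min` is hypothetical.

```python
import math

def nan_propagating_min(data):
    """Return the minimum of `data`, or NaN if any element is NaN.

    Hypothetical helper; assumes `data` is a non-empty sequence of floats.
    """
    best = data[0]
    if math.isnan(best):
        return best
    for item in data[1:]:
        # `best <= item` is False both when item is smaller than best and
        # when item is NaN, so the negation catches both cases in one test.
        if not (best <= item):
            best = item
            # The isnan check runs only on the rare "new candidate" path.
            if math.isnan(best):
                return best
    return best
```

Because the NaN check sits inside the branch taken only when a new candidate appears, it is executed far less often than once per element for typical data, which is the point of the hand-waving O(log(N)) argument above.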
Re: [Numpy-discussion] min() of array containing NaN
On Wed, Aug 13, 2008 at 14:37, Joe Harrington <[EMAIL PROTECTED]> wrote: >>On Tue, Aug 12, 2008 at 19:28, Charles R Harris >><[EMAIL PROTECTED]> wrote: >>> >>> >>> On Tue, Aug 12, 2008 at 5:13 PM, Andrew Dalke <[EMAIL PROTECTED]> >>> wrote: On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote: > Er, is this actually a bug? I would instead consider the fact that > np.min([]) raises an exception a bug of sorts - the identity of min is > inf. >>> >>> >>> Personally, I expect that if my array 'x' has a NaN then min(x) must be a NaN. >>> >>> I suppose you could use >>> >>> min(a,b) = (abs(a - b) + a + b)/2 >>> >>> which would have that effect. > >>Or we could implement the inner loop of the minimum ufunc to return >>NaN if there is a NaN. Currently it just compares the two values >>(which causes the unpredictable results since having a NaN on either >>side of the < is always False). I would be amenable to that provided >>that the C isnan() call does not cause too much slowdown in the normal >>case. > > While you're doing that, can you do it so that if keyword nan=False it > returns NaN if NaNs exist, and if keyword nan=True it ignores NaNs? I'm doing nothing. Someone else must volunteer. But I'm not in favor of using a keyword argument. There is a reasonable design rule that if you have a boolean argument which you expect to only be passed literal Trues and Falses, you should instead just have two different functions. Since we already have names staked out for this alternate version (nanmin() and nanmax()), we might as well use them. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
>On Tue, Aug 12, 2008 at 19:28, Charles R Harris ><[EMAIL PROTECTED]> wrote: >> >> >> On Tue, Aug 12, 2008 at 5:13 PM, Andrew Dalke <[EMAIL PROTECTED]> >> wrote: >>> >>> On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote: >>> > Er, is this actually a bug? I would instead consider the fact that >>> > np.min([]) raises an exception a bug of sorts - the identity of min is >>> > inf. >> >> >> >>> >>> Personally, I expect that if my array 'x' has a NaN then >>> min(x) must be a NaN. >> >> I suppose you could use >> >> min(a,b) = (abs(a - b) + a + b)/2 >> >> which would have that effect. >Or we could implement the inner loop of the minimum ufunc to return >NaN if there is a NaN. Currently it just compares the two values >(which causes the unpredictable results since having a NaN on either >side of the < is always False). I would be amenable to that provided >that the C isnan() call does not cause too much slowdown in the normal >case. While you're doing that, can you do it so that if keyword nan=False it returns NaN if NaNs exist, and if keyword nan=True it ignores NaNs? We can argue which should be the default (see my prior post). Both are compatible with the current undefined behavior. I assume that the fastest way to do it is two separate loops for the separate cases, but it might be fast enough straight (with a conditional in the inner loop), or with some other trick (macro magic, function pointer, whatever). Thanks, --jh-- ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
Robert Kern wrote: > Or we could implement the inner loop of the minimum ufunc to return > NaN if there is a NaN. Currently it just compares the two values > (which causes the unpredictable results since having a NaN on either > side of the < is always False). I would be amenable to that provided > that the C isnan() call does not cause too much slowdown in the normal > case. +1 -- this seems to be the only reasonable option. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R(206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception [EMAIL PROTECTED] ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
On 12/08/08: 18:31, Charles R Harris wrote:
> On Tue, Aug 12, 2008 at 6:28 PM, Charles R Harris
> <[EMAIL PROTECTED]> wrote:
>> I suppose you could use
>>     min(a,b) = (abs(a - b) + a + b)/2
>> which would have that effect.
>
> Hmm, that is for the max, min would be
>     (a + b - |a - b|)/2

This would break when there is an overflow because of
addition/subtraction:

    def new_min(a, b):
        return (a + b - abs(a-b))/2

    a = 1e308
    b = -1e308

    new_min(a, b) # returns -inf
    min(a, b) # returns -1e308

--
Alok Singhal
http://www.astro.virginia.edu/~as8ca/
Re: [Numpy-discussion] min() of array containing NaN
On Tue, Aug 12, 2008 at 19:28, Charles R Harris <[EMAIL PROTECTED]> wrote: > > > On Tue, Aug 12, 2008 at 5:13 PM, Andrew Dalke <[EMAIL PROTECTED]> > wrote: >> >> On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote: >> > Er, is this actually a bug? I would instead consider the fact that >> > np.min([]) raises an exception a bug of sorts - the identity of min is >> > inf. > > > >> >> Personally, I expect that if my array 'x' has a NaN then >> min(x) must be a NaN. > > I suppose you could use > > min(a,b) = (abs(a - b) + a + b)/2 > > which would have that effect. Or we could implement the inner loop of the minimum ufunc to return NaN if there is a NaN. Currently it just compares the two values (which causes the unpredictable results since having a NaN on either side of the < is always False). I would be amenable to that provided that the C isnan() call does not cause too much slowdown in the normal case. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
On Tue, Aug 12, 2008 at 6:28 PM, Charles R Harris <[EMAIL PROTECTED] > wrote: > > > On Tue, Aug 12, 2008 at 5:13 PM, Andrew Dalke <[EMAIL PROTECTED]>wrote: > >> On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote: >> > Er, is this actually a bug? I would instead consider the fact that >> > np.min([]) raises an exception a bug of sorts - the identity of min is >> > inf. >> > > > >> >> Personally, I expect that if my array 'x' has a NaN then >> min(x) must be a NaN. >> > > I suppose you could use > > min(a,b) = (abs(a - b) + a + b)/2 > > which would have that effect. > Hmm, that is for the max, min would be (a + b - |a - b|)/2 > > Chuck > > > ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
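To make the two formulas concrete, here is a small Python sketch. The function names `arith_max` and `arith_min` are hypothetical, chosen only for this illustration. It shows which expression gives the maximum and which the minimum, and that a NaN in either argument propagates to the result because every arithmetic operation on a NaN yields a NaN.

```python
import math

def arith_max(a, b):
    # (a + b + |a - b|) / 2 selects the larger of the two values.
    return (a + b + abs(a - b)) / 2

def arith_min(a, b):
    # (a + b - |a - b|) / 2 selects the smaller of the two values.
    return (a + b - abs(a - b)) / 2

larger = arith_max(2.0, 5.0)
smaller = arith_min(2.0, 5.0)
with_nan = arith_min(2.0, float("nan"))  # NaN propagates through the arithmetic
```

As another message in this thread points out, the intermediate sums and differences can overflow for large magnitudes, so this is a demonstration of the propagation idea rather than a drop-in replacement for a comparison.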
Re: [Numpy-discussion] min() of array containing NaN
On Tue, Aug 12, 2008 at 5:13 PM, Andrew Dalke <[EMAIL PROTECTED]>wrote: > On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote: > > Er, is this actually a bug? I would instead consider the fact that > > np.min([]) raises an exception a bug of sorts - the identity of min is > > inf. > > > Personally, I expect that if my array 'x' has a NaN then > min(x) must be a NaN. > I suppose you could use min(a,b) = (abs(a - b) + a + b)/2 which would have that effect. Chuck ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
On Aug 12, 2008, at 9:54 AM, Anne Archibald wrote:
> Er, is this actually a bug? I would instead consider the fact that
> np.min([]) raises an exception a bug of sorts - the identity of min is
> inf.

That'll break consistency with the normal 'max' function in Python.

> Really nanmin of an array containing only nans should be the same
> as an empty array; both should be infinity.

One thing I expect is that if min(x) exists then there is some i where
x[i] "is" min(x). Returning +inf for min([NaN]) breaks that.

However, my expectation doesn't hold true for Python. If I use Python's
object identity test 'is' then object identity is lost in numpy.min,
although it is preserved under Python's min:

 >>> import numpy as np
 >>> x = [200, 300]
 >>> np.min(x)
200
 >>> np.min(x) is x[0]
False
 >>> min(x) is x[0]
True
 >>>

and if I use '==' for equality testing then my expectation will fail
if isnan(x[i]), because then x[i] != x[i].

 >>> import numpy as np
 >>> np.nan
nan
 >>> np.nan == np.nan
False

So when I say "is" I mean "acts the same as, except in some
strange corner cases".

Or to put it another way, it should be possible to implement a
hypothetical 'argnanmin' just like there is an 'argmin' which
complements 'min'.

> I guess this is a problem for types that don't have an infinity
> (returning maxint is a poor substitute), but what is the correct
> behaviour here?

"Doctor, doctor it hurts when I do this."
"Well, don't do that."

Raise an exception. Refuse the temptation to guess. Force the user
to handle this case as appropriate.

Personally, I expect that if my array 'x' has a NaN then
min(x) must be a NaN.

Andrew
[EMAIL PROTECTED]
Re: [Numpy-discussion] min() of array containing NaN
On Tue, Aug 12, 2008 at 10:02 AM, Thomas J. Duck <[EMAIL PROTECTED]> wrote:
>
> It is quite often the case that NaNs are unexpected, so it
> would be helpful to raise an Exception.

    from numpy import seterr
    seterr(all='warn')

This will emit a warning when encountering any kind of floating point
error. You can even use 'raise' instead of 'warn', in which case you
will get an exception.

cheers,

David
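A short sketch of what the error state does and does not catch may help here. It uses `np.errstate`, the context-manager form of `seterr`, which is an addition not mentioned in the message above. The sketch assumes that the 'raise' setting turns an invalid operation such as 0/0 into a `FloatingPointError` at the point where the NaN is created, and that merely inspecting an array which already contains a NaN triggers nothing.

```python
import numpy as np

x = np.array([0.0, 1.0])

# Under 'raise', the invalid operation 0/0 raises instead of quietly
# producing a NaN in the result.
caught = False
with np.errstate(invalid="raise"):
    try:
        x / x
    except FloatingPointError:
        caught = True

# The error state is about operations, not data: checking an array that
# already holds a NaN raises nothing, so an explicit test is still needed.
with np.errstate(all="raise"):
    has_nan = bool(np.isnan(np.array([0.0, np.nan])).any())
```

So this setting helps locate where NaNs are produced by computation, but NaNs that arrive in input data still have to be detected explicitly, for example with `np.isnan(...).any()`.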
Re: [Numpy-discussion] min() of array containing NaN
> It really isn't very hard to replace
> np.sum(A)
> with
> np.sum(A[~isnan(A)])
> if you want to ignore NaNs instead of propagating them. So I don't
> feel a need for special code in sum() that treats NaN as 0.

That's all well and good, until you want to set the axis= keyword.
Then you're stuck with looping. As doing stats for each pixel column
in a stack of astronomical images with bad pixels and cosmic-ray hits
is one of the most common actions in astronomical data analysis, this
is an issue for a significant number of current and future users.

>>> a=np.arange(9, dtype=float)
>>> a.shape=(3,3)
>>> a[1,1]=np.nan
>>> a
array([[  0.,   1.,   2.],
       [  3.,  nan,   5.],
       [  6.,   7.,   8.]])
>>> np.sum(a)
nan
>>> np.sum(a[~np.isnan(a)])
32.0

Good, but...

>>> np.sum(a[~np.isnan(a)], axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.5/site-packages/numpy/core/fromnumeric.py", line 634, in sum
    return sum(axis, dtype, out)
ValueError: axis(=1) out of bounds

Uh-oh...

>>> np.sum(a[~np.isnan(a)], axis=0)
32.0

Worse: wrong answer but not an exception, since

>>> a[~np.isnan(a)]
array([ 0.,  1.,  2.,  3.,  5.,  6.,  7.,  8.])

has the undesired side effect of irreversibly flattening the array.

--jh--
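For the sum case specifically, the flattening can be avoided without looping. The sketch below is one possible workaround, not a general answer to the proposal in this message: it replaces NaNs with the additive identity so that the array keeps its shape and axis= still works. It relies only on `np.where`, `np.isnan`, and `np.nansum`.

```python
import numpy as np

a = np.arange(9, dtype=float).reshape(3, 3)
a[1, 1] = np.nan

# Substitute 0.0 (the identity for addition) for each NaN; the shape is
# preserved, so a per-row reduction is still possible.
row_sums = np.where(np.isnan(a), 0.0, a).sum(axis=1)

# np.nansum applies the same idea directly.
row_sums_nansum = np.nansum(a, axis=1)
```

This trick depends on the reduction having a suitable identity value (0 for sum, +inf for min). Statistics such as mean or std would also need a per-row count of valid entries, which is part of why the thread asks for dedicated NaN-aware routines.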
Re: [Numpy-discussion] min() of array containing NaN
Christopher Barker wrote: > well, it's not a bug because the result if there is a NaN is > undefined. > However, it sure could trip people up. If you know there is likely > to be > a NaN in there, then you could use nanmin() or masked arrays. The > problem comes up when you have no idea there might be a NaN in > there, in > which case you get a bogus answer -- this is very bad. This is exactly what happened to me. I was getting crazy results when contour plotting with matplotlib, although the pcolor plots looked fine. In particular, the colorscale had incorrect limits. This led me to check the min() and max() values in my array, which were clearly wrong as illustrated by the pcolor plot. Further investigation revealed unexpected NaNs in my array. > Is there an error state that will trigger an error or warning in these > situations? Otherwise, I'd have to say that the default should be to > test for NaN's, and either raise an error or return NaN. If that > really > does slow things down too much, there could be a flag that lets you > turn > it off. It is quite often the case that NaNs are unexpected, so it would be helpful to raise an Exception. Thanks for all of the helpful discussion on this issue. -- Thomas J. Duck <[EMAIL PROTECTED]> Associate Professor, Department of Physics and Atmospheric Science, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 3J5. Tel: (902)494-1456 | Fax: (902)494-5191 | Lab: (902)494-3813 Web: http://aolab.phys.dal.ca/ ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
Anne Archibald wrote: > 2008/8/12 Joe Harrington <[EMAIL PROTECTED]>: > > >> So, I endorse extending min() and all other statistical routines to >> handle NaNs, possibly with a switch to turn it on if a suitably fast >> algorithm cannot be found (which is competitor IDL's solution). >> Certainly without a switch the default behavior should be to return >> NaN, not to return some random value, if a NaN is present. Otherwise >> the user may never know a NaN is present, and therefore has to check >> every use for NaNs. That constand manual NaN checking is slower and >> more error-prone than any numerical speed advantage. >> >> So to sum, proposed for statistical routnes: >> if NaN is not present, return value >> if NaN is present, return NaN >> if NaN is present and nan=True, return value ignoring all NaNs >> >> OR: >> if NaN is not present, return value >> if NaN is present, return value ignoring all NaNs >> if NaN is present and nan=True, return NaN >> >> I'd prefer the latter. IDL does the former and it is a pain to do >> /nan all the time. However, the latter might trip up the unwary, >> whereas the former never does. >> >> This would apply at least to: >> min >> max >> sum >> prod >> mean >> median >> std >> and possibly many others. >> > > For almost all of these the current behaviour is to propagate NaNs > arithmetically. For example, the sum of anything with a NaN is NaN. I > think this is perfectly sufficient, given how easy it is to strip out > NaNs if that's what you want. The issue that started this thread (and > the many other threads that have come up as users stub their toes on > this behaviour) is that min (and other functions based on comparisons) > do not propagate NaNs. If you do np.amin(A) and A contains NaNs, you > can't count on getting a NaN back, unlike np.mean or np.std. the fact > that you get some random value not the minimum just adds insult to > injury. 
(It is probably also true that the value you get back depends
> on how the array is stored in memory.)
>
> It really isn't very hard to replace
> np.sum(A)
> with
> np.sum(A[~isnan(A)])
> if you want to ignore NaNs instead of propagating them. So I don't
> feel a need for special code in sum() that treats NaN as 0. I would be
> content if the comparison-based functions propagated NaNs
> appropriately.
>
> If you did decide it was essential to make versions of the functions
> that removed NaNs, it would get you most of the way there to add an
> optional keyword argument to ufuncs' reduce method that skipped NaNs.
>
> Anne

Actually you probably need to use isfinite, because NumPy's support for
IEEE 754 means NaN is different from infinity. Also, doesn't this
require an additional temporary copy of A?

The problem I have with this is that you must always know in advance
that NaNs or infinities are present, and it assumes you want to ignore
them.

Alternatively, something simple like a new function.

Bruce

    import numpy as np

    def minnan(x, axis=None, out=None, hasnan=False):
        if hasnan:
            # Caller says NaNs may be present: ignore them.
            return np.nanmin(x, axis)
        elif np.isfinite(x).all():
            # All values finite: the plain minimum is safe.
            return np.min(x, axis, out)
        else:
            return np.nan  # actually should be something else here

    x = np.array([1, 2, np.nan, 4, 5, 6])
    y = np.array([1, 2, 3, 4, 5, 6])

    print('NumPy Min:', np.min(x))
    print('NumPy NaNMin:', np.nanmin(x))
    print('NumPy MinNaN:', minnan(x))
    print('NumPy MinNaN T:', minnan(x, hasnan=True))

    print('NumPy Min:', np.min(y))
    print('NumPy NaNMin:', np.nanmin(y))
    print('NumPy MinNaN:', minnan(y))
    print('NumPy MinNaN T:', minnan(y, hasnan=True))
Re: [Numpy-discussion] min() of array containing NaN
On Tue, Aug 12, 2008 at 1:46 AM, Andrew Dalke <[EMAIL PROTECTED]> wrote:

> Here's the implementation, from lib/function_base.py
>
> def nanmin(a, axis=None):
>     """Find the minimum over the given axis, ignoring NaNs.
>     """
>     y = array(a,subok=True)
>     if not issubclass(y.dtype.type, _nx.integer):
>         y[isnan(a)] = _nx.inf
>     return y.min(axis)

No wonder nanmin is slow. A C implementation would run at virtually the same speed as min. If there is interest, I'll be happy to code C versions.

A better solution would be to just support NaNs and Inf in the generic code.

-Kevin
Re: [Numpy-discussion] min() of array containing NaN
2008/8/12 Joe Harrington <[EMAIL PROTECTED]>:

> So, I endorse extending min() and all other statistical routines to
> handle NaNs, possibly with a switch to turn it on if a suitably fast
> algorithm cannot be found (which is competitor IDL's solution).
> Certainly without a switch the default behavior should be to return
> NaN, not to return some random value, if a NaN is present. Otherwise
> the user may never know a NaN is present, and therefore has to check
> every use for NaNs. That constant manual NaN checking is slower and
> more error-prone than any numerical speed advantage.
>
> So to sum, proposed for statistical routines:
>   if NaN is not present, return value
>   if NaN is present, return NaN
>   if NaN is present and nan=True, return value ignoring all NaNs
>
> OR:
>   if NaN is not present, return value
>   if NaN is present, return value ignoring all NaNs
>   if NaN is present and nan=True, return NaN
>
> I'd prefer the latter. IDL does the former and it is a pain to do
> /nan all the time. However, the latter might trip up the unwary,
> whereas the former never does.
>
> This would apply at least to:
>   min
>   max
>   sum
>   prod
>   mean
>   median
>   std
> and possibly many others.

For almost all of these the current behaviour is to propagate NaNs arithmetically. For example, the sum of anything with a NaN is NaN. I think this is perfectly sufficient, given how easy it is to strip out NaNs if that's what you want. The issue that started this thread (and the many other threads that have come up as users stub their toes on this behaviour) is that min (and other functions based on comparisons) do not propagate NaNs. If you do np.amin(A) and A contains NaNs, you can't count on getting a NaN back, unlike np.mean or np.std. The fact that you get some random value, not the minimum, just adds insult to injury. (It is probably also true that the value you get back depends on how the array is stored in memory.)

It really isn't very hard to replace
    np.sum(A)
with
    np.sum(A[~isnan(A)])
if you want to ignore NaNs instead of propagating them. So I don't feel a need for special code in sum() that treats NaN as 0. I would be content if the comparison-based functions propagated NaNs appropriately.

If you did decide it was essential to make versions of the functions that removed NaNs, it would get you most of the way there to add an optional keyword argument to ufuncs' reduce method that skipped NaNs.

Anne
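The two behaviours contrasted in this message can be sketched in a few lines. This is an editorial illustration, not NumPy's implementation; `propagating_min` and `sum_ignoring_nan` are hypothetical helper names, and the explicit isnan pass is the extra overhead discussed elsewhere in the thread.

```python
import numpy as np

def propagating_min(a):
    """Return NaN if any element is NaN, otherwise the ordinary minimum."""
    a = np.asarray(a, dtype=float)
    if np.isnan(a).any():
        return np.nan
    return a.min()

def sum_ignoring_nan(a):
    """Sum after stripping NaNs with a boolean mask, as suggested above."""
    a = np.asarray(a, dtype=float)
    return np.sum(a[~np.isnan(a)])
```

With these helpers, `propagating_min([0, 1, 2, np.nan, 4])` gives NaN regardless of where the NaN sits, while `sum_ignoring_nan([1, np.nan, 2])` adds only the two valid entries.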
Re: [Numpy-discussion] min() of array containing NaN
2008/8/12 Stéfan van der Walt <[EMAIL PROTECTED]>:
> Hi Andrew
>
> 2008/8/12 Andrew Dalke <[EMAIL PROTECTED]>:
>> This is buggy for the case of a list containing only NaNs.
>>
>> >>> import numpy as np
>> >>> np.NAN
>> nan
>> >>> np.min([np.NAN])
>> nan
>> >>> np.nanmin([np.NAN])
>> inf
>> >>>
>
> Thanks for the report. This should be fixed in r5630.

Er, is this actually a bug? I would instead consider the fact that np.min([]) raises an exception a bug of sorts - the identity of min is inf. Really nanmin of an array containing only nans should be the same as an empty array; both should be infinity.

I guess this is a problem for types that don't have an infinity (returning maxint is a poor substitute), but what is the correct behaviour here?

Anne
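The convention argued for here, where an empty or all-NaN input yields the identity of min, might look like the sketch below. It is an editorial illustration for floating-point input only; `nanmin_with_identity` is a hypothetical name, and it is not what NumPy's nanmin does.

```python
import numpy as np

def nanmin_with_identity(a):
    """Minimum of the non-NaN entries; +inf when there are none."""
    a = np.asarray(a, dtype=float)
    valid = a[~np.isnan(a)]
    if valid.size == 0:
        # Identity of min: min(inf, x) == x for every non-NaN x.
        return np.inf
    return valid.min()
```

Under this convention, splitting an array into pieces and taking the min of the per-piece results gives the same answer as one pass, even when a piece is empty or all NaN. The open question in the message remains: integer dtypes have no infinity to return.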
Re: [Numpy-discussion] min() of array containing NaN
Masked arrays are a bit clunky for something as simple and standard as NaN handling. They also have the inverse of the standard truth sense, at least as used in my field. 1 (or True) usually means the item is allowed, not denied, so that you can multiply the mask by the data to zero all bad values, add and subtract masks in sensible ways and get what's expected, etc. For example, in the "stacked, masked mean" image processing algorithm, you sum the data along an axis, sum the masks along that axis, and divide the results to get the mean image without bad pixels. This is much more accurate than taking a median, and admits error analysis, which the median does not (easily).

While the regular behavior is "just a ~ away", as Stefan pointed out to me once, that's not acceptable if the image cube is large and memory or speed are at issue, and it's also very prone to bugs if you're negating everything all the time.

Further, with ma you have to convert to using an entirely different and redundant set of routines instead of having the very standard handling of NaNs found in our competitor programs, such as IDL. The issue of not having an in-place method in ma was also raised earlier. I'll add the difficulty of converting code if a standard thing like NaN handling has to be simulated in multiple calls.

So, I endorse extending min() and all other statistical routines to handle NaNs, possibly with a switch to turn it on if a suitably fast algorithm cannot be found (which is competitor IDL's solution). Certainly without a switch the default behavior should be to return NaN, not to return some random value, if a NaN is present. Otherwise the user may never know a NaN is present, and therefore has to check every use for NaNs. That constant manual NaN checking is slower and more error-prone than any numerical speed advantage.

So to sum, proposed for statistical routines:
  if NaN is not present, return value
  if NaN is present, return NaN
  if NaN is present and nan=True, return value ignoring all NaNs

OR:
  if NaN is not present, return value
  if NaN is present, return value ignoring all NaNs
  if NaN is present and nan=True, return NaN

I'd prefer the latter. IDL does the former and it is a pain to do /nan all the time. However, the latter might trip up the unwary, whereas the former never does.

This would apply at least to:
  min
  max
  sum
  prod
  mean
  median
  std
and possibly many others.

--jh--
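The "stacked, masked mean" described in this message can be sketched as follows. This is an editorial illustration under stated assumptions: `stacked_masked_mean` is a hypothetical name, the stack axis is taken to be axis 0, and the mask uses the True = good convention the message prefers.

```python
import numpy as np

def stacked_masked_mean(cube, good):
    """Mean along axis 0 over the entries flagged good (True = usable)."""
    cube = np.asarray(cube, dtype=float)
    good = np.asarray(good, dtype=bool)
    # np.where instead of cube * good: NaN * 0 is NaN, so plain
    # multiplication would not zero out bad pixels that hold NaN.
    total = np.where(good, cube, 0.0).sum(axis=0)
    count = good.sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        # 0/0 gives NaN where no frame had a usable pixel.
        return total / count
```

For a stack whose bad pixels are already NaN, the mask can be built as `~np.isnan(cube)`. The per-pixel `count` is also what an error analysis would need, since it records how many frames contributed to each output pixel.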
Re: [Numpy-discussion] min() of array containing NaN
Hi Andrew

2008/8/12 Andrew Dalke <[EMAIL PROTECTED]>:
> This is buggy for the case of a list containing only NaNs.
>
> >>> import numpy as np
> >>> np.NAN
> nan
> >>> np.min([np.NAN])
> nan
> >>> np.nanmin([np.NAN])
> inf
> >>>

Thanks for the report. This should be fixed in r5630.

Regards
Stéfan
Re: [Numpy-discussion] min() of array containing NaN
On Aug 12, 2008, at 7:05 AM, Christopher Barker wrote:
> Actually, I think it skips over NaN -- otherwise, the min would always
> be zero if there were a NaN, and "a very small negative number" if
> there were a -inf.

Here's the implementation, from lib/function_base.py

def nanmin(a, axis=None):
    """Find the minimum over the given axis, ignoring NaNs.
    """
    y = array(a,subok=True)
    if not issubclass(y.dtype.type, _nx.integer):
        y[isnan(a)] = _nx.inf
    return y.min(axis)

This is buggy for the case of a list containing only NaNs.

>>> import numpy as np
>>> np.NAN
nan
>>> np.min([np.NAN])
nan
>>> np.nanmin([np.NAN])
inf
>>>

Andrew
[EMAIL PROTECTED]
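One way to keep the replace-NaN-with-inf approach while returning NaN for all-NaN input is sketched below. This is an editorial illustration, not the change made in r5630; `nanmin_sketch` is a hypothetical name, and unlike the quoted code it always converts the input to float.

```python
import numpy as np

def nanmin_sketch(a, axis=None):
    """Minimum ignoring NaNs; NaN wherever every entry was NaN."""
    y = np.array(a, dtype=float)       # work on a copy, as the quoted code does
    nan_mask = np.isnan(y)
    y[nan_mask] = np.inf               # NaNs can no longer win the comparison
    result = y.min(axis)
    # Where nothing but NaNs were seen, the inf is a placeholder, not data.
    all_nan = nan_mask.all(axis)
    return np.where(all_nan, np.nan, result)
```

Because the all-NaN test uses the original mask rather than the result, a genuine `inf` in the data is still returned as `inf`. With `axis=None` the return value is a zero-dimensional array rather than a Python float.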
Re: [Numpy-discussion] min() of array containing NaN
Bruce Southey wrote:
> Actually this could be viewed as a bug because it ignores the entries
> to the left of the NaN.

Well, it's not a bug, because the result if there is a NaN is undefined. However, it sure could trip people up. If you know there is likely to be a NaN in there, then you could use nanmin() or masked arrays. The problem comes up when you have no idea there might be a NaN in there, in which case you get a bogus answer -- this is very bad.

Is there an error state that will trigger an error or warning in these situations? Otherwise, I'd have to say that the default should be to test for NaNs, and either raise an error or return NaN. If that really does slow things down too much, there could be a flag that lets you turn it off.

This situation now makes me very nervous.

> because
> nanmin treats NaNs as zero, positive infinity as a really large
> positive number and negative infinity as a very small or negative
> number.

Actually, I think it skips over NaN -- otherwise, the min would always be zero if there were a NaN, and "a very small negative number" if there were a -inf.

I have to say that one of the things I always liked about Matlab was its handling of NaN, inf, and -inf.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT         (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception
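The default proposed in this message, test for NaN and raise unless the caller opts out, could be sketched like this. It is an editorial illustration only; `checked_min` and its `check_nan` flag are hypothetical and not part of NumPy.

```python
import numpy as np

def checked_min(a, check_nan=True):
    """Minimum of a; raises ValueError on NaN unless check_nan is False."""
    a = np.asarray(a, dtype=float)
    if check_nan and np.isnan(a).any():
        raise ValueError("input contains NaN; use a NaN-aware routine "
                         "or pass check_nan=False")
    # With check_nan=False the caller accepts whatever the plain min returns.
    return a.min()
```

The safe path costs one extra pass over the data, which is the slowdown the flag is meant to let speed-sensitive callers avoid.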
Re: [Numpy-discussion] min() of array containing NaN
I agree with using Masked arrays...

Actually this could be viewed as a bug because it ignores the entries to the left of the NaN.

>>> numpy.__version__
'1.1.1.dev5559'
>>> x = numpy.array([0,1,2,numpy.nan, 4, 5, 6])
>>> numpy.min(x)
4.0
>>> x = numpy.array([numpy.nan,0,1,2, 4, 5, 6])
>>> x.min()
0.0
>>> x = numpy.array([0,1,2, 4, 5, 6, numpy.nan])
>>> x.min()
-1.#IND

As has been recently said on this list (as per Stefan's post), NaNs and infinity have a higher computational cost. I am not sure of the relative cost of, say, using isnan first as a check versus having a NaN flag stored as part of the ndarray class.

As per Travis's post, technically it should return NaN. But I don't agree with Charles that it should automatically call nanmin, because nanmin treats NaNs as zero, positive infinity as a really large positive number and negative infinity as a very small or negative number. This may not be what the user wants.

An alternative is to change the signature to include a flag to include or exclude NaN and infinity, which would also remove the need for nanmin and friends.

Bruce

On Mon, Aug 11, 2008 at 6:41 PM, Pierre GM <[EMAIL PROTECTED]> wrote:
> *cough* MaskedArrays anyone ? *cough*
>
> The ideal would be for min/max to output a NaN when there's a NaN somewhere.
> That way, you'd know that there's a potential pb in your data, and that you
> should use the nanfunctions or masked arrays.
>
> is there a page on the wiki for that matter ? It seems to show up regularly...
>
> On Monday 11 August 2008 18:49:06 Stéfan van der Walt wrote:
>> 2008/8/11 Charles Doutriaux <[EMAIL PROTECTED]>:
>> > Seems to me like min should automagically call nanmin if it spots any
>> > nan no ?
>>
>> Nanmin is quite a bit slower:
>>
>> In [2]: x = np.random.random((5000))
>>
>> In [3]: timeit np.min(x)
>> 1 loops, best of 3: 24.8 µs per loop
>>
>> In [4]: timeit np.nanmin(x)
>> 1 loops, best of 3: 136 µs per loop
>>
>> So, I'm not sure if that will happen. One option is to use `nanmin`
>> by default, and to provide `min` for people who need the speed. The
>> fact that results with nan's are almost always unexpected is certainly
>> a valid concern.
>>
>> Cheers
>> Stéfan
Re: [Numpy-discussion] min() of array containing NaN
*cough* MaskedArrays anyone ? *cough*

The ideal would be for min/max to output a NaN when there's a NaN somewhere. That way, you'd know that there's a potential problem in your data, and that you should use the nanfunctions or masked arrays.

Is there a page on the wiki for that matter? It seems to show up regularly...

On Monday 11 August 2008 18:49:06 Stéfan van der Walt wrote:
> 2008/8/11 Charles Doutriaux <[EMAIL PROTECTED]>:
> > Seems to me like min should automagically call nanmin if it spots any
> > nan no ?
>
> Nanmin is quite a bit slower:
>
> In [2]: x = np.random.random((5000))
>
> In [3]: timeit np.min(x)
> 1 loops, best of 3: 24.8 µs per loop
>
> In [4]: timeit np.nanmin(x)
> 1 loops, best of 3: 136 µs per loop
>
> So, I'm not sure if that will happen. One option is to use `nanmin`
> by default, and to provide `min` for people who need the speed. The
> fact that results with nan's are almost always unexpected is certainly
> a valid concern.
>
> Cheers
> Stéfan
Re: [Numpy-discussion] min() of array containing NaN
2008/8/11 Charles Doutriaux <[EMAIL PROTECTED]>:
> Seems to me like min should automagically call nanmin if it spots any
> nan no ?

Nanmin is quite a bit slower:

In [2]: x = np.random.random((5000))

In [3]: timeit np.min(x)
1 loops, best of 3: 24.8 µs per loop

In [4]: timeit np.nanmin(x)
1 loops, best of 3: 136 µs per loop

So, I'm not sure if that will happen. One option is to use `nanmin` by default, and to provide `min` for people who need the speed. The fact that results with nan's are almost always unexpected is certainly a valid concern.

Cheers
Stéfan
Re: [Numpy-discussion] min() of array containing NaN
Seems to me like min should automagically call nanmin if it spots any nan, no?

C.

Fabrice Silva wrote:
> Try nanmin function :
>
> $ python
> Python 2.5.2 (r252:60911, Jul 31 2008, 07:39:27)
> [GCC 4.3.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy
> >>> numpy.__version__
> '1.1.0'
> >>> x = numpy.array([0,1,2,numpy.nan, 4, 5, 6])
> >>> x.min()
> 4.0
> >>> numpy.nanmin(x)
> 0.0
>
> There lacks some nanmin method for array instances, i.e. one can not execute
> >>> x.nanmin()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'numpy.ndarray' object has no attribute 'nanmin'
Re: [Numpy-discussion] min() of array containing NaN
Try the nanmin function:

$ python
Python 2.5.2 (r252:60911, Jul 31 2008, 07:39:27)
[GCC 4.3.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.__version__
'1.1.0'
>>> x = numpy.array([0,1,2,numpy.nan, 4, 5, 6])
>>> x.min()
4.0
>>> numpy.nanmin(x)
0.0

There is no nanmin method on array instances, though, i.e. one cannot execute

>>> x.nanmin()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'nanmin'

--
Fabrice Silva
LMA UPR CNRS 7051 - équipe S2M
Re: [Numpy-discussion] min() of array containing NaN
Thomas J. Duck wrote:
> Determining the minimum value of an array that contains NaN produces
> a surprising result:
>
> >>> x = numpy.array([0,1,2,numpy.nan,4,5,6])
> >>> x.min()
> 4.0
>
> I expected 0.0. Is this the intended behaviour or a bug? I am using
> numpy 1.1.1.

NaNs don't play well with comparisons because comparison with them is undefined. See numpy.nanmin.

-Travis
[Numpy-discussion] min() of array containing NaN
Determining the minimum value of an array that contains NaN produces a surprising result:

>>> x = numpy.array([0,1,2,numpy.nan,4,5,6])
>>> x.min()
4.0

I expected 0.0. Is this the intended behaviour or a bug? I am using numpy 1.1.1.

Thanks,
Tom

--
Thomas J. Duck <[EMAIL PROTECTED]>
Associate Professor,
Department of Physics and Atmospheric Science, Dalhousie University,
Halifax, Nova Scotia, Canada, B3H 3J5.
Tel: (902)494-1456 | Fax: (902)494-5191 | Lab: (902)494-3813
Web: http://aolab.phys.dal.ca/