Re: [Numpy-discussion] Medians that ignore values

2008-09-22 Thread Peter Saffrey
David Cournapeau writes:

> Unfortunately, we can't, because we would lose generality: we need to
> compute median on any axis, not only the last one. The proper solution
> would be to have a sort/max/min/etc... which knows about nan in numpy,
> which is what Chuck and I are working on ATM,
> 

Of course - thanks for looking at this.

Peter

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Medians that ignore values

2008-09-22 Thread David Cournapeau
Peter Saffrey wrote:
>
> I've found that if I just cut nans from the list and use regular numpy median,
> it is quicker - 10 times slower than list median, rather than 35 times slower.
> Could you just wire nanmedian to do it this way? 

Unfortunately, we can't, because we would lose generality: we need to
compute median on any axis, not only the last one. The proper solution
would be to have a sort/max/min/etc... which knows about nan in numpy,
which is what Chuck and I are working on ATM,
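
To illustrate the generality problem: each row or column of a 2-D array can hold a different number of NaNs, so cutting them out leaves a flat result that can no longer be reduced along a chosen axis. The sketch below assumes a NumPy release recent enough to provide np.nanmedian, which did not exist when this thread was written; the array contents are made up for illustration.

```python
import numpy as np

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

# Cutting NaNs with a boolean mask flattens the array: the axes are gone.
flat = a[~np.isnan(a)]
print(flat.shape)

# A NaN-aware reduction keeps the axis argument meaningful.
by_col = np.nanmedian(a, axis=0)
by_row = np.nanmedian(a, axis=1)
print(by_col, by_row)
```

The mask-and-median shortcut only works when each slice is handled separately, which is what the Python loop in the profiling script does.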

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-22 Thread Peter Saffrey
David Cournapeau writes:

> Still, it is indeed really slow for your case; when I fixed nanmean and
> co, I did not know much about numpy, I just wanted them to give the
> right answer :) I think this can be made faster, especially for your case
> (where the axis along which the median is computed is really small).
> 

I've found that if I just cut nans from the list and use regular numpy median,
it is quicker - 10 times slower than list median, rather than 35 times slower.
Could you just wire nanmedian to do it this way? The only difference is that on
an empty list, nanmedian gives nan, but median throws an IndexError.

Below is my profiling code with this change. Sample output:

$ ./arrayspeed3.py
list build time: 0.16
list median time: 0.08
array nanmedian time: 0.98

Peter

===

import numpy as np
from numpy.random import rand
from time import perf_counter


def my_median(vallist):
    """Median of a plain Python list (sorted in place); nan if empty."""
    num_vals = len(vallist)
    if num_vals == 0:
        return np.nan
    vallist.sort()
    if num_vals % 2 == 1:  # odd
        index = (num_vals - 1) // 2
        return vallist[index]
    else:  # even
        index = num_vals // 2
        return (vallist[index] + vallist[index - 1]) / 2


numtests = 100
testsize = 1000
pointlen = 3

t0 = perf_counter()
natests = rand(numtests, testsize, pointlen)
# have to start with inf because list.remove(nan) doesn't remove nan
natests[natests > 0.9] = np.inf
tests = natests.tolist()
natests[natests == np.inf] = np.nan
for test in tests:
    for point in test:
        while np.inf in point:
            point.remove(np.inf)
t1 = perf_counter()
print("list build time:", t1 - t0)

allmedians = []
t0 = perf_counter()
for test in tests:
    medians = [my_median(x) for x in test]
    allmedians.append(medians)
t1 = perf_counter()
print("list median time:", t1 - t0)

t0 = perf_counter()
namedians = []
for natest in natests:
    thismed = []
    for point in natest:
        # keep only the non-nan entries, then use the regular median
        maskpoint = point[~np.isnan(point)]
        if len(maskpoint) > 0:
            med = np.median(maskpoint)
        else:
            med = np.nan
        thismed.append(med)
    namedians.append(thismed)
t1 = perf_counter()
print("array nanmedian time:", t1 - t0)






Re: [Numpy-discussion] Medians that ignore values

2008-09-21 Thread David Cournapeau
David Cournapeau wrote:
>
> The isnan thing is surprising, because the whole point of having an isnan
> is that you can do it without branching. I checked, and numpy does use
> the macro of isnan, not the function (glibc has both).

Ok, see my patch #913 for this. The slowdown is actually specific to one
tested machine (my P4). On my MacBook (running Mac OS X) and another
Linux machine running a Core 2 Duo, the performance is the same before
and after the patch. I have not tested on Windows, though.

I also saw this mentioned:

http://projects.scipy.org/scipy/numpy/ticket/241

Where Travis made the same argument as me concerning NaN. It seems that
the slowdowns are not so significant, at least on the dataset I tested
(isnan is actually quite fast on my core 2 duo: 10 cycles / double for
large arrays on average, compared to the 60 / double on my P4 for the
exact same binary).

Travis, if you are reading this, would you reconsider your position on
nan handling for min/max/co if we can keep reasonable speed ?

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-21 Thread Charles R Harris
On Sun, Sep 21, 2008 at 12:56 AM, David Cournapeau <
[EMAIL PROTECTED]> wrote:

> David Cournapeau wrote:
> > Anne Archibald wrote:
> >> If users are concerned about performance, it's worth noting that on
> >> some machines nans force a fallback to software floating-point
> >> handling, with a corresponding very large performance hit. This
> >> includes some but not all x86 (and I think x86-64) CPUs. How this
> >> compares to the performance of masked arrays is not clear.
> >
> > I spent some time on this. In particular, for max/min, I did the
> > following for the core loop (always return nan if nan is in the array):
> >
> >  /* nan + x and x + nan are nan, where x can be anything:
> > normal,
> >   * denormal, nan, infinite
> > */
> >   tmp = *((@typ@ *)i1) + *((@typ@
> > *)i2);
> >   if(isnan(tmp))
> > {
> > *((@typ@ *)op) =
> > tmp;
> >   } else
> > {
> > *((@typ@ *)op)=*((@typ@ *)i1) @OP@ *((@typ@ *)i2) ? *((@typ@
> > *)i1) : *((@typ@ *)i2);
> >   }
>
> Grr, sorry for the mangling:
>
> /* nan + x and x + nan are nan, where x can be anything: normal,
>  * denormal, nan, infinite */
> tmp = *((@typ@ *)i1) + *((@typ@ *)i2);
> if(isnan(tmp)) {
>     *((@typ@ *)op) = tmp;
> } else {
>     *((@typ@ *)op) = *((@typ@ *)i1) @OP@ *((@typ@ *)i2) ? *((@typ@ *)i1) :
>         *((@typ@ *)i2);
> }
>

You can use type instead of typ so the code is a bit easier to read. It's
one of the changes I've made.

Chuck


Re: [Numpy-discussion] Medians that ignore values

2008-09-21 Thread David Cournapeau
David Cournapeau wrote:
> Anne Archibald wrote:
>> If users are concerned about performance, it's worth noting that on
>> some machines nans force a fallback to software floating-point
>> handling, with a corresponding very large performance hit. This
>> includes some but not all x86 (and I think x86-64) CPUs. How this
>> compares to the performance of masked arrays is not clear.
>
> I spent some time on this. In particular, for max/min, I did the
> following for the core loop (always return nan if nan is in the array):
>
>  /* nan + x and x + nan are nan, where x can be anything:
> normal,
>   * denormal, nan, infinite
> */   
>   tmp = *((@typ@ *)i1) + *((@typ@
> *)i2);  
>   if(isnan(tmp))
> {
> *((@typ@ *)op) =
> tmp;   
>   } else
> {
> *((@typ@ *)op)=*((@typ@ *)i1) @OP@ *((@typ@ *)i2) ? *((@typ@
> *)i1) : *((@typ@ *)i2);
>   }

Grr, sorry for the mangling:

/* nan + x and x + nan are nan, where x can be anything: normal,
 * denormal, nan, infinite */
tmp = *((@typ@ *)i1) + *((@typ@ *)i2);
if(isnan(tmp)) {
    *((@typ@ *)op) = tmp;
} else {
    *((@typ@ *)op) = *((@typ@ *)i1) @OP@ *((@typ@ *)i2) ? *((@typ@ *)i1) : *((@typ@ *)i2);
}
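
The core loop above can be mimicked at the array level. This is only a rough Python sketch of the same trick, not the umath template itself: the name propagating_max is hypothetical, and np.where stands in for the per-element branch. It also shows one caveat of detecting NaN through the sum: inf + (-inf) is NaN as well, even though neither operand is.

```python
import numpy as np

def propagating_max(a, b):
    """Elementwise max that yields NaN wherever either input is NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    with np.errstate(invalid="ignore"):
        tmp = a + b  # NaN whenever a or b is NaN
        return np.where(np.isnan(tmp), tmp, np.where(a > b, a, b))

x = np.array([1.0, np.nan, 3.0])
y = np.array([2.0, 0.0, np.nan])
result = propagating_max(x, y)
print(result)

# Caveat: the sum is also NaN for inf + (-inf), so the trick reports NaN
# here although the true maximum is inf.
edge = propagating_max(np.array([np.inf]), np.array([-np.inf]))
print(edge, np.maximum(np.inf, -np.inf))
```

Testing isnan on each operand separately avoids the infinity corner case, at the price of a second check per element.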

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-21 Thread David Cournapeau
Anne Archibald wrote:
>
> If users are concerned about performance, it's worth noting that on
> some machines nans force a fallback to software floating-point
> handling, with a corresponding very large performance hit. This
> includes some but not all x86 (and I think x86-64) CPUs. How this
> compares to the performance of masked arrays is not clear.

I spent some time on this. In particular, for max/min, I did the
following for the core loop (always return nan if nan is in the array):

/* nan + x and x + nan are nan, where x can be anything: normal,
 * denormal, nan, infinite */
tmp = *((@typ@ *)i1) + *((@typ@ *)i2);
if(isnan(tmp)) {
    *((@typ@ *)op) = tmp;
} else {
    *((@typ@ *)op) = *((@typ@ *)i1) @OP@ *((@typ@ *)i2) ? *((@typ@ *)i1) : *((@typ@ *)i2);
}

For large arrays (on my CPU, it is around 1 items), the function is
3x slower than the original one. I think the main cost is the isnan. 3x
is quite expensive, so I tested isnan a bit on Linux, and it is
surprisingly slow. If I use my own trivial #define isnan(x) ((x) !=
(x)), it is twice as fast as the glibc isnan, and then max/min are as
fast as before, except they are working :)

The isnan thing is surprising, because the whole point of having an isnan
is that you can do it without branching. I checked, and numpy does use
the macro of isnan, not the function (glibc has both).
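
The self-comparison trick relies on NaN being the only IEEE 754 value that compares unequal to itself. A small Python sketch of the same idea, checked against np.isnan for three float widths (the helper name isnan_by_comparison is made up for this example):

```python
import numpy as np

def isnan_by_comparison(x):
    """NaN is the only floating-point value that is not equal to itself."""
    x = np.asarray(x)
    return x != x

checks = {}
for dtype in (np.float32, np.float64, np.longdouble):
    values = np.array([0.0, 1.5, np.inf, -np.inf, np.nan], dtype=dtype)
    checks[np.dtype(dtype).name] = bool(
        np.array_equal(isnan_by_comparison(values), np.isnan(values))
    )
print(checks)
```

In C the comparison only survives optimisation if the compiler honours IEEE semantics, so aggressive floating-point flags would be one thing to check before relying on it.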

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-20 Thread Jake Harris
On Sat, Sep 20, 2008 at 11:02 AM, Jake Harris <[EMAIL PROTECTED]>wrote:

>
> Because you're always working with probabilities, there is almost always no
> ambiguity...whenever NaN is encountered, 0 is what is desired.
>

...of course, division presents a good counterexample.


> Bad idea?
>


So probably.


Re: [Numpy-discussion] Medians that ignore values

2008-09-20 Thread Jake Harris
(sorry for starting a new thread...I wasn't subscribed yet)

Stéfan van der Walt wrote the following on 09/19/2008 02:10 AM:
>
> So am I.  In all my use cases, NaNs indicate trouble.
>

I can provide a use case where NaNs do not indicate trouble.  In fact, they
need to be treated as 0.  For example,

As x->0 in y(x) = x log x, it is traditional (eg in information theory) to
take y(0) = 0.  So if one is multiplying arrays and 0 * -inf  is
encountered, the desirable behavior is that we get 0.   Because you're
always working with probabilities, there is almost always no
ambiguity...whenever NaN is encountered, 0 is what is desired.

Perhaps numpy can have some method by which a user can specify how NaN is
treated (in addition to ignore, raise, etc). Good idea? Bad idea?
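
For the x log x case specifically, the convention y(0) = 0 can be applied without any global NaN policy by never evaluating log(0) in the first place. A minimal sketch, where the helper name xlogx is made up for illustration:

```python
import numpy as np

def xlogx(p):
    """p * log(p) with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0, so zeros contribute 0
    return p * np.log(safe)

p = np.array([0.5, 0.5, 0.0])

# The naive product hits 0 * -inf and produces a NaN in the last slot.
with np.errstate(divide="ignore", invalid="ignore"):
    naive = p * np.log(p)

entropy = -np.sum(xlogx(p))
print(naive, entropy)
```

This keeps the choice local to the formula where the convention is known to hold, which sidesteps the division counterexample raised in the follow-up.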


Re: [Numpy-discussion] Medians that ignore values

2008-09-20 Thread David Cournapeau
Charles R Harris wrote:
>
>
>
> I would be happy to implement nan sorts if someone can provide me with
> a portable and easy way to detect nans for single, double, and long
> double floats. And not have it fail if the architecture doesn't
> support nans. I think getting all the needed nan detection and setup
> in place is the first step for anything else.

I guess you mean when isnan is available but broken, since we do not
support platforms without IEEE 754 support ? I want to take care of this
for my umathmodule cleaning (all the configuration checks/replacements
are in place; if we want to be paranoid, we could check whether isnan
works for all types if found on the system).

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Anne Archibald
2008/9/19 Eric Firing <[EMAIL PROTECTED]>:
> Pierre GM wrote:
>
>>> It seems to me that there are pragmatic reasons
>>> why people work with NaNs for missing values,
>>> that perhaps shd not be dismissed so quickly.
>>> But maybe I am overlooking a simple solution.
>>
>> nansomething solutions tend to be considerably faster, that might be one
>> reason. A lack of visibility of numpy.ma could be a second. In any case, I
>> can't but agree with other posters: a NaN in an array usually means something
>> went astray.
>
> Additional reasons for using nans:
>
> 1) years of experience with Matlab, in which using nan for missing
> values is the standard idiom.

Users are already retraining to use zero-based indexing; I don't think
asking them to use a full-featured masked array package is an
unreasonable retraining burden, particularly since this idiom breaks
as soon as they want to work with arrays of integers or records.

> 2) convenient interfacing with extension code in C or C++.
>
> The latter is a factor in the present use of nan in matplotlib; using
> nan for missing values in an array passed into extension code saves
> having to pass and process a second (mask) array.  It is fast and simple.

How hard is it to pass an array where the masked values have been
filled with nans? It's certainly easy to go the other way (mask all
nans). I think this is less painful than supporting two
differently-featured sets of functions for dealing with arrays
containing some invalid values.
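
Both directions of that conversion are short with numpy.ma. A sketch, assuming the data are floats (the array contents are invented):

```python
import numpy as np

raw = np.array([1.0, np.nan, 3.0, np.nan])

# nan -> mask: masked_invalid masks NaN (and also infinities).
masked = np.ma.masked_invalid(raw)
print(masked.count(), masked.mean())

# mask -> nan: fill masked slots just before handing a plain float
# array to extension code that expects NaN as the missing marker.
back = masked.filled(np.nan)
print(back)
```

Because masked_invalid also masks infinities, a round trip through it is not the identity for arrays that contain inf.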

Anne


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Robert Kern
On Sat, Sep 20, 2008 at 01:15, Charles R Harris
<[EMAIL PROTECTED]> wrote:

> I would be happy to implement nan sorts if someone can provide me with a
> portable and easy way to detect nans for single, double, and long double
> floats. And not have it fail if the architecture doesn't support nans. I
> think getting all the needed nan detection and setup in place is the first
> step for anything else.

We explicitly only support IEEE-754 architectures, so we are always on
an architecture that supports NaNs.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Charles R Harris
On Fri, Sep 19, 2008 at 11:41 PM, David Cournapeau <
[EMAIL PROTECTED]> wrote:

> Anne Archibald wrote:
> >
> > I, on the other hand, was making specifically that suggestion: users
> > should not use nans to indicate missing values. Users should use
> > masked arrays to indicate missing values.
>
> I agree it is the nicest solution in theory, but I think it is
> impractical (as mentioned by Eric Firing in his email).
>
> >
> > This part I pretty much agree with.
>
> I can't really see which one is better (failing or returning NaN for
> sort/min/max and their sort counterpart), or if we should let the choice
> be left to the user. I am fine with both, and they both require the same
> amount of work.
>
> >  Or we can make them behave drastically differently.
> > Masked arrays clearly need to be able to handle masked values flexibly
> > and explicitly. So I think nans should be handled simply and
> > conservatively: propagate them if possible, raise if not.
>
> I agree about this behavior being the default. I just think that for a
> couple of functions, we could give either separate functions, or
> additional arguments to existing functions to ignore them: I am thinking
> about min/max/sort and their arg* counterpart, because those are really
> basic, and because we already have nanmean/nanstd/nanmedian (e.g. having
> a nansort would help for nanmean to be much faster).
>

I would be happy to implement nan sorts if someone can provide me with a
portable and easy way to detect nans for single, double, and long double
floats. And not have it fail if the architecture doesn't support nans. I
think getting all the needed nan detection and setup in place is the first
step for anything else.

Chuck


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Anne Archibald wrote:
>
> I, on the other hand, was making specifically that suggestion: users
> should not use nans to indicate missing values. Users should use
> masked arrays to indicate missing values.

I agree it is the nicest solution in theory, but I think it is
impractical (as mentioned by Eric Firing in his email).

>
> This part I pretty much agree with.

I can't really see which one is better (failing or returning NaN for
sort/min/max and their sort counterpart), or if we should let the choice
be left to the user. I am fine with both, and they both require the same
amount of work.

>  Or we can make them behave drastically differently.
> Masked arrays clearly need to be able to handle masked values flexibly
> and explicitly. So I think nans should be handled simply and
> conservatively: propagate them if possible, raise if not.

I agree about this behavior being the default. I just think that for a
couple of functions, we could give either separate functions, or
additional arguments to existing functions to ignore them: I am thinking
about min/max/sort and their arg* counterpart, because those are really
basic, and because we already have nanmean/nanstd/nanmedian (e.g. having
a nansort would help for nanmean to be much faster).
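
As a rough illustration of how a NaN-aware sort would help a NaN-ignoring median on any axis: if the sort pushes NaNs to the end of each slice, the median is just a lookup at positions derived from the count of valid entries. The sketch assumes a later NumPy in which np.sort places NaNs last and np.take_along_axis exists; the function name nanmedian_via_sort is hypothetical.

```python
import numpy as np

def nanmedian_via_sort(a, axis=-1):
    """Median ignoring NaNs, built on a sort that places NaNs last."""
    a = np.moveaxis(np.asarray(a, dtype=float), axis, -1)
    s = np.sort(a, axis=-1)                  # NaNs end up at the tail
    n = np.sum(~np.isnan(s), axis=-1)        # valid entries per slice
    lo = np.asarray(np.maximum((n - 1) // 2, 0))[..., None]
    hi = np.asarray(np.maximum(n // 2, 0))[..., None]
    lo_v = np.take_along_axis(s, lo, axis=-1)[..., 0]
    hi_v = np.take_along_axis(s, hi, axis=-1)[..., 0]
    # Slices with no valid entry yield NaN.
    return np.where(n > 0, 0.5 * (lo_v + hi_v), np.nan)

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [np.nan, np.nan, np.nan]])
rows = nanmedian_via_sort(a, axis=1)
cols = nanmedian_via_sort(a, axis=0)
print(rows, cols)
```

The whole computation stays vectorised, so no Python-level loop over slices is needed.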

>
> If users are concerned about performance, it's worth noting that on
> some machines nans force a fallback to software floating-point
> handling, with a corresponding very large performance hit.

I was more concerned with the cost of checking for NaN explicitly (in
everything involving comparison) even when your array contains no NaN.
But I don't see any obvious way to avoid that cost.

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Anne Archibald
2008/9/19 David Cournapeau <[EMAIL PROTECTED]>:

> I guess my formulation was poor: I never use NaN as missing values
> because I never use missing values, which is why I wanted the opinion of
> people who use NaN in a different manner (because I don't have a good
> idea on how those people would like to see numpy behave). I was
> certainly not arguing they should not be used for the purpose of missing
> values.

I, on the other hand, was making specifically that suggestion: users
should not use nans to indicate missing values. Users should use
masked arrays to indicate missing values.

> The problem with NaN is that you cannot mix the missing value behavior
> and the error behavior. Dealing with them in a consistent manner is
> difficult. Because numpy is a general numerical computation tool, I
> think that NaN should be propagated and never ignored *by default*. If
> you have NaN because of divide by 0, etc... it should not be ignored at
> all. But if you want to ignore it, then numpy should make it possible:
>
>- max, min: should return NaN if NaN is in the array, or maybe even
> fail ?
>- argmax, argmin ?
>- sort: should fail ?
>- mean, std, variance: should return Nan
>- median: should fail (to be consistent if sort fails) ? Should
> return NaN ?

This part I pretty much agree with.

> We could then add an argument to failing functions to tell them either
> to ignore NaN/put them at some special location (like R does, for
> example). The ones I am not sure about are median and argmax/argmin. For
> median, failing when sort does is consistent; but this can break a lot
> of code. For argmin/argmax, failing is the most logical, but OTOH,
> making argmin/argmax failing and not max/min is not consistent either.
> Breaking the code is maybe not that bad because currently, neither
> max/min nor argmax/argmin nor sort returns a meaningful result.
> Does that sound reasonable to you ?

The problem with this approach is that all those decisions need to be
made and all that code needs to be implemented for masked arrays. In
fact I suspect that it has already been done in that case. So really
what you are suggesting here is that we duplicate all this effort to
implement the same functions for nans as we have for masked arrays.
It's important, too, that the masked array implementation and the nan
implementation behave the same way, or users will become badly
confused. Who gets the task of keeping the two implementations in
sync?

The current situation is that numpy has two ways to indicate bad data
for floating-point arrays: nans and masked arrays. We can't get rid of
either: nans appear on their own, and masked arrays are the only way
to mark bad data in non-floating-point arrays. We can try to make them
behave the same, which will be a lot of work to provide redundant
capabilities. Or we can make them behave drastically differently.
Masked arrays clearly need to be able to handle masked values flexibly
and explicitly. So I think nans should be handled simply and
conservatively: propagate them if possible, raise if not.
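
The point about non-floating-point arrays can be seen directly: an integer array has no slot for NaN, while a masked array marks the bad entry without changing the dtype. A short sketch with made-up data:

```python
import numpy as np

counts = np.array([3, 7, 5])

# Integer arrays cannot hold NaN at all.
nan_rejected = False
try:
    counts[1] = np.nan
except ValueError as exc:
    nan_rejected = True
    print("cannot store nan:", exc)

# A mask marks the bad entry and keeps the integer dtype.
masked_counts = np.ma.array([3, 7, 5], mask=[False, True, False])
print(masked_counts.dtype, masked_counts.mean(), masked_counts.max())
```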

If users are concerned about performance, it's worth noting that on
some machines nans force a fallback to software floating-point
handling, with a corresponding very large performance hit. This
includes some but not all x86 (and I think x86-64) CPUs. How this
compares to the performance of masked arrays is not clear.

Anne


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Robert Kern wrote:
> On Fri, Sep 19, 2008 at 22:25, David Cournapeau
> <[EMAIL PROTECTED]> wrote:
>   
>
> How, exactly? ndarray.min() is where the implementation is.
>   

Ah, I keep forgetting those are implemented in the array object, sorry
for that. Now I understand Stéfan's point. Do I understand correctly that
we should then do:
- implement a NaN-aware min/max for every float type (real and
complex) in umathmodule.c, which ignores nan (called @[EMAIL PROTECTED], etc...)
- fix the current min/max to propagate NaN instead of giving broken
results
- How to do the dispatching? Having PyArray_Min and PyArray_NanMin
sounds the easiest (we don't change any C API, only add an argument to
the Python-callable function min, in the array_min method?)

Or am I missing something ? If this is the right way to fix it I am
willing to do it (we still have to agree on the default behavior first).
I am not really familiar with the sort module, but maybe it is really
similar to the min/max case.

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Alan G Isaac wrote:
> On 9/19/2008 4:35 AM David Cournapeau apparently wrote:
>> I never use NaN as missing value
>
> What do you use?
>
> Recently I needed to fill a 2d array with values
> from computations that could "go wrong".
> I created an array of NaN and then replaced
> the elements where the computation produced
> a useful value.  I then applied ``nanmax``,
> to get the maximum of the useful values.
>
> What should I have done?

I guess my formulation was poor: I never use NaN as missing values
because I never use missing values, which is why I wanted the opinion of
people who use NaN in a different manner (because I don't have a good
idea on how those people would like to see numpy behave). I was
certainly not arguing they should not be used for the purpose of missing
values.

The problem with NaN is that you cannot mix the missing value behavior
and the error behavior. Dealing with them in a consistent manner is
difficult. Because numpy is a general numerical computation tool, I
think that NaN should be propagated and never ignored *by default*. If
you have NaN because of divide by 0, etc... it should not be ignored at
all. But if you want to ignore it, then numpy should make it possible:

- max, min: should return NaN if NaN is in the array, or maybe even
fail ?
- argmax, argmin ?
- sort: should fail ?
- mean, std, variance: should return Nan
- median: should fail (to be consistent if sort fails) ? Should
return NaN ?
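
For comparison, here is a quick check of how later NumPy releases treat these functions. It says nothing about the 2008 releases under discussion, where min/max could return misleading values, and it only exercises functions whose behaviour is documented.

```python
import numpy as np

a = np.array([2.0, np.nan, 1.0])

# Default reductions propagate the NaN.
plain_max = np.max(a)
plain_mean = np.mean(a)

# The nan-prefixed counterparts ignore it.
ignoring_max = np.nanmax(a)
ignoring_mean = np.nanmean(a)

# sort does not fail; NaN is placed at the end.
ordered = np.sort(a)

print(plain_max, plain_mean, ignoring_max, ignoring_mean, ordered)
```

This matches the "propagate by default, ignore on request" split proposed in the list above, with the ignoring variants exposed as separate functions rather than an extra argument.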

We could then add an argument to failing functions to tell them either
to ignore NaN/put them at some special location (like R does, for
example). The ones I am not sure about are median and argmax/argmin. For
median, failing when sort does is consistent; but this can break a lot
of code. For argmin/argmax, failing is the most logical, but OTOH,
making argmin/argmax failing and not max/min is not consistent either.
Breaking the code is maybe not that bad because currently, neither
max/min nor argmax/argmin nor sort returns a meaningful result.
Does that sound reasonable to you ?

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Robert Kern
On Fri, Sep 19, 2008 at 22:25, David Cournapeau
<[EMAIL PROTECTED]> wrote:
> Stéfan van der Walt wrote:
>>
>> Why shouldn't we have "nanmin"-like behaviour for the C min itself?
>>
>
> Ah, I was not arguing we should not do it in C, but rather we did not
> have to do it in C. The current behavior for nan with functions relying on
> ordering is broken; if someone prefers fixing it in C, great. But I was
> guessing more people could fix it using python, that's all.

How, exactly? ndarray.min() is where the implementation is.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Stéfan van der Walt wrote:
>
> Why shouldn't we have "nanmin"-like behaviour for the C min itself?
>   

Ah, I was not arguing we should not do it in C, but rather we did not
have to do it in C. The current behavior for nan with functions relying on
ordering is broken; if someone prefers fixing it in C, great. But I was
guessing more people could fix it using python, that's all.

I opened a bug for min/max and nan, this should be fixed for 1.3.0,
maybe 1.2.1 too.

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 17:25:53 Alan G Isaac wrote:
> On 9/19/2008 4:54 PM Pierre GM apparently wrote:
> > Another way is
> > ma.array(np.empty(yourshape,yourdtype), mask=True)
> > which should work with earlier versions.
>
> Seems like ``mask`` would be a natural
> keyword for ``ma.empty``?

Not a bad idea. I'll plug that in.


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 4:54 PM Pierre GM apparently wrote:
> Another way is 
> ma.array(np.empty(yourshape,yourdtype), mask=True)
> which should work with earlier versions.

Seems like ``mask`` would be a natural
keyword for ``ma.empty``?

Thanks,
Alan Isaac



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 16:35:23 Alan G Isaac wrote:
> On 9/19/2008 4:54 AM Pierre GM apparently wrote:
> > I know. I was more dreading the time when MaskedArrays would have to be
> > ported to C. In a way, that would probably simplify a few issues. OTOH, I
> > don't really see it happening any time soon.
>
> Is this possibly a GSoC sized project?
> Alan Isaac

If we can find someone who knows C and masked arrays well, that could be.




Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 16:28:34 Alan G Isaac wrote:
> On 9/19/2008 11:46 AM Pierre GM apparently wrote:
>  a.mask=True

> This is great, but is apparently
> new behavior as of NumPy 1.2?

I'm not sure, sorry. Another way is 
ma.array(np.empty(yourshape,yourdtype), mask=True)
which should work with earlier versions.
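
Applied to the "fill in results that may go wrong" use case from earlier in the thread, the fully masked array might be used as follows. This is a sketch under the default soft-mask behaviour, where assigning to an element unmasks it; the shape and values are invented.

```python
import numpy as np

# Start with every slot masked, i.e. "no result yet".
results = np.ma.array(np.empty((2, 3), dtype=float), mask=True)

# Only computations that succeed write a value, which unmasks that slot.
results[0, 1] = 5.0
results[1, 2] = 2.5

print(results.count(), results.max())
```

The reduction then covers only the slots that were filled in, which is the role nanmax played in the NaN-based version.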


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 4:54 AM Pierre GM apparently wrote:
> I know. I was more dreading the time when MaskedArrays would have to be 
> ported 
> to C. In a way, that would probably simplify a few issues. OTOH, I don't 
> really see it happening any time soon.

Is this possibly a GSoC sized project?
Alan Isaac



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 11:46 AM Pierre GM apparently wrote:
 a.mask=True

This is great, but is apparently
new behavior as of NumPy 1.2?
Alan Isaac




Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 1:58 PM Robert Kern apparently wrote:
> there are no objects inside non-object arrays. There is
> nothing with identity inside the arrays to compare against.

Got it.
Thanks.
Alan



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 14:01:13 Eric Firing wrote:
> Pierre GM wrote:

> 2) convenient interfacing with extension code in C or C++.
>
> The latter is a factor in the present use of nan in matplotlib; using
> nan for missing values in an array passed into extension code saves
> having to pass and process a second (mask) array.  It is fast and simple.

As long as you deal with floats.



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Eric Firing
Pierre GM wrote:

>> It seems to me that there are pragmatic reasons
>> why people work with NaNs for missing values,
>> that perhaps shd not be dismissed so quickly.
>> But maybe I am overlooking a simple solution.
> 
> nansomething solutions tend to be considerably faster, that might be one 
> reason. A lack of visibility of numpy.ma could be a second. In any case, I 
> can't but agree with other posters: a NaN in an array usually means something 
> went astray.

Additional reasons for using nans:

1) years of experience with Matlab, in which using nan for missing 
values is the standard idiom.
2) convenient interfacing with extension code in C or C++.

The latter is a factor in the present use of nan in matplotlib; using 
nan for missing values in an array passed into extension code saves 
having to pass and process a second (mask) array.  It is fast and simple.

Eric


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Robert Kern
On Fri, Sep 19, 2008 at 11:34, Alan G Isaac <[EMAIL PROTECTED]> wrote:
> On 9/19/2008 12:02 PM Peter Saffrey apparently wrote:
>> >>> a = array([1,2,nan])
>> >>> nan in a
>> False
>
> Huh.  I'm inclined to call this a bug,
> since normal Python behavior is that
> ``in`` should check for identity::
>
>>>> xl = [1.,np.nan]
>>>> np.nan in xl
>True

Except that there are no objects inside non-object arrays. There is
nothing with identity inside the arrays to compare against.
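
The difference between the two containers can be checked directly. This is a minimal sketch, assuming a recent NumPy under Python 3; the variable names are only illustrative. A list holds references to objects, so ``in`` finds ``np.nan`` by identity before it ever tries ``==``. A float array stores raw doubles, so ``in`` reduces to an elementwise ``==``, which is never true for nan.

```python
import numpy as np

xl = [1.0, np.nan]                 # list of Python objects: `in` checks identity first
a = np.array([1.0, 2.0, np.nan])   # float64 array: raw doubles, no object identity

in_list = np.nan in xl             # the very same object is stored in the list
in_array = np.nan in a             # falls back to ==, and nan != nan
has_nan = bool(np.isnan(a).any())  # explicit nan test for an array

print(in_list, in_array, has_nan)
```

For arrays, ``np.isnan(a).any()`` is the test that does not depend on identity or equality.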

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
 -- Umberto Eco


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 12:02 PM Peter Saffrey apparently wrote:
> >>> a = array([1,2,nan])
> >>> nan in a
> False

Huh.  I'm inclined to call this a bug,
since normal Python behavior is that
``in`` should check for identity::

>>> xl = [1.,np.nan]
>>> np.nan in xl
True

Alan



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Charles R Harris
On Fri, Sep 19, 2008 at 1:11 AM, David Cournapeau <
[EMAIL PROTECTED]> wrote:

> Anne Archibald wrote:
> >
> > Well, for example, you might ask that all the non-nan elements be in
> > order, even if you don't specify where the nan goes.
>
>
> Ah, there are two problems, then:
>- sort
>- how median use sort.
>
> For sort, I don't know how sort speed would be influenced by treating
> nan. In a way, calling sort with nan inside is a user error (if you take
> the POV nan are not comparable), but nan are used for all kind of
> purpose,


used <- misused. Using nan to flag anything but a numerical error is going
to cause problems. It wouldn't be too hard to implement nansorts, they just
need a real comparison function so that all the nans end up at one end or the
other. I don't know that that would make medians any easier, though. Are the
nans part of the data set? A nansearchsorted would probably be needed also.
If this functionality is added, the best way might be something like
kind='nanquicksort'.

Chuck
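
A nan-aware sort of the kind described above can be sketched in pure NumPy by sorting the comparable values and collecting the nans at one end. The helper name ``nansort_1d`` is hypothetical and is not a NumPy function; it only illustrates the idea of a sort in which every nan lands at the same end.

```python
import numpy as np

def nansort_1d(values):
    """Hypothetical helper: sort a 1-D float array with all nans placed last."""
    values = np.asarray(values, dtype=float)
    is_nan = np.isnan(values)
    ordered = np.sort(values[~is_nan])                 # ordinary sort on comparable values
    return np.concatenate([ordered, values[is_nan]])   # nans collected at the end

result = nansort_1d([3.0, np.nan, 1.0, np.nan, 2.0])
print(result)
```

With the nans at a known end, a median only needs the count of non-nan entries, which is ``(~np.isnan(values)).sum()``.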


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 11:46 AM Pierre GM apparently wrote:
> No, but you may do the opposite: just start with an array completely masked, 
> and unmasked it as you need:

Very useful example.
I did not understand this possibility.
Alan




Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 11:46 AM Pierre GM apparently wrote:
> You can't compare NaNs to anything. How do you know this np.miss is a masked 
> value, when np.sqrt(-1.) is NaN ?

I thought you could use ``is``.
E.g.,
 >>> np.nan == np.nan
False
 >>> np.nan is np.nan
True

Alan
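
The ``is`` test only succeeds when both sides are the one ``np.nan`` object. A small sketch, assuming a recent NumPy, shows that a nan read back from an array or created elsewhere is a different object, so identity cannot serve as a general nan check:

```python
import numpy as np

a = np.array([1.0, np.nan])
from_array = a[1]              # a new NumPy scalar, not the np.nan object
from_float = float("nan")      # a separate Python float object

print(np.nan is np.nan)                              # the module attribute against itself
print(from_array is np.nan, from_float is np.nan)    # identity checks on other nans
print(np.isnan(from_array), np.isnan(from_float))    # value-based checks
```

``np.isnan`` looks at the value rather than the object, so it works for every nan regardless of where it came from.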



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 12:02:08 Peter Saffrey wrote:
> Alan G Isaac  american.edu> writes:
> > Recently I needed to fill a 2d array with values
> > from computations that could "go wrong".

> Should I take the earlier advice and switch to masked arrays?
>
> Peter

Yes. As you've noticed, you can't compare nans (after all, nans are not 
numbers...), which limits their use.


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Peter Saffrey
Alan G Isaac  american.edu> writes:

> Recently I needed to fill a 2d array with values
> from computations that could "go wrong".
> I created an array of NaN and then replaced
> the elements where the computation produced
> a useful value.  I then applied ``nanmax``,
> to get the maximum of the useful values.
> 

I'm glad you posted this, because this is exactly the method I'm using. How do
you detect whether there are still any missing spots in your array? nan has some
rather unfortunate properties:

>>> from numpy import *
>>> a = array([1,2,nan])
>>> nan in a
False
>>> nan == nan
False

Should I take the earlier advice and switch to masked arrays?

Peter
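
One way to find the spots that are still nan is ``np.isnan``. A minimal sketch, assuming a recent NumPy; the array shape and the filled values are made up for illustration:

```python
import numpy as np

a = np.full((2, 3), np.nan)    # start with every slot "missing"
a[0, 0] = 1.0
a[1, 2] = 2.0

missing = np.isnan(a)                  # boolean array, True where the value is nan
any_missing = bool(missing.any())
n_missing = int(missing.sum())
where_missing = np.argwhere(missing)   # (row, col) index pairs in row-major order

print(any_missing, n_missing)
print(where_missing)
```

Note that this cannot tell a slot that was never filled from one where the computation itself produced nan.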



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 11:36:17 Alan G Isaac wrote:
> On 9/19/2008 11:09 AM Stefan Van der Walt apparently wrote:
> > Masked arrays.  Using NaN's for missing values is dangerous.  You may
> > do some operation, which generates invalid results, and then you have
> > a mixed bag of missing and invalid values.
>
> That rather evades my full question, I think?
>
> In the case I mentioned,
> I am filling an array inside a loop,
> and the possible fill values are not constrained.
> So I cannot mask based on value,
> and I cannot mask based on position
> (at least until after the computations are complete).

No, but you may do the opposite: just start with a completely masked array
and unmask elements as you need them.
Say you have a 4x5 array and want to unmask (0,0), (1,2), (3,4):
>>> import numpy.ma as ma
>>> a = ma.empty((4,5), dtype=float)
>>> a.mask = True
>>> a[0,0] = 0
>>> a[1,2] = 1
>>> a[3,4] = 3
>>> a
masked_array(data =
 [[0.0 -- -- -- --]
 [-- -- 1.0 -- --]
 [-- -- -- -- --]
 [-- -- -- -- 3.0]],
  mask =
 [[False  True  True  True  True]
 [ True  True False  True  True]
 [ True  True  True  True  True]
 [ True  True  True  True False]],
  fill_value=1e+20)
>>> a.max(axis=0)
masked_array(data = [0.0 -- 1.0 -- 3.0],
  mask = [False  True False  True False],
  fill_value=1e+20)
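
The same pattern can also be written with ``numpy.ma.masked_all``, which returns an array with every element masked. This is a sketch, assuming a NumPy release that provides that function; the ``results`` dictionary is made-up data standing in for the values a loop would compute.

```python
import numpy as np

a = np.ma.masked_all((4, 5), dtype=float)   # every element starts masked

# Made-up results: assigning a value unmasks that element.
results = {(0, 0): 0.0, (1, 2): 1.0, (3, 4): 3.0}
for index, value in results.items():
    a[index] = value

col_max = a.max(axis=0)       # masked entries are ignored column by column
n_filled = a.count()          # number of unmasked elements
print(col_max, n_filled)
```

Columns that were never filled stay masked in ``col_max``, so "no value yet" remains visible in the result.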


> It seems to me that there are pragmatic reasons
> why people work with NaNs for missing values,
> that perhaps shd not be dismissed so quickly.
> But maybe I am overlooking a simple solution.

nansomething solutions tend to be considerably faster, that might be one 
reason. A lack of visibility of numpy.ma could be a second. In any case, I 
can't but agree with other posters: a NaN in an array usually means something 
went astray.

> PS I confess I do not understand NaNs.
> E.g., why could there not be a value np.miss
> that would be a NaN that represents a missing value?

You can't compare NaNs to anything. How do you know this np.miss is a masked 
value, when np.sqrt(-1.) is NaN ?






Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 11:09 AM Stefan Van der Walt apparently wrote:
> Masked arrays.  Using NaN's for missing values is dangerous.  You may  
> do some operation, which generates invalid results, and then you have  
> a mixed bag of missing and invalid values.

That rather evades my full question, I think?

In the case I mentioned,
I am filling an array inside a loop,
and the possible fill values are not constrained.
So I cannot mask based on value,
and I cannot mask based on position
(at least until after the computations are complete).

It seems to me that there are pragmatic reasons
why people work with NaNs for missing values,
that perhaps shd not be dismissed so quickly.
But maybe I am overlooking a simple solution.

Alan

PS I confess I do not understand NaNs.
E.g., why could there not be a value np.miss
that would be a NaN that represents a missing value?
Are all NaNs already assigned standard meanings?



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Stefan Van der Walt
On 19 Sep 2008, at 16:07 , Alan G Isaac wrote:
> On 9/19/2008 4:35 AM David Cournapeau apparently wrote:
>> I never use NaN as missing value
>
> What do you use?

Masked arrays.  Using NaN's for missing values is dangerous.  You may  
do some operation, which generates invalid results, and then you have  
a mixed bag of missing and invalid values.
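
The "mixed bag" can be made concrete. In the sketch below, which assumes a recent NumPy, a plain boolean array stands in for the mask (an assumption of this example, to keep the bookkeeping explicit), and the data values are made up.

```python
import numpy as np

# nan as the "missing" marker: the second entry was never measured.
x = np.array([0.0, np.nan, 4.0])
with np.errstate(invalid="ignore"):
    y = x / x                          # 0/0 is an invalid operation and also yields nan
ambiguous = np.isnan(y)                # two nans, but only one of them was "missing"

# Separate mask: "missing" lives in its own boolean array.
data = np.array([0.0, 0.0, 4.0])       # placeholder 0.0 in the missing slot
missing = np.array([False, True, False])
with np.errstate(invalid="ignore"):
    z = data / data
invalid = np.isnan(z) & ~missing       # nan produced by the computation itself

print(ambiguous, invalid)
```

With the separate mask, the entry that went wrong in the computation can still be told apart from the entry that was never there.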

Cheers
Stéfan



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Alan G Isaac
On 9/19/2008 4:35 AM David Cournapeau apparently wrote:
> I never use NaN as missing value

What do you use?

Recently I needed to fill a 2d array with values
from computations that could "go wrong".
I created an array of NaN and then replaced
the elements where the computation produced
a useful value.  I then applied ``nanmax``,
to get the maximum of the useful values.
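
The approach described here can be sketched as follows, assuming a recent NumPy. The function ``risky`` is made up to stand in for a computation that sometimes fails.

```python
import numpy as np

def risky(i, j):
    """Made-up computation that fails for some inputs."""
    if (i + j) % 3 == 0:
        raise ValueError("computation went wrong")
    return float(i * j)

out = np.full((3, 4), np.nan)      # start with nan everywhere
for i in range(3):
    for j in range(4):
        try:
            out[i, j] = risky(i, j)
        except ValueError:
            pass                   # leave the nan in place

best = np.nanmax(out)              # maximum over the values that were filled in
print(best)
```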

What should I have done?

Thanks,
Alan Isaac



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Stéfan van der Walt
2008/9/19 David Cournapeau <[EMAIL PROTECTED]>:
> But cannot this be fixed at the python level of the max function ? I

Why shouldn't we have "nanmin"-like behaviour for the C min itself?
I'd rather have a specialised function to deal with the rare kinds of
datasets where NaNs are guaranteed never to occur.

> But on my numpy, it looks like nan breaks min/max, they are not ignored:

Yes, that's the problem.

Cheers
Stéfan


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Peter Saffrey wrote:
>
> I've posted my test code below, which gives me the results:
>
> $ ./arrayspeed3.py
> list build time: 0.01
> list median time: 0.01
> array nanmedian time: 0.36
>
> I must have done something wrong to hobble nanmedian in this way... I'm quite
> new to numpy, so feel free to point out any obviously egregious errors.

Ok: it is "pathological", and can be done better :)

First:

> for natest in natests:
>   thismed = nanmedian(natest, axis=1)
>   namedians.append(thismed)

^^^ Here, you are doing nanmedian on a direction with 3 elements: this
will be slow in numpy, because numpy involves some relatively heavy
machinery to run on arrays. The machinery pays off for 'big' arrays, but
for really small arrays like these, lists can be (and often are) faster.

Still, it is indeed really slow for your case; when I fixed nanmean and
co, I did not know much about numpy, I just wanted them to give the
right answer :) I think this can be made faster, especially for your case
(where the axis along which the median is computed is really small).
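
One way to reduce the per-call overhead is to make a single call over the whole array instead of one call per block. The sketch below is an assumption-laden illustration, not the fix discussed in the ticket: it uses ``np.nanmedian`` from a recent NumPy rather than the ``scipy.stats`` function from the test script, and it generates its own random data with the same shape and nan rate.

```python
import warnings
import numpy as np

rng = np.random.default_rng(0)
natests = rng.random((100, 100, 3))
natests[natests > 0.9] = np.nan          # same shape and nan rate as the test script

with warnings.catch_warnings():
    # Rows that are entirely nan give nan plus a RuntimeWarning; silence it here.
    warnings.simplefilter("ignore", RuntimeWarning)

    # One call per 100x3 block, as in the original loop.
    looped = np.array([np.nanmedian(block, axis=1) for block in natests])

    # One call for the whole array, reducing over the last (length-3) axis.
    vectorized = np.nanmedian(natests, axis=-1)

print(looped.shape, vectorized.shape)
```

Both forms give the same medians; the second pays the array machinery cost once instead of a hundred times.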

I opened a bug:

http://scipy.org/scipy/scipy/ticket/740

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Peter Saffrey wrote:
> Pierre GM  gmail.com> writes:
>
>> I think there were some changes on the C side of numpy between 1.0 and 1.1, 
>> you may have to recompile scipy and matplotlib from sources. What versions 
>> are you using for those 2 packages ?
>>
>
> $ dpkg -l | grep scipy
> ii  python-scipy            0.6.0-8ubuntu1   scientific tools for Python
>
> $ dpkg -l | grep matplotlib
> ii  python-matplotlib       0.91.2-0ubuntu1  Python based plotting system in a style simi
> ii  python-matplotlib-data  0.91.2-0ubuntu1  Python based plotting system (data package)
> ii  python-matplotlib-doc   0.91.2-0ubuntu1  Python based plotting system (documentation

If you build numpy from sources, please don't install it into /usr ! It
will more than likely break everything which depends on numpy, as well
as your debian installation (because you will overwrite packages handled
by dpkg). You should really install in a local directory, outside /usr.

You will have to install scipy and matplotlib in any case, too.

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Peter Saffrey
Pierre GM  gmail.com> writes:

> I think there were some changes on the C side of numpy between 1.0 and 1.1, 
> you may have to recompile scipy and matplotlib from sources. What versions 
> are you using for those 2 packages ?
> 

$ dpkg -l | grep scipy
ii  python-scipy            0.6.0-8ubuntu1   scientific tools for Python

$ dpkg -l | grep matplotlib
ii  python-matplotlib       0.91.2-0ubuntu1  Python based plotting system in a style simi
ii  python-matplotlib-data  0.91.2-0ubuntu1  Python based plotting system (data package)
ii  python-matplotlib-doc   0.91.2-0ubuntu1  Python based plotting system (documentation

Peter



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Peter Saffrey
David Cournapeau  ar.media.kyoto-u.ac.jp> writes:

> It may be that nanmedian is slow. But I would sincerly be surprised if
> it were slower than python list, except for some pathological cases, or
> maybe a bug in nanmedian. What do your data look like ? (size, number of
> nan, etc...)
> 

I've posted my test code below, which gives me the results:

$ ./arrayspeed3.py
list build time: 0.01
list median time: 0.01
array nanmedian time: 0.36

I must have done something wrong to hobble nanmedian in this way... I'm quite
new to numpy, so feel free to point out any obviously egregious errors.

Peter

===

from numpy import array, nan, inf
from pylab import rand
from time import clock
from scipy.stats.stats import nanmedian

import pdb
_pdb = pdb.Pdb()
breakpoint = _pdb.set_trace

def my_median(vallist):
    num_vals = len(vallist)
    vallist.sort()
    if num_vals % 2 == 1: # odd
        index = (num_vals - 1) / 2
        return vallist[index]
    else: # even
        index = num_vals / 2
        return (vallist[index] + vallist[index - 1]) / 2

numtests = 100
testsize = 100
pointlen = 3

t0 = clock()
natests = rand(numtests,testsize,pointlen)
# have to start with inf because list.remove(nan) doesn't remove nan
natests[natests > 0.9] = inf
tests = natests.tolist()
natests[natests==inf] = nan
for test in tests:
    for point in test:
        if inf in point:
            point.remove(inf)
t1 = clock()
print "list build time:", t1-t0


t0 = clock()
allmedians = []
for test in tests:
    medians = [ my_median(x) for x in test ]
    allmedians.append(medians)
t1 = clock()
print "list median time:", t1-t0

t0 = clock()
namedians = []
for natest in natests:
    thismed = nanmedian(natest, axis=1)
    namedians.append(thismed)
t1 = clock()
print "array nanmedian time:", t1-t0





Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Peter Saffrey wrote:
>
> I rejoiced when I saw this answer, because it looks like a function I can just
> drop in and it works. Unfortunately, nanmedian seems to be quite a bit slower
> than just using lists (ignoring nan values from my experiments) and a 
> home-brew
> implementation of median. I was mostly using numpy for speed...

It may be that nanmedian is slow. But I would sincerely be surprised if
it were slower than python list, except for some pathological cases, or
maybe a bug in nanmedian. What do your data look like ? (size, number of
nan, etc...)

I quickly benchmarked on relatively small dataset (a few thousand
samples with a few random nan), and nanmedian is "only" a few times
slower than median.

>
> I would like to try the masked array approach, but the Ubuntu packages for 
> scipy
> and matplotlib depend on numpy. Does anybody know whether I can naively do 
> "sudo
> python setup.py install" on a more modern numpy without disturbing scipy and
> matplotlib, or do I need to uninstall all three packages and install them
> manually from source?

My advice would be to never ever install a package from source into
/usr. This will cause trouble. The way I do it is to install everything
from sources into $HOME/local (of course, any directory you have regular
write access to will do).

>
> On my 64 bit machine, the Ubuntu numpy package is even more out of date:
>
> $ dpkg -l | grep numpy
> ii  python-numpy   1:1.0.4-6ubuntu3 
>
> Does anybody know why this is?

Yes, Ubuntu updates every 6 months, most recently last April. Numpy
1.1.0 (the first version after 1.0.4) was released in May. Also, Ubuntu
syncs from Debian, generally 4-5 months before the Ubuntu release date. So
even if Debian were to release a package the day we release a new
version, Ubuntu will be one year late.

I personally think that the solution would be to provide our own
up-to-date .deb, but this is a lot of work. I think Ondrej did some work
related to that; recent tools like the openSUSE Build Service and
Launchpad PPAs make it a bit easier, too (for the build part, at least;
you still need to know how to build rpm/deb).

cheers,

David



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 05:51:55 Peter Saffrey wrote:

> I would like to try the masked array approach, but the Ubuntu packages for
> scipy and matplotlib depend on numpy. Does anybody know whether I can
> naively do "sudo python setup.py install" on a more modern numpy without
> disturbing scipy and matplotlib, or do I need to uninstall all three
> packages and install them manually from source?

I think there were some changes on the C side of numpy between 1.0 and 1.1, 
you may have to recompile scipy and matplotlib from sources. What versions 
are you using for those 2 packages ?


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Peter Saffrey
David Cournapeau  ar.media.kyoto-u.ac.jp> writes:

> You can use nanmean (from scipy.stats):
> 

I rejoiced when I saw this answer, because it looks like a function I can just
drop in and it works. Unfortunately, nanmedian seems to be quite a bit slower
than just using lists (ignoring nan values from my experiments) and a home-brew
implementation of median. I was mostly using numpy for speed...

I would like to try the masked array approach, but the Ubuntu packages for scipy
and matplotlib depend on numpy. Does anybody know whether I can naively do "sudo
python setup.py install" on a more modern numpy without disturbing scipy and
matplotlib, or do I need to uninstall all three packages and install them
manually from source?

On my 64 bit machine, the Ubuntu numpy package is even more out of date:

$ dpkg -l | grep numpy
ii  python-numpy   1:1.0.4-6ubuntu3 

Does anybody know why this is? I might be willing to help bring the repository
up to date, if anybody can give me pointers on how to do this.

Peter



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Stéfan van der Walt wrote:
>
> So am I.  In all my use cases, NaNs indicate trouble.

Yes, so I would like to have the opinion of people whose usage differs
from ours.
>
> Because we have x.max() silently ignoring NaNs, which causes a lot of
> head-scratching, swearing and failed experiments.

But can't this be fixed at the Python level of the max function? I
think it is expected that the low-level C functions ignore NaN or give
bogus results when it is present. After all, if you use the libc sort
with nan, or sort in C++ on a vector of double, it will not work either.

But on my numpy, it looks like nan breaks min/max, they are not ignored:

np.min(np.array([0, np.nan, 1]))
-> 1.0 # bogus

np.min(np.array([0, np.nan, 2]))
-> 2.0 # ok

np.min(np.array([0, np.nan, -1]))
-> -1.0 # ok

np.max(np.array([0, np.nan, -1]))
-> -1.0 # bogus

Which only makes sense when you guess how they are implemented in C...
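
One comparison loop that happens to reproduce exactly these four outputs is sketched below in plain Python. It is a guess for illustration only and is not taken from NumPy's C source; the function names ``naive_min`` and ``naive_max`` are hypothetical. The key point is that every comparison involving nan is false, so a nan first displaces the running result and is then displaced itself by whatever comes next.

```python
import math

def naive_min(values):
    """Guessed reduction: keep the running minimum only while `current <= x` holds."""
    current = values[0]
    for x in values[1:]:
        if not (current <= x):     # any comparison with nan is False
            current = x
    return current

def naive_max(values):
    """Same guess, mirrored for the maximum."""
    current = values[0]
    for x in values[1:]:
        if not (current >= x):
            current = x
    return current

nan = math.nan
print(naive_min([0, nan, 1]), naive_min([0, nan, 2]), naive_min([0, nan, -1]))
print(naive_max([0, nan, -1]))
print(naive_min([nan, 0, 1]))      # same kind of data, different order
```

With this loop the answer depends on where the nan sits, which is the order-dependent behaviour the examples above show.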

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Stéfan van der Walt
2008/9/19 David Cournapeau <[EMAIL PROTECTED]>:
> Stéfan van der Walt wrote:
>>
>> I agree completely.
>
> Me too, but I am extremely biased toward nan is always bogus by my own
> usage of numpy/scipy (I never use NaN as missing value, and nan is
> always caused by divide by 0 and co).

So am I.  In all my use cases, NaNs indicate trouble.

> Why ?

Because we have x.max() silently ignoring NaNs, which causes a lot of
head-scratching, swearing and failed experiments.

Cheers
Stéfan


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 04:31:38 David Cournapeau wrote:
> Pierre GM wrote:
> > That said, numpy.nanmin, numpy.nansum... don't come with the heavy
> > machinery of numpy.ma, and are therefore faster.
> > I'm really going to have to learn C.
>
> FWIW, nanmean/nanmean/etc... are written in python,

I know. I was more dreading the time when MaskedArrays would have to be ported 
to C. In a way, that would probably simplify a few issues. OTOH, I don't 
really see it happening any time soon.


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Stéfan van der Walt wrote:
>
> I agree completely.

Me too, but I am extremely biased toward nan is always bogus by my own
usage of numpy/scipy (I never use NaN as missing value, and nan is
always caused by divide by 0 and co).

I like the idea of sort raising an exception by default on NaN. It breaks
the API; OTOH, I can't see a good use of sort with NaN, since sort does not
actually sort the values in that case: we would be breaking the API of a
broken function.

>
> Unfortunately, this needs to happen at the C level. 

Why ?

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Pierre GM wrote:
> That said, numpy.nanmin, numpy.nansum... don't come with the heavy machinery 
> of numpy.ma, and are therefore faster. 
> I'm really going to have to learn C.
>   

FWIW, nanmean/nanmean/etc... are written in python,

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 04:10:24 Anne Archibald wrote:

> (is there a convenience
> function that makes a masked array with a mask everywhere the data is
> nan?).

numpy.ma.fix_invalid, that masks your Nans and Infs and sets the underlying 
data to some filling value. That way, you don't carry NaNs/Infs along.
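
A short sketch of that convenience function, assuming a recent NumPy; the sample data and the fill value ``0.0`` are made up. ``numpy.ma.masked_invalid`` is shown next to it for comparison, since it adds the mask without overwriting the data.

```python
import numpy as np

raw = np.array([1.0, np.nan, 3.0, np.inf, 5.0])

# fix_invalid masks nan/inf and overwrites them in the data with the fill value.
fixed = np.ma.fix_invalid(raw, fill_value=0.0)

# masked_invalid only adds the mask; the nan/inf stay in the underlying data.
masked = np.ma.masked_invalid(raw)

print(fixed.mask, fixed.data)
print(fixed.mean(), np.ma.median(masked))
```

Either way, the reductions on the masked array skip the masked entries.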

> I am assuming that appropriate masked sort/amax/maximum/mean/median
> exist already. They're definitely needed, so how much effort is it
> worth putting in to duplicate that functionality with nans instead of
> masked elements?

My opinion indeed. The MaskedArray.sort method has an extra flag that lets you 
decide whether you want masked data at the beginning or the end of your 
array.
That said, numpy.nanmin, numpy.nansum... don't come with the heavy machinery 
of numpy.ma, and are therefore faster. 
I'm really going to have to learn C.
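
The sort flag mentioned above is the ``endwith`` argument of ``MaskedArray.sort``. A minimal sketch, assuming a recent NumPy, with made-up values in which ``99.0`` plays the masked entry:

```python
import numpy as np

values = [3.0, 99.0, 1.0, 2.0]
mask = [False, True, False, False]       # 99.0 is the masked (missing) entry

at_end = np.ma.array(values, mask=mask)
at_end.sort(endwith=True)                # masked entries go last (the default)

at_start = np.ma.array(values, mask=mask)
at_start.sort(endwith=False)             # masked entries go first

print(at_end)
print(at_start)
```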



Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Stéfan van der Walt
2008/9/19 Anne Archibald <[EMAIL PROTECTED]>:
> I think the numpy attitude to nans should be that they are unexpected
> bogus values that signify that something went wrong with the
> calculation somewhere. They can be left in place for most operations,
> but any operation that depends on the value should (ideally) return
> nan, or failing that, raise an exception.

I agree completely.

> I am assuming that appropriate masked sort/amax/maximum/mean/median
> exist already. They're definitely needed, so how much effort is it
> worth putting in to duplicate that functionality with nans instead of
> masked elements?

Unfortunately, this needs to happen at the C level.  Is anyone reading
this willing to spend some time taking care of the issue?  It's an
important one.

Stéfan


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Anne Archibald
2008/9/19 Pierre GM <[EMAIL PROTECTED]>:
> On Friday 19 September 2008 03:11:05 David Cournapeau wrote:
>
>> Hm, I am always puzzled when I think about nan handling :) It always
>> seem there is not good answer.
>
> Which is why we have masked arrays, of course ;)

I think the numpy attitude to nans should be that they are unexpected
bogus values that signify that something went wrong with the
calculation somewhere. They can be left in place for most operations,
but any operation that depends on the value should (ideally) return
nan, or failing that, raise an exception. (If users want exceptions
all the time, that's what seterr is for.) If people want to flag bad
data, let's tell them to use masked arrays.
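
The exception behaviour referred to here can be sketched with ``np.errstate``, the scoped form of ``np.seterr``. This assumes a recent NumPy. The sketch also shows a limitation: the exception fires when an operation creates a nan, not when a nan is already in the data.

```python
import numpy as np

caught = False
with np.errstate(invalid="raise"):       # scoped alternative to np.seterr(invalid="raise")
    try:
        np.sqrt(np.array([-1.0]))        # invalid operation: would otherwise yield nan
    except FloatingPointError:
        caught = True

    # A nan that is already in the data passes through without an exception.
    passed_through = np.array([np.nan]) + 1.0

print(caught, passed_through)
```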

So by this rule amax/maximum/mean/median should all return nan when
there's a nan in their input; I don't think it's reasonable for sort
to return an array full of nans, so I think its default behaviour
should be to raise an exception if there's a nan. It's valuable (for
example in median) to be able to sort them all to the end, but I don't
think this should be the default. If people want nanmin, I would be
tempted to tell them to use masked arrays (is there a convenience
function that makes a masked array with a mask everywhere the data is
nan?).
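
For comparison, the rule proposed in this paragraph can be checked against a current installation. The sketch assumes a recent NumPy release, where these reductions return nan when the input contains one; the behaviour at the time of this thread was different, as the examples elsewhere in the thread show.

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0])

propagating = [np.max(a), np.mean(a), np.median(a)]         # nan in, nan out
ignoring = [np.nanmax(a), np.nanmean(a), np.nanmedian(a)]   # nan skipped

print(propagating)
print(ignoring)
```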

I am assuming that appropriate masked sort/amax/maximum/mean/median
exist already. They're definitely needed, so how much effort is it
worth putting in to duplicate that functionality with nans instead of
masked elements?

Anne


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread Pierre GM
On Friday 19 September 2008 03:11:05 David Cournapeau wrote:

> Hm, I am always puzzled when I think about nan handling :) It always
> seem there is not good answer.

Which is why we have masked arrays, of course ;)


Re: [Numpy-discussion] Medians that ignore values

2008-09-19 Thread David Cournapeau
Anne Archibald wrote:
>
> Well, for example, you might ask that all the non-nan elements be in
> order, even if you don't specify where the nan goes.


Ah, there are two problems, then:
- sort
- how median uses sort.

For sort, I don't know how sort speed would be influenced by handling
nan. In a way, calling sort with nan inside is a user error (if you take
the POV that nans are not comparable), but nans are used for all kinds of
purposes, hence maybe having a nansort would be nice. OTOH (I took a look
at this when I fixed nanmean and co a while ago in scipy), matlab and R
treat sort differently than mean and co.

I am puzzled by this:
- R sorts arrays with nan as you want by default (nan can be ignored,
put in front or at the end of the array).
- R max does not ignore nan by default.
- R median does not ignore nan by default.

I don't know how to be consistent here. I don't think we are
consistent by having max/amax/etc... ignoring nan but sort not ignoring
it. OTOH, R is not consistent either.

>
> You can always just set numpy to raise an exception whenever it comes
> across a nan. In fact, apart from the difficulty of correctly frobbing
> numpy's floating-point handling, how reasonable is it for (say) median
> to just run as it is now, but if an exception is thrown, fall back to
> a nan-aware version?

It would be different from the current nan vs usual function behavior
for median/mean/etc...: why should sort handle nan by default, but not
the other functions ? For mean/std/variance/median, if having nan is an
error, you see it in the result (once we fix our median), but not with sort.

Hm, I am always puzzled when I think about nan handling :) It always
seems there is no good answer.

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Anne Archibald
2008/9/19 David Cournapeau <[EMAIL PROTECTED]>:
> Anne Archibald wrote:
>>
>> That was in amax/amin. Pretty much every other function that does
>> comparisons needs to be fixed to work with nans. In some cases it's
>> not even clear how: where should a sort put the nans in an array?
>
> The problem is more on how the functions use sort than sort itself in
> the case of median. There can't be a 'good' way to put nan in soft, for
> example, since nans cannot be ordered.

Well, for example, you might ask that all the non-nan elements be in
order, even if you don't specify where the nan goes.

> I don't know about the best strategy: either we fix every function using
> comparison, handling nan as a special case as you mentioned, or there
> may be a more clever thing to do to avoid special casing everywhere. I
> don't have a clear idea of how many functions rely on ordering in numpy.

You can always just set numpy to raise an exception whenever it comes
across a nan. In fact, apart from the difficulty of correctly frobbing
numpy's floating-point handling, how reasonable is it for (say) median
to just run as it is now, but if an exception is thrown, fall back to
a nan-aware version?
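
The fallback idea can be sketched as below, assuming a recent NumPy. Two substitutions are made in this sketch and are assumptions rather than what is proposed above: an explicit ``np.isnan`` check replaces the floating-point exception, and ``np.nanmedian`` plays the nan-aware version. The helper name ``median_with_fallback`` is hypothetical.

```python
import numpy as np

def median_with_fallback(values, axis=None):
    """Hypothetical helper: plain median, nan-aware median only when needed."""
    values = np.asarray(values, dtype=float)
    if np.isnan(values).any():
        # Slow path: at least one nan, so use the nan-ignoring version.
        return np.nanmedian(values, axis=axis)
    # Fast path: no nans, the ordinary median is safe.
    return np.median(values, axis=axis)

clean = median_with_fallback([1.0, 3.0, 2.0])
dirty = median_with_fallback([1.0, np.nan, 3.0])
print(clean, dirty)
```

The check costs one extra pass over the data, which is the price of not relying on exceptions.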

Anne


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread David Cournapeau
Anne Archibald wrote:
>
> That was in amax/amin. Pretty much every other function that does
> comparisons needs to be fixed to work with nans. In some cases it's
> not even clear how: where should a sort put the nans in an array?

The problem is more in how the functions use sort than in sort itself, in
the case of median. There can't be a 'good' way to place nan in a sort, for
example, since nans cannot be ordered.

I don't know about the best strategy: either we fix every function using
comparison, handling nan as a special case as you mentioned, or there
may be a more clever thing to do to avoid special casing everywhere. I
don't have a clear idea of how many functions rely on ordering in numpy.

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Anne Archibald
2008/9/18 David Cournapeau <[EMAIL PROTECTED]>:
> Anne Archibald wrote:
>>
>> I don't think I agree:
>>
>> In [4]: np.median([1,3,nan])
>> Out[4]: 3.0
>>
>> In [5]: np.median([1,nan,3])
>> Out[5]: nan
>>
>> In [6]: np.median([nan,1,3])
>> Out[6]: 1.0
>>
>
> I was referring to the fact that if you have nan in your array, you
> should use nanmean if you want to ignore them correctly. Now, the
> different behavior depending on the order of items in the arrays is
> indeed buggy, I thought this was fixed.

That was in amax/amin. Pretty much every other function that does
comparisons needs to be fixed to work with nans. In some cases it's
not even clear how: where should a sort put the nans in an array? I
suppose some enterprising soul should write up a fileful of tests
making sure that all numpy's functions do something sane with arrays
containing nans...

Anne


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread David Cournapeau
Anne Archibald wrote:
>
> I don't think I agree:
>
> In [4]: np.median([1,3,nan])
> Out[4]: 3.0
>
> In [5]: np.median([1,nan,3])
> Out[5]: nan
>
> In [6]: np.median([nan,1,3])
> Out[6]: 1.0
>   

I was referring to the fact that if you have nan in your array, you
should use nanmean if you want to ignore them correctly. Now, the
different behavior depending on the order of items in the arrays is
indeed buggy, I thought this was fixed.

cheers,

David


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Anne Archibald
2008/9/18 David Cournapeau <[EMAIL PROTECTED]>:
> Peter Saffrey wrote:
>>
>> Is this the correct behavior for median with nan?
>
> That's the expected behavior, at least :) (this is also the expected
> behavior of most math packages I know, including matlab and R, so this
> should not be too surprising if you have used those).

I don't think I agree:

In [4]: np.median([1,3,nan])
Out[4]: 3.0

In [5]: np.median([1,nan,3])
Out[5]: nan

In [6]: np.median([nan,1,3])
Out[6]: 1.0

I think the expected behaviour would be for all of these to return nan.
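
For comparison, a sketch of an order-independence check. As far as I know, NumPy releases much newer than the ones in this thread return nan from np.median for every ordering; on the versions discussed here the results differ as shown above.

```python
import itertools
import numpy as np

values = [1.0, 3.0, np.nan]

# Median of every ordering of the same three values.
results = [np.median(list(p)) for p in itertools.permutations(values)]

# True only if the result does not depend on where the nan sits.
order_independent = all(np.isnan(r) for r in results)
print(order_independent)
```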

Anne


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread David Cournapeau
Peter Saffrey wrote:
>
> Is this the correct behavior for median with nan? 

That's the expected behavior, at least :) (this is also the expected
behavior of most math packages I know, including matlab and R, so this
should not be too surprising if you have used those).

> Is there a fix for 
> this or am I going to have to settle with using lists?

You can use nanmedian (from scipy.stats):

>>> stats.nanmedian(np.array([1, np.nan, 3, 9]))
3
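
A hedged equivalent for newer installations: as far as I know, later SciPy releases dropped stats.nanmedian, while later NumPy releases provide np.nanmedian, which also accepts an axis argument. The sketch assumes such a NumPy version.

```python
import numpy as np

a = np.array([1, np.nan, 3, 9])

# nan is skipped; the median of the remaining values 1, 3, 9 is 3.
m = np.nanmedian(a)
print(m)
```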

cheers,

David



Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Charles R Harris
On Thu, Sep 18, 2008 at 12:23 PM, Pierre GM <[EMAIL PROTECTED]> wrote:

> On Thursday 18 September 2008 13:31:18 Peter Saffrey wrote:
> > The version in the Ubuntu package repository. It says 1:1.0.4-6ubuntu3.
>
> So it's 1.0 ? It's fairly old, that would explain.
>
> > > if you don't give an axis
> > > parameter, you should get the median of the flattened array, therefore a
> > > scalar, not an array.
> >
> > Not for my version.
>
> Indeed. Looks like the default axis changed from 0 in 1.0 to None in the
> incoming 1.2. But that's a detail at this point.
>
> > > Anyway: you should use ma.median for masked arrays. Else, you're just
> > > keeping the NaNs where they were.
> >
> > That will be the problem. My version does not have median or mean methods
> > for masked arrays, only the average() method.
>
> The method mean has always been around for masked arrays, so has the
> corresponding function. But I'm surprised, median has been in numpy.ma.extras
> for a while. Maybe not 1.0...
>
> > According to this page:
> >
> > http://www.scipy.org/Download
> >
> > 1.1.0 is the latest release.
>
> You need to update your internet ;) 1.1.1 was released 6 weeks ago.
>

The page had a typo, I've fixed it.

Chuck


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Pierre GM
On Thursday 18 September 2008 13:31:18 Peter Saffrey wrote:
> The version in the Ubuntu package repository. It says 1:1.0.4-6ubuntu3.

So it's 1.0 ? It's fairly old, that would explain.

> > if you don't give an axis
> > parameter, you should get the median of the flattened array, therefore a
> > scalar, not an array.
>
> Not for my version.

Indeed. Looks like the default axis changed from 0 in 1.0 to None in the 
incoming 1.2. But that's a detail at this point.
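
To make the axis difference concrete, here is a sketch assuming a NumPy version where the default is axis=None, so an explicit axis=0 reproduces the old per-column behaviour. The array is illustrative.

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

flat = np.median(a)                # axis=None: median of all six values
per_column = np.median(a, axis=0)  # what the old default of axis=0 gave
per_row = np.median(a, axis=1)     # one median per row

print(flat, per_column, per_row)
```

Passing the axis explicitly avoids depending on which default a given installation uses.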

> > Anyway: you should use ma.median for masked arrays. Else, you're just
> > keeping the NaNs where they were.
>
> That will be the problem. My version does not have median or mean methods
> for masked arrays, only the average() method.

The method mean has always been around for masked arrays, so has the 
corresponding function. But I'm surprised, median has been in numpy.ma.extras 
for a while. Maybe not 1.0...

> According to this page:
>
> http://www.scipy.org/Download
>
> 1.1.0 is the latest release. 

You need to update your internet ;) 1.1.1 was released 6 weeks ago.

> Do I need to use an SVN build to get the 
> ma.median functionality?

No, you can install 1.1.1, that should work. 
Note that I just fixed a bug in median in SVN (it would fail when trying to 
get the median of a 2D array with axis=1), so you may want to check this one 
instead if you feel like it. You can still use 1.1.1: as a quick workaround 
for the aforementioned bug, use ma.median(a.T, axis=0) instead of 
ma.median(a, axis=1) when working with 2D arrays.
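
The two calls in that workaround should agree on any release where both work. A small sketch with made-up data, using np.ma.masked_invalid to mask the nan entry:

```python
import numpy as np

# Two rows of three measurements; nan marks a missing value.
a = np.ma.masked_invalid([[1.0, np.nan, 3.0],
                          [4.0, 5.0, 6.0]])

direct = np.ma.median(a, axis=1)        # median along each row
workaround = np.ma.median(a.T, axis=0)  # same medians via the transpose

print(direct, workaround)
```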



Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Charles R Harris
On Thu, Sep 18, 2008 at 11:31 AM, Peter Saffrey <[EMAIL PROTECTED]> wrote:

> Pierre GM  gmail.com> writes:
>
> > Mmh, typo?
> >
>
> Yes, apologies. I was aiming for thorough, but ended up just careless. It's been
> a long day.
>
> > Ohoh. What version of numpy are you using ?
>
> The version in the Ubuntu package repository. It says 1:1.0.4-6ubuntu3.
>
> > if you don't give an axis
> > parameter, you should get the median of the flattened array, therefore a
> > scalar, not an array.
>
> Not for my version.
>
> >>> a = rand(10,3)
> >>> a
> array([[ 0.1269796 ,  0.43003978,  0.4700416 ],
>   [ 0.28867077,  0.85265318,  0.35908364],
>   [ 0.72967127,  0.41856028,  0.54724918],
>   [ 0.28821876,  0.69684144,  0.54647616],
>   [ 0.09592476,  0.83704808,  0.52425368],
>   [ 0.743552  ,  0.44433314,  0.7362179 ],
>   [ 0.4283931 ,  0.13305385,  0.68422292],
>   [ 0.68860674,  0.15057373,  0.99206493],
>   [ 0.31846329,  0.77237046,  0.986883  ],
>   [ 0.4578616 ,  0.4580833 ,  0.97754176]])
> >>> median(a.T)
> array([ 0.43003978,  0.35908364,  0.54724918,  0.54647616,  0.52425368,
>0.7362179 ,  0.4283931 ,  0.68860674,  0.77237046,  0.4580833 ])
>
> > Anyway: you should use ma.median for masked arrays. Else, you're just keeping
> > the NaNs where they were.
> >
>
> That will be the problem. My version does not have median or mean methods for
> masked arrays, only the average() method.
>
> According to this page:
>
> http://www.scipy.org/Download
>
> 1.1.0 is the latest release. Do I need to use an SVN build to get the ma.median
> functionality?
>

1.1.1 is the latest release and 1.2 is coming out shortly.

Chuck


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Peter Saffrey
Pierre GM  gmail.com> writes:

> Mmh, typo?
>

Yes, apologies. I was aiming for thorough, but ended up just careless. It's been
a long day.
 
> Ohoh. What version of numpy are you using ? 

The version in the Ubuntu package repository. It says 1:1.0.4-6ubuntu3.

> if you don't give an axis 
> parameter, you should get the median of the flattened array, therefore a 
> scalar, not an array.

Not for my version.

>>> a = rand(10,3)
>>> a
array([[ 0.1269796 ,  0.43003978,  0.4700416 ],
   [ 0.28867077,  0.85265318,  0.35908364],
   [ 0.72967127,  0.41856028,  0.54724918],
   [ 0.28821876,  0.69684144,  0.54647616],
   [ 0.09592476,  0.83704808,  0.52425368],
   [ 0.743552  ,  0.44433314,  0.7362179 ],
   [ 0.4283931 ,  0.13305385,  0.68422292],
   [ 0.68860674,  0.15057373,  0.99206493],
   [ 0.31846329,  0.77237046,  0.986883  ],
   [ 0.4578616 ,  0.4580833 ,  0.97754176]])
>>> median(a.T)
array([ 0.43003978,  0.35908364,  0.54724918,  0.54647616,  0.52425368,
0.7362179 ,  0.4283931 ,  0.68860674,  0.77237046,  0.4580833 ])

> Anyway: you should use ma.median for masked arrays. Else, you're just keeping 
> the NaNs where they were.
> 

That will be the problem. My version does not have median or mean methods for
masked arrays, only the average() method.

According to this page:

http://www.scipy.org/Download

1.1.0 is the latest release. Do I need to use an SVN build to get the ma.median
functionality?

Peter





Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Pierre GM
On Thursday 18 September 2008 10:59:12 Peter Saffrey wrote:
> I had looked at masked arrays, but couldn't quite get them to work.

That's unfortunate.

>  >>> from numeric import *

Mmh, typo?

>  >>> from pylab import rand
>  >>> a = rand(10,3)
>  >>> a[a > 0.8] = nan
>  >>> m = ma.masked_array(a, isnan(a))
>  >>> m

Another way would be m = ma.masked_where(a>0.8,a)
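
A sketch of that alternative on a tiny made-up array: masked_where masks the entries where the condition holds, without writing nan into the data first.

```python
import numpy as np

a = np.array([[0.1, 0.9, 0.5],
              [0.85, 0.2, 0.3]])

# Mask every entry greater than 0.8; the underlying data stay unchanged.
m = np.ma.masked_where(a > 0.8, a)

print(m.mask)
```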

> Remember I want medians of each triple, so I need to median the
> transposed matrix:
>  >>> median(m.T)

Ohoh. What version of numpy are you using ? if you don't give an axis 
parameter, you should get the median of the flattened array, therefore a 
scalar, not an array.
Anyway: you should use ma.median for masked arrays. Else, you're just keeping 
the NaNs where they were.
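
A sketch of that advice using the first three triples from Peter's example array, with nan standing in for the masked entries. Passing axis=1 gives one median per triple, so no transpose is needed; this assumes a numpy version that ships ma.median.

```python
import numpy as np

a = np.array([[5.97400164e-01, np.nan,         np.nan],
              [3.34623242e-01, 6.53582662e-02, 2.12298948e-01],
              [2.11879853e-01, np.nan,         3.57822574e-01]])

m = np.ma.masked_array(a, np.isnan(a))

# Masked entries are ignored, giving one median per row (triple).
medians = np.ma.median(m, axis=1)
print(medians)
```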


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread paul taney

Hi,

>  >>> median(m.T)
> array([  1.e+20,   2.12298948e-01, 3.57822574e-01,

I believe 1.e+20 is a reserved value and signifies the
missing value or NaN in your case.  That's the way it was in 
a Fortran77 package I worked with ten years ago...
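
For what it is worth, in numpy.ma the value 1.e+20 is the default fill_value for floating-point masked arrays: a placeholder substituted for masked entries when the array is filled, while masked-aware operations skip those entries. A short sketch with made-up data:

```python
import numpy as np

m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])

fill = m.fill_value     # default fill value for float data
filled = m.filled()     # masked entry replaced by the fill value
mean = m.mean()         # masked entry ignored: mean of 1.0 and 3.0

print(fill, filled, mean)
```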



--- On Thu, 9/18/08, Peter Saffrey <[EMAIL PROTECTED]> wrote:

> From: Peter Saffrey <[EMAIL PROTECTED]>
> Subject: Re: [Numpy-discussion] Medians that ignore values
> To: numpy-discussion@scipy.org
> Date: Thursday, September 18, 2008, 10:59 AM
>  physics.ucf.edu> writes:
> 
>  > Currently the only way you can handle NaNs is by using masked arrays.
>  > Create a mask by doing isfinite(a), then call the masked array
>  > median().  There's an example here:
>  >
>  > http://sd-2116.dedibox.fr/pydocweb/doc/numpy.ma/
>  >
> 
> I had looked at masked arrays, but couldn't quite get them to work.
> Generating them is fine (I've randomly introduced a few nan values into
> this array):
> 
>  >>> from numeric import *
>  >>> from pylab import rand
>  >>> a = rand(10,3)
>  >>> a[a > 0.8] = nan
>  >>> m = ma.masked_array(a, isnan(a))
>  >>> m
> array(data =
>   [[  5.97400164e-01   1.e+20   1.e+20]
>   [  3.34623242e-01   6.53582662e-02   2.12298948e-01]
>   [  2.11879853e-01   1.e+20   3.57822574e-01]
>   [  6.06911592e-01   1.96229341e-01   5.49953059e-02]
>   [  1.e+20   2.75493584e-01   4.70929957e-01]
>   [  2.92845118e-01   2.11261529e-02   3.49211381e-02]
>   [  7.11963636e-01   2.17277855e-01   5.45487384e-02]
>   [  5.20995579e-01   7.57676845e-01   1.e+20]
>   [  1.84189196e-01   7.58291436e-02   6.26567116e-01]
>   [  2.42083978e-01   1.e+20   2.30202562e-02]],
>mask =
>   [[False  True  True]
>   [False False False]
>   [False  True False]
>   [False False False]
>   [ True False False]
>   [False False False]
>   [False False False]
>   [False False  True]
>   [False False False]
>   [False  True False]],
>fill_value=1e+20)
> 
> 
> Remember I want medians of each triple, so I need to median the
> transposed matrix:
> 
>  >>> median(m.T)
> array([  1.e+20,   2.12298948e-01,   3.57822574e-01,
>   1.96229341e-01,   4.70929957e-01,   3.49211381e-02,
>   2.17277855e-01,   7.57676845e-01,   1.84189196e-01,
>   2.42083978e-01])
> 
> The first value is NaN, indicating that the median routine has failed to
> ignore the masked values. What have I missed?
> 
> Thanks,
> 
> Peter


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Peter Saffrey
   physics.ucf.edu> writes:

 > Currently the only way you can handle NaNs is by using masked arrays.
 > Create a mask by doing isfinite(a), then call the masked array
 > median().  There's an example here:
 >
 > http://sd-2116.dedibox.fr/pydocweb/doc/numpy.ma/
 >

I had looked at masked arrays, but couldn't quite get them to work. 
Generating them is fine (I've randomly introduced a few nan values into 
this array):

 >>> from numeric import *
 >>> from pylab import rand
 >>> a = rand(10,3)
 >>> a[a > 0.8] = nan
 >>> m = ma.masked_array(a, isnan(a))
 >>> m
array(data =
  [[  5.97400164e-01   1.e+20   1.e+20]
  [  3.34623242e-01   6.53582662e-02   2.12298948e-01]
  [  2.11879853e-01   1.e+20   3.57822574e-01]
  [  6.06911592e-01   1.96229341e-01   5.49953059e-02]
  [  1.e+20   2.75493584e-01   4.70929957e-01]
  [  2.92845118e-01   2.11261529e-02   3.49211381e-02]
  [  7.11963636e-01   2.17277855e-01   5.45487384e-02]
  [  5.20995579e-01   7.57676845e-01   1.e+20]
  [  1.84189196e-01   7.58291436e-02   6.26567116e-01]
  [  2.42083978e-01   1.e+20   2.30202562e-02]],
   mask =
  [[False  True  True]
  [False False False]
  [False  True False]
  [False False False]
  [ True False False]
  [False False False]
  [False False False]
  [False False  True]
  [False False False]
  [False  True False]],
   fill_value=1e+20)


Remember I want medians of each triple, so I need to median the 
transposed matrix:

 >>> median(m.T)
array([  1.e+20,   2.12298948e-01,   3.57822574e-01,
  1.96229341e-01,   4.70929957e-01,   3.49211381e-02,
  2.17277855e-01,   7.57676845e-01,   1.84189196e-01,
  2.42083978e-01])

The first value is NaN, indicating that the median routine has failed to 
ignore the masked values. What have I missed?

Thanks,

Peter


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread jh
> You might want to try isfinite() to first remove nan, +/- infinity 
> before doing that.
> numpy.median(a[numpy.isfinite(a)])

We just had this discussion a month or two ago, I think even on this
list, and continued it at the SciPy conference.

The problem with

numpy.median(a[numpy.isfinite(a)])

is that it breaks when you have a multi-dimensional array, such as an
array of 5000x3 as in this case, and take median down an axis.  The
example above flattens the array and eliminates the possibility of
taking the median down an axis in a single call, as the poster desires.

Currently the only way you can handle NaNs is by using masked arrays.
Create a mask by doing isfinite(a), then call the masked array
median().  There's an example here:

http://sd-2116.dedibox.fr/pydocweb/doc/numpy.ma/
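
A sketch of both points with a small made-up array. Note one assumption about the recipe: a mask marks the entries to ignore, so the sketch passes the inverse, ~np.isfinite(a), as the mask.

```python
import numpy as np

a = np.array([[1.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Boolean indexing returns a flat 1-D array: the axes are lost.
flat = a[np.isfinite(a)]
print(flat.shape)

# Masking keeps the 2-D shape, so a median down an axis still works.
m = np.ma.masked_array(a, mask=~np.isfinite(a))
down_columns = np.ma.median(m, axis=0)
print(down_columns)
```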

Note that our competitor language IDL does have a /nan flag to its
single median routine, making this common task much easier in that
language than ours.

--jh--


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Bruce Southey
Nadav Horesh wrote:
> I think you need to use masked arrays.
>
>   Nadav
>
>
> -Original Message-
> From: [EMAIL PROTECTED] on behalf of Peter Saffrey
> Sent: Thu 18-September-08 14:27
> To: numpy-discussion@scipy.org
> Subject: [Numpy-discussion] Medians that ignore values
>  
> I have data from biological experiments that is represented as a list of 
> about 5000 triples. I would like to convert this to a list of the median 
> of each triple. I did some profiling and found that numpy was about 
> 12 times faster for this application than using regular Python lists and 
> a list median implementation. I'll be performing quite a few 
> mathematical operations on these values, so using numpy arrays seems 
> sensible.
>
> The only problem is that my data has gaps in it - where an experiment 
> failed, a "triple" will not have three values. Some will have 2, 1 or 
> even no values. To keep the arrays regular so that they can be used by 
> numpy, is there some dummy value I can use to fill these gaps that will 
> be ignored by the median routine?
>
> I tried NaN for this, but as far as median is concerned, it counts as 
> infinity:
>
>  >>> from numpy import *
>  >>> median(array([1,3,nan]))
> 3.0
>  >>> median(array([1,nan,nan]))
> nan
>
> Is this the correct behavior for median with nan? Is there a fix for 
> this or am I going to have to settle with using lists?
>
> Thanks,
>
> Peter
Hi,
The counting of infinity is correct due to the implementation of IEEE 
Standard for Binary Floating-Point for Arithmetic (IEEE 754).

You might want to try isfinite() to first remove nan, +/- infinity 
before doing that.
numpy.median(a[numpy.isfinite(a)])
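
For a one-dimensional array this works as written. A short sketch with made-up values, which also shows that isfinite drops infinities along with nan:

```python
import numpy as np

a = np.array([1.0, np.nan, 3.0, np.inf])

finite = a[np.isfinite(a)]   # keeps only 1.0 and 3.0
result = np.median(finite)
print(result)
```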

Bruce


Re: [Numpy-discussion] Medians that ignore values

2008-09-18 Thread Nadav Horesh
I think you need to use masked arrays.

  Nadav


-Original Message-
From: [EMAIL PROTECTED] on behalf of Peter Saffrey
Sent: Thu 18-September-08 14:27
To: numpy-discussion@scipy.org
Subject: [Numpy-discussion] Medians that ignore values
 
I have data from biological experiments that is represented as a list of 
about 5000 triples. I would like to convert this to a list of the median 
of each triple. I did some profiling and found that numpy was about 
12 times faster for this application than using regular Python lists and 
a list median implementation. I'll be performing quite a few 
mathematical operations on these values, so using numpy arrays seems 
sensible.

The only problem is that my data has gaps in it - where an experiment 
failed, a "triple" will not have three values. Some will have 2, 1 or 
even no values. To keep the arrays regular so that they can be used by 
numpy, is there some dummy value I can use to fill these gaps that will 
be ignored by the median routine?

I tried NaN for this, but as far as median is concerned, it counts as 
infinity:

 >>> from numpy import *
 >>> median(array([1,3,nan]))
3.0
 >>> median(array([1,nan,nan]))
nan

Is this the correct behavior for median with nan? Is there a fix for 
this or am I going to have to settle with using lists?

Thanks,

Peter


[Numpy-discussion] Medians that ignore values

2008-09-18 Thread Peter Saffrey
I have data from biological experiments that is represented as a list of 
about 5000 triples. I would like to convert this to a list of the median 
of each triple. I did some profiling and found that numpy was about 
12 times faster for this application than using regular Python lists and 
a list median implementation. I'll be performing quite a few 
mathematical operations on these values, so using numpy arrays seems 
sensible.

The only problem is that my data has gaps in it - where an experiment 
failed, a "triple" will not have three values. Some will have 2, 1 or 
even no values. To keep the arrays regular so that they can be used by 
numpy, is there some dummy value I can use to fill these gaps that will 
be ignored by the median routine?

I tried NaN for this, but as far as median is concerned, it counts as 
infinity:

 >>> from numpy import *
 >>> median(array([1,3,nan]))
3.0
 >>> median(array([1,nan,nan]))
nan

Is this the correct behavior for median with nan? Is there a fix for 
this or am I going to have to settle with using lists?
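
One way the setup described above could look, sketched under assumptions: the variable names and data are invented, gaps are padded with nan to get a regular N x 3 array, and np.nanmedian comes from NumPy releases later than the one in this post. A triple with no values yields nan (NumPy also emits a RuntimeWarning for it, silenced here).

```python
import warnings
import numpy as np

# Hypothetical ragged input: some triples are missing values.
triples = [[1.0, 3.0, 2.0], [4.0, 6.0], [5.0], []]

# Pad every row to length 3 with nan so the array is regular.
padded = np.full((len(triples), 3), np.nan)
for i, t in enumerate(triples):
    padded[i, :len(t)] = t

with warnings.catch_warnings():
    # An all-nan row triggers a RuntimeWarning and gives nan.
    warnings.simplefilter("ignore", RuntimeWarning)
    medians = np.nanmedian(padded, axis=1)

print(medians)
```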

Thanks,

Peter