Re: [Numpy-discussion] new MaskedArray class

Stephan Hoyer Sun, 23 Jun 2019 09:07:05 -0700

On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:


> - If reductions/aggregations default to skipping missing elements, how is
>> it possible to express "NA propagating" versions, which are also useful,
>> if slightly less common?
>>
>
> I have been playing with using a new `Mask(np.ndarray)` class for the
> mask, which does the actual mask propagation (i.e., all single-operand
> ufuncs just copy the mask, binary operations do `logical_or` and reductions
> do `logical_and.reduce`). This way the `Masked` class itself can generally
> apply a given operation on the data and the masks separately and then
> combine the two results (reductions are the exception in that `where` has
> to be set). Your particular example here could be solved with a different
> `Mask` class, for which reductions do `logical_or.reduce`.
>

I think it would be much better to use duck-typing for the mask as well, if
possible, rather than a NumPy array subclass. This would facilitate using
alternative mask implementations, e.g., distributed masks, sparse masks,
bit-array masks, etc.

Are there use-cases for propagating masks separately from data? If not, it
might make sense to only define mask operations along with data, which
could be much simpler.
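
For concreteness, here is a minimal sketch of the propagation rules quoted
above, written against plain boolean arrays rather than any `Mask` class.
I'm assuming True means "masked"; the variable names are only illustrative
and not part of anyone's proposal.

```python
import numpy as np

# Assumption: True marks a masked (missing) element.
mask_a = np.array([False, False, True])
mask_b = np.array([False, True, True])

# Single-operand ufuncs: the result mask is a copy of the input mask.
unary_mask = mask_a.copy()

# Binary operations: an element is masked if either operand is masked.
binary_mask = np.logical_or(mask_a, mask_b)

# Reductions that skip masked elements: masked only if every input is masked.
skipping_mask = np.logical_and.reduce(mask_a)

# "NA propagating" reductions: masked if any input is masked.
propagating_mask = np.logical_or.reduce(mask_a)
```

Nothing here needs the mask to be an ndarray subclass; any object supporting
these few operations would do, which is the point about duck-typing.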


> We may want to add a standard "skipna" argument on NumPy aggregations,
>> solely for the benefit of duck arrays (and dtypes with missing values). But
>> that could also be a source of confusion, especially if skipna=True refers
>> only to "true NA" values, not including NaN, which is used as an alias for NA
>> in pandas and elsewhere.
>>
>
> It does seem `where` should suffice, no? If one wants to be super-fancy,
> we could allow it to be a callable, which, if a ufunc, gets used inside the
> loop (`where=np.isfinite` would be particularly useful).
>
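
For reference, a small sketch of the spelling that works without a callable:
the predicate is evaluated up front and its boolean result is passed as
`where`. This assumes a NumPy version whose np.sum accepts `where` (1.17 or
later); the callable form suggested above remains hypothetical.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, np.inf])

# Evaluate the predicate eagerly; `where` takes a boolean array, not a callable.
finite_only = np.isfinite(values)

# Only elements where `finite_only` is True contribute to the sum.
finite_sum = np.sum(values, where=finite_only)
```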

Let me try to make the API issue more concrete. Suppose we have a
MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1)
with np.nansum and (2) with np.sum.
- With pandas and xarray, the default behavior is (1) and to get (2) you
need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1).
Maybe there isn't as strong a need for (2) as I thought?
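
A short sketch of the NumPy cases, for anyone who wants to check. It uses
NaN as the stand-in for NA and np.ma.masked_invalid to build the existing
MaskedArray; the variable names are mine.

```python
import numpy as np

# NaN stands in for NA in a plain float array.
values = np.array([1.0, 2.0, np.nan])

# (1) Sum ignoring the missing value.
skipping = np.nansum(values)

# (2) Sum tainted by the missing value: the result is NaN.
propagating = np.sum(values)

# The existing np.ma.MaskedArray: masked entries are skipped by sum().
masked = np.ma.masked_invalid(values)
masked_sum = masked.sum()
```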

Your proposal would be something like np.sum(array,
where=np.ones_like(array, dtype=bool))? This seems rather verbose for a common
operation. Perhaps np.sum(array, where=True) would work, making use of
broadcasting? (I haven't actually checked whether this is well-defined yet.)
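
As a rough check of the broadcasting part only, here is what a scalar `where`
does on a plain ndarray, assuming a NumPy version whose np.sum accepts
`where` (1.17 or later). It says nothing about how a new MaskedArray would
interpret `where=True`; that is still the open question.

```python
import numpy as np

data = np.array([1.0, 2.0, 5.0])
keep = np.array([True, True, False])

# An explicit boolean array selects which elements contribute.
partial = np.sum(data, where=keep)

# A scalar True broadcasts against the data, so every element is included.
broadcast_all = np.sum(data, where=True)

# The verbose equivalent with an explicit all-True boolean array.
explicit_all = np.sum(data, where=np.ones_like(data, dtype=bool))
```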

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
