On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk < m.h.vankerkw...@gmail.com> wrote:

>> - If reductions/aggregations default to skipping missing elements, how is
>> it possible to express "NA propagating" versions, which are also useful,
>> if slightly less common?
>
> I have been playing with using a new `Mask(np.ndarray)` class for the
> mask, which does the actual mask propagation (i.e., all single-operand
> ufuncs just copy the mask, binary operations do `logical_or` and reductions
> do `logical_and.reduce`). This way the `Masked` class itself can generally
> apply a given operation on the data and the masks separately and then
> combine the two results (reductions are the exception in that `where` has
> to be set). Your particular example here could be solved with a different
> `Mask` class, for which reductions do `logical_or.reduce`.

I think it would be much better to use duck-typing for the mask as well, if possible, rather than a NumPy array subclass. This would facilitate using alternative mask implementations, e.g., distributed masks, sparse masks, bit-array masks, etc.

Are there use-cases for propagating masks separately from data? If not, it might make sense to only define mask operations along with data, which could be much simpler.

>> We may want to add a standard "skipna" argument on NumPy aggregations,
>> solely for the benefit of duck arrays (and dtypes with missing values). But
>> that could also be a source of confusion, especially if skipna=True refers
>> only to "true NA" values, not including NaN, which is used as an alias for NA
>> in pandas and elsewhere.
>
> It does seem `where` should suffice, no? If one wants to be super-fancy,
> we could allow it to be a callable, which, if a ufunc, gets used inside the
> loop (`where=np.isfinite` would be particularly useful).

Let me try to make the API issue more concrete. Suppose we have a MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.
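
To pin down the two targets, here is a minimal sketch using a plain data array and a separate boolean mask. It is not the proposed `Masked` or `Mask` class, just the arithmetic that the scheme quoted above would have to perform. The names `data`, `mask`, `total`, `skip_mask` and `prop_mask` are made up for illustration, and the `where` argument to `np.sum` assumes NumPy 1.17 or later.

```python
import numpy as np

# Hypothetical stand-in for a masked array holding [1, 2, NA].
data = np.array([1, 2, 0])               # the value under the mask is arbitrary
mask = np.array([False, False, True])    # True marks an NA element

# Data part: sum only the unmasked elements.
total = np.sum(data, where=~mask)

# Mask part, skipping flavour: the result is NA only if every element is NA.
skip_mask = np.logical_and.reduce(mask)

# Mask part, propagating flavour: the result is NA if any element is NA.
prop_mask = np.logical_or.reduce(mask)

print(total, skip_mask, prop_mask)
```

With `skip_mask` the result is the unmasked value 3, which is case (1). With `prop_mask` the result is flagged as NA, which is case (2).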

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1) with np.nansum and (2) with np.sum.
- With pandas and xarray, the default behavior is (1) and to get (2) you need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1). Maybe there isn't as strong a need for (2) as I thought?

Your proposal would be something like np.sum(array, where=np.ones_like(array))? This seems rather verbose for a common operation. Perhaps np.sum(array, where=True) would work, making use of broadcasting? (I haven't actually checked whether this is well-defined yet.)
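
For reference, a short sketch of what stock NumPy does today in the NaN-sentinel and MaskedArray cases. It only exercises plain `ndarray` and `np.ma`. It says nothing about how a future masked-array class would interpret `where`, and the variable names are illustrative. The `where` argument to `np.sum` assumes NumPy 1.17 or later.

```python
import numpy as np

# NaN used as a sentinel for NA.
x = np.array([1.0, 2.0, np.nan])
nan_skipped = np.nansum(x)        # ignores the NaN
nan_propagated = np.sum(x)        # the NaN taints the result

# An explicit boolean `where` gives the skipping behaviour as well.
finite_only = np.sum(x, where=np.isfinite(x))

# On a plain ndarray, a scalar `where=True` broadcasts to "use every element",
# so it matches np.sum(x).
with_where_true = np.sum(x, where=True)

# NumPy's current MaskedArray skips masked elements in reductions.
m = np.ma.masked_array([1, 2, 0], mask=[False, False, True])
masked_total = m.sum()

print(nan_skipped, nan_propagated, finite_only, with_where_true, masked_total)
```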
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion