Hi Allan, I'm not sure I would go too much by what the old MaskedArray class did. It indeed made an effort not to overwrite masked values with a new result, even to the extend of copying back masked input data elements to the output data array after an operation. But the fact that this is non-sensical if the dtype changes (or the units in an operation on quantities) suggests that this mental model simply does not work.
I think a sensible alternative mental model for the MaskedArray class is that all it does is forward any operations to the data it holds and separately propagate a mask, ORing elements together for binary operations, etc., and explicitly skipping masked elements in reductions (ideally using `where` to be as agnostic as possible about the underlying data, for which, e.g., setting masked values to `0` for `np.reduce.add` may or may not be the right thing to do - what if they are string?). With this mental picture, the underlying data are always have well-defined meaning: they have been operated on as if the mask did not exist. There then is also less reason to try to avoid getting it back to the user. As a concrete example (maybe Ben has others): in astropy we have a sigma-clipping average routine, which uses a `MaskedArray` to iteratively mask items that are too far off from the mean; here, the mask varies each iteration (an initially masked element can come back into play), but the data do not. All the best, Marten On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane <allanhald...@gmail.com> wrote: > On 6/21/19 2:37 PM, Benjamin Root wrote: > > Just to note, data that is masked isn't always garbage. There are plenty > > of use-cases where one may want to temporarily apply a mask for a set of > > computation, or possibly want to apply a series of different masks to > > the data. I haven't read through this discussion deeply enough, but is > > this new class going to destroy underlying masked data? and will it be > > possible to swap out masks? > > > > Cheers! > > Ben Root > > Indeed my implementation currently feels free to clobber the data at > masked positions and makes no guarantees not to. > > I'd like to try to support reasonable use-cases like yours though. A few > thoughts: > > First, the old np.ma.MaskedArray explicitly does not promise to preserve > masked values, with a big warning in the docs. I can't recall the > examples, but I remember coming across cases where clobbering happens. > So arguably your behavior was never supported, and perhaps this means > that no-clobber behavior is difficult to reasonably support. > > Second, the old np.ma.MaskedArray avoids frequent clobbering by making > lots of copies. Therefore, in most cases you will not lose any > performance in my new MaskedArray relative to the old one by making an > explicit copy yourself. I.e, is it problematic to have to do > > >>> result = MaskedArray(data.copy(), trial_mask).sum() > > instead of > > >>> marr.mask = trial_mask > >>> result = marr.sum() > > since they have similar performance? > > Third, in the old np.ma.MaskedArray masked positions are very often > "effectively" clobbered, in the sense that they are not computed. For > example, if you do "c = a+b", and then change the mask of c, the values > at masked position of the result of (a+b) do not correspond to the sum > of the masked values in a and b. Thus, by "unmasking" c you are exposing > nonsense values, which to me seems likely to cause heisenbugs. > > > In summary, by not making no-clobber guarantees and by strictly > preventing exposure of nonsense values, I suspect that: 1. my new code > is simpler and faster by avoiding lots of copies, and forces copies to > be explicit in user code. 2. disallowing direct modification of the mask > lowers the "API surface area" making people's MaskedArray code less > buggy and easier to read: Exposure of nonsense values by "unmasking" is > one less possibility to keep in mind. > > Best, > Allan > > > > On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane <allanhald...@gmail.com > > <mailto:allanhald...@gmail.com>> wrote: > > > > On 6/19/19 10:19 PM, Marten van Kerkwijk wrote: > > > Hi Allan, > > > > > > This is very impressive! I could get the tests that I wrote for my > > class > > > pass with yours using Quantity with what I would consider very > minimal > > > changes. I only could not find a good way to unmask data (I like > the > > > idea of setting the mask on some elements via `ma[item] = X`); is > this > > > on purpose? > > > > Yes, I want to make it difficult for the user to access the garbage > > values under the mask, which are often clobbered values. The only > way to > > "remove" a masked value is by replacing it with a new non-masked > value. > > > > > > > Anyway, it would seem easily at the point where I should comment > > on your > > > repository rather than in the mailing list! > > > > To make further progress on this encapsulation idea I need a more > > complete ducktype to pass into MaskedArray to test, so that's what > I'll > > work on next, when I have time. I'll either try to finish my > > ArrayCollection type, or try making a simple NDunit ducktype > > piggybacking on astropy's Unit. > > > > Best, > > Allan > > > > > > > > > > All the best, > > > > > > Marten > > > > > > > > > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane > > <allanhald...@gmail.com <mailto:allanhald...@gmail.com> > > > <mailto:allanhald...@gmail.com <mailto:allanhald...@gmail.com>>> > > wrote: > > > > > > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote: > > > > > > > > > > > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane > > > <allanhald...@gmail.com <mailto:allanhald...@gmail.com> > > <mailto:allanhald...@gmail.com <mailto:allanhald...@gmail.com>> > > > > <mailto:allanhald...@gmail.com > > <mailto:allanhald...@gmail.com> <mailto:allanhald...@gmail.com > > <mailto:allanhald...@gmail.com>>>> > > > wrote: > > > > <snip> > > > > > > > > > This may be too much to ask from the initializer, but, > if > > > so, it still > > > > > seems most useful if it is made as easy as possible to > do, > > > say, `class > > > > > MaskedQuantity(Masked, Quantity): <very few > overrides>`. > > > > > > > > Currently MaskedArray does not accept ducktypes as > > underlying > > > arrays, > > > > but I think it shouldn't be too hard to modify it to do > so. > > > Good idea! > > > > > > > > > > > > Looking back at my trial, I see that I also never got to > > duck arrays - > > > > only ndarray subclasses - though I tried to make the code as > > > agnostic as > > > > possible. > > > > > > > > (Trial at > > > > > > > > > > https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1 > ) > > > > > > > > I already partly navigated this mixin-issue in the > > > > "MaskedArrayCollection" class, which essentially does > > > > ArrayCollection(MaskedArray(array)), and only takes > about 30 > > > lines of > > > > boilerplate. That's the backwards encapsulation order > from > > > what you want > > > > though. > > > > > > > > > > > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * > > u.m, > > > > mask=[True, False, False])` does indeed not have a `.unit` > > attribute > > > > (and cannot represent itself...); I'm not at all sure that my > > > method of > > > > just creating a mixed class is anything but a recipe for > > disaster, > > > though! > > > > > > Based on your suggestion I worked on this a little today, and > > now my > > > MaskedArray more easily encapsulates both ducktypes and ndarray > > > subclasses (pushed to repo). Here's an example I got working > > with masked > > > units using unyt: > > > > > > [1]: from MaskedArray import X, MaskedArray, MaskedScalar > > > > > > [2]: from unyt import m, km > > > > > > [3]: import numpy as np > > > > > > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0]) > > > > > > [5]: uarr > > > > > > MaskedArray([1., X , 3.]) > > > [6]: uarr + 1*m > > > > > > MaskedArray([1.001, X , 3.001]) > > > [7]: uarr.filled() > > > > > > unyt_array([1., 0., 3.], 'km') > > > [8]: np.concatenate([uarr, 2*uarr]).filled() > > > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)') > > > > > > The catch is the ducktype/subclass has to rigorously follow > > numpy's > > > indexing rules, including distinguishing 0d arrays from > > scalars. For now > > > only I used unyt in the example above since it happens to be > > less strict > > > about dimensionless operations than astropy.units which trips > > up my > > > repr code. (see below for example with astropy.units). Note in > > the last > > > line I lost the dimensions, but that is because unyt does not > > handle > > > np.concatenate. To get that to work we need a true ducktype > > for units. > > > > > > The example above doesn't expose the ".units" attribute > > outside the > > > MaskedArray, and it doesn't print the units in the repr. But > > you can > > > access them using "filled". > > > > > > While I could make MaskedArray forward unknown attribute > > accesses to the > > > encapsulated array, that seems a bit dangerous/bug-prone at > first > > > glance, so probably I want to require the user to make a > > MaskedArray > > > subclass to do so. I've just started playing with that > > (probably buggy), > > > and Ive attached subclass examples for astropy.unit and unyt, > > with some > > > example output below. > > > > > > Cheers, > > > Allan > > > > > > > > > > > > Example using the attached astropy unit subclass: > > > > > > >>> from astropy.units import m, km, s > > > >>> uarr = MaskedQ(np.ones(3), units=km, mask=[0,1,0]) > > > >>> uarr > > > MaskedQ([1., X , 1.], units=km) > > > >>> uarr.units > > > km > > > >>> uarr + (1*m) > > > MaskedQ([1.001, X , 1.001], units=km) > > > >>> uarr/(1*s) > > > MaskedQ([1., X , 1.], units=km / s) > > > >>> (uarr*(1*m))[1:] > > > MaskedQ([X , 1.], units=km m) > > > >>> np.add.outer(uarr, uarr) > > > MaskedQ([[2., X , 2.], > > > [X , X , X ], > > > [2., X , 2.]], units=km) > > > >>> print(uarr) > > > [1. X 1.] km m > > > > > > Cheers, > > > Allan > > > > > > > > > > > Even if this impossible, I think it is conceptually > useful > > > to think > > > > > about what the masking class should do. My sense is > that, > > > e.g., it > > > > > should not attempt to decide when an operation > > succeeds or not, > > > > but just > > > > > "or together" input masks for regular, multiple-input > > functions, > > > > and let > > > > > the underlying arrays skip elements for reductions by > > using > > > `where` > > > > > (hey, I did implement that for a reason... ;-). In > > > particular, it > > > > > suggests one should not have things like domains and > all > > > that (I never > > > > > understood why `MaskedArray` did that). If one wants > more, > > > the class > > > > > should provide a method that updates the mask (a > sensible > > > default > > > > might > > > > > be `mask |= ~np.isfinite(result)` - here, the class > > being masked > > > > should > > > > > logically support ufuncs and functions, so it can > > decide what > > > > "isfinite" > > > > > means). > > > > > > > > I agree it would be nice to remove domains. It would > > make life > > > easier, > > > > and I could remove a lot of twiddly code! I kept it in > > for now to > > > > minimize the behavior changes from the old MaskedArray. > > > > > > > > > > > > That makes sense. Could be separated out to a > > backwards-compatibility > > > > class later. > > > > > > > > > > > > > In any case, I would think that a basic truth should > > be that > > > > everything > > > > > has a mask with a shape consistent with the data, so > > > > > 1. Each complex numbers has just one mask, and setting > > > `a.imag` with a > > > > > masked array should definitely propagate the mask. > > > > > 2. For a masked array with structured dtype, I'd > > similarly say > > > > that the > > > > > default is for a mask to have the same shape as the > array. > > > But that > > > > > something like your collection makes sense for the case > > > where one > > > > wants > > > > > to mask items in a structure. > > > > > > > > Agreed that we should have a single bool per complex or > > structured > > > > element, and the mask shape is the same as the array > shape. > > > That's how I > > > > implemented it. But there is still a problem with > > complex.imag > > > > assignment: > > > > > > > > >>> a = MaskedArray([1j, 2, X]) > > > > >>> i = a.imag > > > > >>> i[:] = MaskedArray([1, X, 1]) > > > > > > > > If we make the last line copy the mask to the original > > array, what > > > > should the real part of a[2] be? Conversely, if we don't > > copy > > > the mask, > > > > what should the imag part of a[1] be? It seems like we > might > > > "want" the > > > > masks to be OR'd instead, but then should i[2] be masked > > after > > > we just > > > > set it to 1? > > > > > > > > Ah, I see the issue now... Easiest to implement and closest > > in analogy > > > > to a regular view would be to just let it unmask a[2] (with > > > whatever is > > > > in real; user beware!). > > > > > > > > Perhaps better would be to special-case such that `imag` > > returns a > > > > read-only view of the mask. Making `imag` itself read-only > would > > > prevent > > > > possibly reasonable things like `i[np.isclose(i, 0)] = 0` - > but > > > there is > > > > no reason this should update the mask. > > > > > > > > Still, neither is really satisfactory... > > > > > > > > > > > > > > > > > p.s. I started trying to implement the above "Mixin" > > class; will > > > > try to > > > > > clean that up a bit so that at least it uses `where` > and > > > push it up. > > > > > > > > I played with "where", but didn't include it since 1.17 > > is not > > > released. > > > > To avoid duplication of effort, I've attached a diff of > > what I > > > tried. I > > > > actually get a slight slowdown of about 10% by using > > where... > > > > > > > > > > > > Your implementation is indeed quite similar to what I got in > > > > __array_ufunc__ (though one should "&" the where with ~mask). > > > > > > > > I think the main benefit is not to presume that whatever is > > underneath > > > > understands 0 or 1, i.e., avoid filling. > > > > > > > > > > > > If you make progress with the mixin, a push is welcome. I > > > imagine a > > > > problem is going to be that np.isscalar doesn't work to > > detect > > > duck > > > > scalars. > > > > > > > > I fear that in my attempts I've simply decided that only > > array scalars > > > > exist... > > > > > > > > -- Marten > > > > > > > > _______________________________________________ > > > > NumPy-Discussion mailing list > > > > NumPy-Discussion@python.org > > <mailto:NumPy-Discussion@python.org> > > <mailto:NumPy-Discussion@python.org > > <mailto:NumPy-Discussion@python.org>> > > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion@python.org > > <mailto:NumPy-Discussion@python.org> > > <mailto:NumPy-Discussion@python.org > > <mailto:NumPy-Discussion@python.org>> > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > > > > _______________________________________________ > > > NumPy-Discussion mailing list > > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org <mailto:NumPy-Discussion@python.org> > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > > > _______________________________________________ > > NumPy-Discussion mailing list > > NumPy-Discussion@python.org > > https://mail.python.org/mailman/listinfo/numpy-discussion > > > > _______________________________________________ > NumPy-Discussion mailing list > NumPy-Discussion@python.org > https://mail.python.org/mailman/listinfo/numpy-discussion >
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion