Re: [Numpy-discussion] new MaskedArray class

2019-06-22 Thread Allan Haldane

On 6/21/19 2:37 PM, Benjamin Root wrote:

Just to note, data that is masked isn't always garbage. There are plenty
of use-cases where one may want to temporarily apply a mask for a set of
computation, or possibly want to apply a series of different masks to
the data. I haven't read through this discussion deeply enough, but is
this new class going to destroy underlying masked data? and will it be
possible to swap out masks?

Cheers!
Ben Root


Indeed my implementation currently feels free to clobber the data at 
masked positions and makes no guarantees not to.


I'd like to try to support reasonable use-cases like yours though. A few 
thoughts:


First, the old np.ma.MaskedArray explicitly does not promise to preserve 
masked values, with a big warning in the docs. I can't recall the 
examples, but I remember coming across cases where clobbering happens. 
So arguably your behavior was never supported, and perhaps this means 
that no-clobber behavior is difficult to reasonably support.


Second, the old np.ma.MaskedArray avoids frequent clobbering by making 
lots of copies. Therefore, in most cases you will not lose any 
performance in my new MaskedArray relative to the old one by making an 
explicit copy yourself. I.e, is it problematic to have to do


>>> result = MaskedArray(data.copy(), trial_mask).sum()

instead of

>>> marr.mask = trial_mask
>>> result = marr.sum()

since they have similar performance?

Third, in the old np.ma.MaskedArray masked positions are very often 
"effectively" clobbered, in the sense that they are not computed. For 
example, if you do "c = a+b", and then change the mask of c, the values 
at masked position of the result of (a+b) do not correspond to the sum 
of the masked values in a and b. Thus, by "unmasking" c you are exposing 
nonsense values, which to me seems likely to cause heisenbugs.



In summary, by not making no-clobber guarantees and by strictly 
preventing exposure of nonsense values, I suspect that: 1. my new code 
is simpler and faster by avoiding lots of copies, and forces copies to 
be explicit in user code. 2. disallowing direct modification of the mask 
lowers the "API surface area" making people's MaskedArray code less 
buggy and easier to read: Exposure of nonsense values by "unmasking" is 
one less possibility to keep in mind.


Best,
Allan



On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane mailto:allanhald...@gmail.com>> wrote:

On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> Hi Allan,
>
> This is very impressive! I could get the tests that I wrote for my
class
> pass with yours using Quantity with what I would consider very minimal
> changes. I only could not find a good way to unmask data (I like the
> idea of setting the mask on some elements via `ma[item] = X`); is this
> on purpose?

Yes, I want to make it difficult for the user to access the garbage
values under the mask, which are often clobbered values. The only way to
"remove" a masked value is by replacing it with a new non-masked value.


> Anyway, it would seem easily at the point where I should comment
on your
> repository rather than in the mailing list!

To make further progress on this encapsulation idea I need a more
complete ducktype to pass into MaskedArray to test, so that's what I'll
work on next, when I have time. I'll either try to finish my
ArrayCollection type, or try making a simple NDunit ducktype
piggybacking on astropy's Unit.

Best,
Allan


>
> All the best,
>
> Marten
>
>
> On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane
mailto:allanhald...@gmail.com>
> >>
wrote:
>
>     On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
>     >
>     >
>     > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
>     mailto:allanhald...@gmail.com>
>
>     >       wrote:
>     > 
>     >
>     >     > This may be too much to ask from the initializer, but, if
>     so, it still
>     >     > seems most useful if it is made as easy as possible to do,
>     say, `class
>     >     > MaskedQuantity(Masked, Quantity): `.
>     >
>     >     Currently MaskedArray does not accept ducktypes as
underlying
>     arrays,
>     >     but I think it shouldn't be too hard to modify it to do so.
>     Good idea!
>     >
>     >
>     > Looking back at my trial, I see that I also never got to
duck arrays -
>     > only ndarray subclasses - though I tried to make the code as
>     agnostic as
>     > possible.
>     >
>     > (Trial at
>     >
>   
 https://github.com/astropy/astropy/comp

Re: [Numpy-discussion] new MaskedArray class

2019-06-22 Thread Marten van Kerkwijk
Hi Allan,

I'm not sure I would go too much by what the old MaskedArray class did. It
indeed made an effort not to overwrite masked values with a new result,
even to the extend of copying back masked input data elements to the output
data array after an operation. But the fact that this is non-sensical if
the dtype changes (or the units in an operation on quantities) suggests
that this mental model simply does not work.

I think a sensible alternative mental model for the MaskedArray class is
that all it does is forward any operations to the data it holds and
separately propagate a mask, ORing elements together for binary operations,
etc., and explicitly skipping masked elements in reductions (ideally using
`where` to be as agnostic as possible about the underlying data, for which,
e.g., setting masked values to `0` for `np.reduce.add` may or may not be
the right thing to do - what if they are string?).

With this mental picture, the underlying data are always have well-defined
meaning: they have been operated on as if the mask did not exist. There
then is also less reason to try to avoid getting it back to the user.

As a concrete example (maybe Ben has others): in astropy we have a
sigma-clipping average routine, which uses a `MaskedArray` to iteratively
mask items that are too far off from the mean; here, the mask varies each
iteration (an initially masked element can come back into play), but the
data do not.

All the best,

Marten

On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane 
wrote:

> On 6/21/19 2:37 PM, Benjamin Root wrote:
> > Just to note, data that is masked isn't always garbage. There are plenty
> > of use-cases where one may want to temporarily apply a mask for a set of
> > computation, or possibly want to apply a series of different masks to
> > the data. I haven't read through this discussion deeply enough, but is
> > this new class going to destroy underlying masked data? and will it be
> > possible to swap out masks?
> >
> > Cheers!
> > Ben Root
>
> Indeed my implementation currently feels free to clobber the data at
> masked positions and makes no guarantees not to.
>
> I'd like to try to support reasonable use-cases like yours though. A few
> thoughts:
>
> First, the old np.ma.MaskedArray explicitly does not promise to preserve
> masked values, with a big warning in the docs. I can't recall the
> examples, but I remember coming across cases where clobbering happens.
> So arguably your behavior was never supported, and perhaps this means
> that no-clobber behavior is difficult to reasonably support.
>
> Second, the old np.ma.MaskedArray avoids frequent clobbering by making
> lots of copies. Therefore, in most cases you will not lose any
> performance in my new MaskedArray relative to the old one by making an
> explicit copy yourself. I.e, is it problematic to have to do
>
>  >>> result = MaskedArray(data.copy(), trial_mask).sum()
>
> instead of
>
>  >>> marr.mask = trial_mask
>  >>> result = marr.sum()
>
> since they have similar performance?
>
> Third, in the old np.ma.MaskedArray masked positions are very often
> "effectively" clobbered, in the sense that they are not computed. For
> example, if you do "c = a+b", and then change the mask of c, the values
> at masked position of the result of (a+b) do not correspond to the sum
> of the masked values in a and b. Thus, by "unmasking" c you are exposing
> nonsense values, which to me seems likely to cause heisenbugs.
>
>
> In summary, by not making no-clobber guarantees and by strictly
> preventing exposure of nonsense values, I suspect that: 1. my new code
> is simpler and faster by avoiding lots of copies, and forces copies to
> be explicit in user code. 2. disallowing direct modification of the mask
> lowers the "API surface area" making people's MaskedArray code less
> buggy and easier to read: Exposure of nonsense values by "unmasking" is
> one less possibility to keep in mind.
>
> Best,
> Allan
>
>
> > On Thu, Jun 20, 2019 at 12:44 PM Allan Haldane  > > wrote:
> >
> > On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> > > Hi Allan,
> > >
> > > This is very impressive! I could get the tests that I wrote for my
> > class
> > > pass with yours using Quantity with what I would consider very
> minimal
> > > changes. I only could not find a good way to unmask data (I like
> the
> > > idea of setting the mask on some elements via `ma[item] = X`); is
> this
> > > on purpose?
> >
> > Yes, I want to make it difficult for the user to access the garbage
> > values under the mask, which are often clobbered values. The only
> way to
> > "remove" a masked value is by replacing it with a new non-masked
> value.
> >
> >
> > > Anyway, it would seem easily at the point where I should comment
> > on your
> > > repository rather than in the mailing list!
> >
> > To make further progress on this encapsulation idea I need a more
> > complete duckt

Re: [Numpy-discussion] new MaskedArray class

2019-06-22 Thread Benjamin Root
"""Third, in the old np.ma.MaskedArray masked positions are very often
"effectively" clobbered, in the sense that they are not computed. For
example, if you do "c = a+b", and then change the mask of c"""

My use-cases don't involve changing the mask of "c". It would involve
changing the mask of "a" or "b" after I have calculated "c", so that I
could calculate "d". As a fairly simple example, I frequently work with
satellite data. We have multiple masks, such as water, vegetation, sandy
loam, bare rock, etc. The underlying satellite data in any of these places
isn't bad, they just need to be dealt with differently.  I wouldn't want
the act of applying a mask for a set of calculations on things that aren't
bare rock to mess up my subsequent calculation on things that aren't water.
Right now, I have to handle this explicitly with flattened sparse arrays,
which makes visualization and conception difficult.

Ben Root

On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Hi Allan,
>
> I'm not sure I would go too much by what the old MaskedArray class did. It
> indeed made an effort not to overwrite masked values with a new result,
> even to the extend of copying back masked input data elements to the output
> data array after an operation. But the fact that this is non-sensical if
> the dtype changes (or the units in an operation on quantities) suggests
> that this mental model simply does not work.
>
> I think a sensible alternative mental model for the MaskedArray class is
> that all it does is forward any operations to the data it holds and
> separately propagate a mask, ORing elements together for binary operations,
> etc., and explicitly skipping masked elements in reductions (ideally using
> `where` to be as agnostic as possible about the underlying data, for which,
> e.g., setting masked values to `0` for `np.reduce.add` may or may not be
> the right thing to do - what if they are string?).
>
> With this mental picture, the underlying data are always have well-defined
> meaning: they have been operated on as if the mask did not exist. There
> then is also less reason to try to avoid getting it back to the user.
>
> As a concrete example (maybe Ben has others): in astropy we have a
> sigma-clipping average routine, which uses a `MaskedArray` to iteratively
> mask items that are too far off from the mean; here, the mask varies each
> iteration (an initially masked element can come back into play), but the
> data do not.
>
> All the best,
>
> Marten
>
> On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane 
> wrote:
>
>> On 6/21/19 2:37 PM, Benjamin Root wrote:
>> > Just to note, data that is masked isn't always garbage. There are plenty
>> > of use-cases where one may want to temporarily apply a mask for a set of
>> > computation, or possibly want to apply a series of different masks to
>> > the data. I haven't read through this discussion deeply enough, but is
>> > this new class going to destroy underlying masked data? and will it be
>> > possible to swap out masks?
>> >
>> > Cheers!
>> > Ben Root
>>
>> Indeed my implementation currently feels free to clobber the data at
>> masked positions and makes no guarantees not to.
>>
>> I'd like to try to support reasonable use-cases like yours though. A few
>> thoughts:
>>
>> First, the old np.ma.MaskedArray explicitly does not promise to preserve
>> masked values, with a big warning in the docs. I can't recall the
>> examples, but I remember coming across cases where clobbering happens.
>> So arguably your behavior was never supported, and perhaps this means
>> that no-clobber behavior is difficult to reasonably support.
>>
>> Second, the old np.ma.MaskedArray avoids frequent clobbering by making
>> lots of copies. Therefore, in most cases you will not lose any
>> performance in my new MaskedArray relative to the old one by making an
>> explicit copy yourself. I.e, is it problematic to have to do
>>
>>  >>> result = MaskedArray(data.copy(), trial_mask).sum()
>>
>> instead of
>>
>>  >>> marr.mask = trial_mask
>>  >>> result = marr.sum()
>>
>> since they have similar performance?
>>
>> Third, in the old np.ma.MaskedArray masked positions are very often
>> "effectively" clobbered, in the sense that they are not computed. For
>> example, if you do "c = a+b", and then change the mask of c, the values
>> at masked position of the result of (a+b) do not correspond to the sum
>> of the masked values in a and b. Thus, by "unmasking" c you are exposing
>> nonsense values, which to me seems likely to cause heisenbugs.
>>
>>
>> In summary, by not making no-clobber guarantees and by strictly
>> preventing exposure of nonsense values, I suspect that: 1. my new code
>> is simpler and faster by avoiding lots of copies, and forces copies to
>> be explicit in user code. 2. disallowing direct modification of the mask
>> lowers the "API surface area" making people's MaskedArray code less
>> buggy and easier to read: Exposure of nonsense v

Re: [Numpy-discussion] new MaskedArray class

2019-06-22 Thread Stephan Hoyer
On Thu, Jun 20, 2019 at 7:44 PM Allan Haldane 
wrote:

> On 6/19/19 10:19 PM, Marten van Kerkwijk wrote:
> > Hi Allan,
> >
> > This is very impressive! I could get the tests that I wrote for my class
> > pass with yours using Quantity with what I would consider very minimal
> > changes. I only could not find a good way to unmask data (I like the
> > idea of setting the mask on some elements via `ma[item] = X`); is this
> > on purpose?
>
> Yes, I want to make it difficult for the user to access the garbage
> values under the mask, which are often clobbered values. The only way to
> "remove" a masked value is by replacing it with a new non-masked value.
>

I think we should make it possible to access (and even mutate) data under
the mask directly, while noting the lack of any guarantees about what those
values are.

MaskedArray has a minimal and transparent data model, consisting of data
and mask arrays. There are plenty of use cases where it is convenient to
access the underlying arrays directly, e.g., for efficient implementation
of low-level MaskedArray algorithms.

NumPy itself does a similar thing on ndarray by exposing data/strides.
Advanced users who learn the details of the data model find them useful,
and everyone else ignores them.


>
> > Anyway, it would seem easily at the point where I should comment on your
> > repository rather than in the mailing list!
>
> To make further progress on this encapsulation idea I need a more
> complete ducktype to pass into MaskedArray to test, so that's what I'll
> work on next, when I have time. I'll either try to finish my
> ArrayCollection type, or try making a simple NDunit ducktype
> piggybacking on astropy's Unit.
>

dask.array would be another good example to try. I think it already should
support __array_function__ (and if not it should be very easy to add).


> Best,
> Allan
>
>
> >
> > All the best,
> >
> > Marten
> >
> >
> > On Wed, Jun 19, 2019 at 5:45 PM Allan Haldane  > > wrote:
> >
> > On 6/18/19 2:04 PM, Marten van Kerkwijk wrote:
> > >
> > >
> > > On Tue, Jun 18, 2019 at 12:55 PM Allan Haldane
> > mailto:allanhald...@gmail.com>
> > > >>
> > wrote:
> > > 
> > >
> > > > This may be too much to ask from the initializer, but, if
> > so, it still
> > > > seems most useful if it is made as easy as possible to do,
> > say, `class
> > > > MaskedQuantity(Masked, Quantity): `.
> > >
> > > Currently MaskedArray does not accept ducktypes as underlying
> > arrays,
> > > but I think it shouldn't be too hard to modify it to do so.
> > Good idea!
> > >
> > >
> > > Looking back at my trial, I see that I also never got to duck
> arrays -
> > > only ndarray subclasses - though I tried to make the code as
> > agnostic as
> > > possible.
> > >
> > > (Trial at
> > >
> >
> https://github.com/astropy/astropy/compare/master...mhvk:utils-masked-class?expand=1
> )
> > >
> > > I already partly navigated this mixin-issue in the
> > > "MaskedArrayCollection" class, which essentially does
> > > ArrayCollection(MaskedArray(array)), and only takes about 30
> > lines of
> > > boilerplate. That's the backwards encapsulation order from
> > what you want
> > > though.
> > >
> > >
> > > Yes, indeed, from a quick trial `MaskedArray(np.arange(3.) * u.m,
> > > mask=[True, False, False])` does indeed not have a `.unit`
> attribute
> > > (and cannot represent itself...); I'm not at all sure that my
> > method of
> > > just creating a mixed class is anything but a recipe for disaster,
> > though!
> >
> > Based on your suggestion I worked on this a little today, and now my
> > MaskedArray more easily encapsulates both ducktypes and ndarray
> > subclasses (pushed to repo). Here's an example I got working with
> masked
> > units using unyt:
> >
> > [1]: from MaskedArray import X, MaskedArray, MaskedScalar
> >
> > [2]: from unyt import m, km
> >
> > [3]: import numpy as np
> >
> > [4]: uarr = MaskedArray([1., 2., 3.]*km, mask=[0,1,0])
> >
> > [5]: uarr
> >
> > MaskedArray([1., X , 3.])
> > [6]: uarr + 1*m
> >
> > MaskedArray([1.001, X, 3.001])
> > [7]: uarr.filled()
> >
> > unyt_array([1., 0., 3.], 'km')
> > [8]: np.concatenate([uarr, 2*uarr]).filled()
> > unyt_array([1., 0., 3., 2., 0., 6.], '(dimensionless)')
> >
> > The catch is the ducktype/subclass has to rigorously follow numpy's
> > indexing rules, including distinguishing 0d arrays from scalars. For
> now
> > only I used unyt in the example above since it happens to be less
> strict
> >  about dimensionless operations than astropy.units which trips up my
> > repr code. (see below for example with astropy.units). Note in the
> last
> > line I