Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Stephan Hoyer
On Sat, Jun 22, 2019 at 6:50 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Hi Allan,
>
> I'm not sure I would go too much by what the old MaskedArray class did. It
> indeed made an effort not to overwrite masked values with a new result,
> even to the extent of copying back masked input data elements to the output
> data array after an operation. But the fact that this is nonsensical if
> the dtype changes (or the units in an operation on quantities) suggests
> that this mental model simply does not work.
>
> I think a sensible alternative mental model for the MaskedArray class is
> that all it does is forward any operations to the data it holds and
> separately propagate a mask, ORing elements together for binary operations,
> etc., and explicitly skipping masked elements in reductions (ideally using
> `where` to be as agnostic as possible about the underlying data, for which,
> e.g., setting masked values to `0` for `np.add.reduce` may or may not be
> the right thing to do - what if they are strings?).
>

+1, this sounds like the right model to me.
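As a rough sketch in plain NumPy (the helper names are hypothetical, for illustration only), that model might look like:

```python
import numpy as np

# Hypothetical sketch of the proposed model: forward the operation to the
# data, OR the masks for binary operations, and use `where` to skip
# masked elements in reductions.
def masked_add(a, mask_a, b, mask_b):
    # Binary op: operate on the data, OR the masks.
    return a + b, mask_a | mask_b

def masked_sum(data, mask):
    # Reduction: skip masked elements via `where`; `initial` supplies the
    # identity so a fully masked slice still has a defined result.
    return np.add.reduce(data, where=~mask, initial=0)

data = np.array([1.0, 2.0, 3.0])
mask = np.array([False, True, False])
print(masked_sum(data, mask))  # 4.0
```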

That said, I would still not guarantee values under the mask as part of
NumPy's API. The result of computations under the mask should be considered
an undefined implementation detail, sort of like integer overflow or dict
iteration order pre-Python 3.7. The values may even be entirely arbitrary,
e.g., in cases where the result is preallocated with empty().

I'm less confident about the right way to handle missing elements in
reductions. For example:
- Should median() also skip missing elements, even though there is no
identity element?
- If reductions/aggregations default to skipping missing elements, how is
it possible to express "NA propagating" versions, which are also useful,
if slightly less common?

We may want to add a standard "skipna" argument on NumPy aggregations,
solely for the benefit of duck arrays (and dtypes with missing values). But
that could also be a source of confusion, especially if skipna=True refers
only "true NA" values, not including NaN, which is used as an alias for NA
in pandas and elsewhere.
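For comparison, the sentinel-NaN version of this skip-vs-propagate distinction already exists in NumPy itself:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])
print(np.sum(x))     # nan -- the missing value propagates
print(np.nansum(x))  # 3.0 -- the missing value is skipped ("skipna")
```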

> With this mental picture, the underlying data always have a well-defined
> meaning: they have been operated on as if the mask did not exist. There
> is then also less reason to try to avoid getting them back to the user.
>
> As a concrete example (maybe Ben has others): in astropy we have a
> sigma-clipping average routine, which uses a `MaskedArray` to iteratively
> mask items that are too far off from the mean; here, the mask varies each
> iteration (an initially masked element can come back into play), but the
> data do not.
>
> All the best,
>
> Marten
>
> On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane 
> wrote:
>
>> On 6/21/19 2:37 PM, Benjamin Root wrote:
>> > Just to note, data that is masked isn't always garbage. There are plenty
>> > of use-cases where one may want to temporarily apply a mask for a set of
>> > computations, or possibly want to apply a series of different masks to
>> > the data. I haven't read through this discussion deeply enough, but is
>> > this new class going to destroy underlying masked data? and will it be
>> > possible to swap out masks?
>> >
>> > Cheers!
>> > Ben Root
>>
>> Indeed my implementation currently feels free to clobber the data at
>> masked positions and makes no guarantees not to.
>>
>> I'd like to try to support reasonable use-cases like yours though. A few
>> thoughts:
>>
>> First, the old np.ma.MaskedArray explicitly does not promise to preserve
>> masked values, with a big warning in the docs. I can't recall the
>> examples, but I remember coming across cases where clobbering happens.
>> So arguably your behavior was never supported, and perhaps this means
>> that no-clobber behavior is difficult to reasonably support.
>>
>> Second, the old np.ma.MaskedArray avoids frequent clobbering by making
>> lots of copies. Therefore, in most cases you will not lose any
>> performance in my new MaskedArray relative to the old one by making an
>> explicit copy yourself. I.e., is it problematic to have to do
>>
>>  >>> result = MaskedArray(data.copy(), trial_mask).sum()
>>
>> instead of
>>
>>  >>> marr.mask = trial_mask
>>  >>> result = marr.sum()
>>
>> since they have similar performance?
>>
>> Third, in the old np.ma.MaskedArray masked positions are very often
>> "effectively" clobbered, in the sense that they are not computed. For
>> example, if you do "c = a+b", and then change the mask of c, the values
>> at masked position of the result of (a+b) do not correspond to the sum
>> of the masked values in a and b. Thus, by "unmasking" c you are exposing
>> nonsense values, which to me seems likely to cause heisenbugs.
>>
>>
>> In summary, by not making no-clobber guarantees and by strictly
>> preventing exposure of nonsense values, I suspect that: 1. my new code
>> is simpler and faster by avoiding lots of copies, and forces copies to
>> be explicit in user code. 2. disallowi

Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Marten van Kerkwijk
>> I think a sensible alternative mental model for the MaskedArray class is
>> that all it does is forward any operations to the data it holds and
>> separately propagate a mask, ORing elements together for binary operations,
>> etc., and explicitly skipping masked elements in reductions (ideally using
>> `where` to be as agnostic as possible about the underlying data, for which,
>> e.g., setting masked values to `0` for `np.add.reduce` may or may not be
>> the right thing to do - what if they are strings?).
>>
>
> +1, this sounds like the right model to me.
>

One small worry is about name clashes - ideally one wants the masked array
to be somewhat of a drop-in for whatever class it is masking (independently
of whether it is an actual subclass of it). In this respect, `.data` is a
pretty terrible name (and, yes, the cause of lots of problems for astropy's
MaskedColumn - not my fault, that one!). In my own trials, thinking that
names that include "mask" are fair game, I've been considering a function
`.unmask(fill_value=None)` which would replace both `.filled(fill_value)`
and `.data` by having the default be not to fill anything (I don't see why
a masked array should carry a fill value along; one might use specific
strings such as 'minmax' for auto-generated cases). If wanted, one could
then add `unmasked = property(unmask)`.
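A minimal sketch of that idea (the class is a toy; `.unmask()` and `.unmasked` are the proposed names, not an existing NumPy API):

```python
import numpy as np

class MaskedSketch:
    """Toy container illustrating the proposed `.unmask()` API."""

    def __init__(self, values, mask):
        self._values = np.asarray(values)
        self.mask = np.asarray(mask, dtype=bool)

    def unmask(self, fill_value=None):
        # Default: hand back the underlying data without filling anything.
        if fill_value is None:
            return self._values
        out = self._values.copy()
        out[self.mask] = fill_value
        return out

    unmasked = property(unmask)

m = MaskedSketch([1.0, 2.0, 3.0], [False, True, False])
print(m.unmasked)                # [1. 2. 3.]
print(m.unmask(fill_value=0.0))  # [1. 0. 3.]
```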

Aside: my sense is to, at first at least, feel as unbound as possible from
the current MaskedArray - one could then use whatever it is to try to
create something that comes close to reproducing it, but only for ease of
transition.


> That said, I would still not guarantee values under the mask as part of
> NumPy's API. The result of computations under the mask should be considered
> an undefined implementation detail, sort of like integer overflow or dict
> iteration order pre-Python 3.7. The values may even be entirely arbitrary,
> e.g., in cases where the result is preallocated with empty().
>

I think that is reasonable. The use cases Ben and I described both are ones
where the array is being used as input for a set of computations which
differ only in their mask. (Admittedly, in both our cases one could just
reinitialize a masked array with the new mask; but I think we share the
mental model that if I don't operate on the masked array, the data
doesn't change, so I should just be able to change the mask.)


> I'm less confident about the right way to handle missing elements in
> reductions. For example:
> - Should median() also skip missing elements, even though there is no
> identity element?
>

I think so. If for mean(), std(), etc., the number of unmasked elements
comes into play, I don't see why it wouldn't for median().
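The existing nan-functions already make that choice for median, for what it's worth:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan])
print(np.median(x))     # nan -- the missing element propagates
print(np.nanmedian(x))  # 1.5 -- the missing element is skipped
```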

> - If reductions/aggregations default to skipping missing elements, how is
> it possible to express "NA propagating" versions, which are also useful,
> if slightly less common?
>

I have been playing with using a new `Mask(np.ndarray)` class for the mask,
which does the actual mask propagation (i.e., all single-operand ufuncs
just copy the mask, binary operations do `logical_or` and reductions do
`np.logical_and.reduce`). This way the `Masked` class itself can generally
apply a given operation on the data and the masks separately and then
combine the two results (reductions are the exception in that `where` has
to be set). Your particular example here could be solved with a different
`Mask` class, for which reductions do `np.logical_or.reduce`.
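The two propagation rules could be sketched as plain functions (names are hypothetical; in this design they would live on different `Mask` classes):

```python
import numpy as np

def binary_mask(mask_a, mask_b):
    # Binary ops: the result is masked where either operand is masked.
    return np.logical_or(mask_a, mask_b)

def reduce_mask_skipping(mask, axis=None):
    # Skipping reductions: masked only if *all* contributing elements are.
    return np.logical_and.reduce(mask, axis=axis)

def reduce_mask_propagating(mask, axis=None):
    # "NA propagating" reductions: masked if *any* contributing element is.
    return np.logical_or.reduce(mask, axis=axis)

m = np.array([False, True, False])
print(reduce_mask_skipping(m))     # False
print(reduce_mask_propagating(m))  # True
```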

A larger issue is the accumulations. Personally, I think those are
basically meaningless for masked arrays, as to me logically the result on
the position of any masked item should be masked. But, if so, I do not see
how the ones "beyond" it could not be masked as well. Since here the right
answer seems at least unclear, my sense would be to refuse the temptation
to guess (i.e., the user should just explicitly fill with ufunc.identity if
this is the right thing to do in their case).
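Explicitly filling with the ufunc identity before accumulating, as suggested, would then be the user's responsibility:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
mask = np.array([False, True, False])

# The user decides that 0 (np.add's identity) is the right fill here.
filled = np.where(mask, 0.0, data)
print(np.cumsum(filled))  # [1. 1. 4.]
```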

I should add that I'm slightly torn about a similar, somewhat related
issue: what should `np.minimum(a, b)` do for the case where either a or b
is masked? Currently, one just treats this as a bin-op, so the result is
masked, but one could argue that this ufunc is a bit like a 2-element
reduction, and thus that the unmasked item should "win by default".
Possibly, the answer should be different between `np.minimum` and `np.fmin`
(since the two differ in how they propagate `NaN` as well - note that you
don't include `fmin` and `fmax` in your coverage).
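For reference, the existing NaN behavior of the two ufuncs:

```python
import numpy as np

print(np.minimum(1.0, np.nan))  # nan -- NaN propagates, like a binary op
print(np.fmin(1.0, np.nan))     # 1.0 -- the non-NaN operand "wins"
```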

> We may want to add a standard "skipna" argument on NumPy aggregations,
> solely for the benefit of duck arrays (and dtypes with missing values). But
> that could also be a source of confusion, especially if skipna=True refers
> only "true NA" values, not including NaN, which is used as an alias for NA
> in pandas and elsewhere.
>

It does seem `where` should suffice, no? If one wants to be super-fancy, we
could allow it to be a callable, which, if a ufunc, gets used inside the
loop (`where=np.isfinite` would be particularly useful).

Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Aldcroft, Thomas
On Sat, Jun 22, 2019 at 11:51 AM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Hi Allan,
>
> I'm not sure I would go too much by what the old MaskedArray class did. It
> indeed made an effort not to overwrite masked values with a new result,
> even to the extent of copying back masked input data elements to the output
> data array after an operation. But the fact that this is nonsensical if
> the dtype changes (or the units in an operation on quantities) suggests
> that this mental model simply does not work.
>
> I think a sensible alternative mental model for the MaskedArray class is
> that all it does is forward any operations to the data it holds and
> separately propagate a mask,
>

I'm generally on-board with that mental picture, and agree that the
use-case described by Ben (different layers of satellite imagery) is
important.  Same thing happens in astronomy data, e.g. you have a CCD image
of the sky and there are cosmic rays that contaminate the image.  Those are
not garbage data, just pixels that one wants to ignore in some, but not
all, contexts.

However, it's worth noting that one cannot blindly forward any operations
to the data it holds since the operation may be illegal on that data.  The
simplest example is dividing `a / b` where  `b` has data values of 0 but
they are masked.  That operation should succeed with no exception, and here
the resultant value under the mask is genuinely garbage.

The current MaskedArray seems a bit inconsistent in dealing with invalid
calculations.  Dividing by 0 (if masked) is no problem and returns the
numerator.  Taking the log of a masked 0 gives the usual divide by zero
RuntimeWarning and puts a 1.0 under the mask of the output.

Perhaps the expression should not even be evaluated on elements where the
output mask is True, and all the masked output data values should be set to
a predictable value (e.g. zero for numerical, zero-length string for
string, or maybe a default fill value).  That at least provides consistent
and predictable behavior that is simple to explain.  Otherwise the story is
that the data under the mask *might* be OK, unless for a particular element
the computation was invalid in which case it is filled with some arbitrary
value.  I think that is actually an error-prone behavior that should be
avoided.

- Tom


> ORing elements together for binary operations, etc., and explicitly
> skipping masked elements in reductions (ideally using `where` to be as
> agnostic as possible about the underlying data, for which, e.g., setting
> masked values to `0` for `np.add.reduce` may or may not be the right thing
> to do - what if they are strings?).
>
> With this mental picture, the underlying data always have a well-defined
> meaning: they have been operated on as if the mask did not exist. There
> is then also less reason to try to avoid getting them back to the user.
>
> As a concrete example (maybe Ben has others): in astropy we have a
> sigma-clipping average routine, which uses a `MaskedArray` to iteratively
> mask items that are too far off from the mean; here, the mask varies each
> iteration (an initially masked element can come back into play), but the
> data do not.
>
> All the best,
>
> Marten
>
> On Sat, Jun 22, 2019 at 10:54 AM Allan Haldane 
> wrote:
>
>> On 6/21/19 2:37 PM, Benjamin Root wrote:
>> > Just to note, data that is masked isn't always garbage. There are plenty
>> > of use-cases where one may want to temporarily apply a mask for a set of
>> > computations, or possibly want to apply a series of different masks to
>> > the data. I haven't read through this discussion deeply enough, but is
>> > this new class going to destroy underlying masked data? and will it be
>> > possible to swap out masks?
>> >
>> > Cheers!
>> > Ben Root
>>
>> Indeed my implementation currently feels free to clobber the data at
>> masked positions and makes no guarantees not to.
>>
>> I'd like to try to support reasonable use-cases like yours though. A few
>> thoughts:
>>
>> First, the old np.ma.MaskedArray explicitly does not promise to preserve
>> masked values, with a big warning in the docs. I can't recall the
>> examples, but I remember coming across cases where clobbering happens.
>> So arguably your behavior was never supported, and perhaps this means
>> that no-clobber behavior is difficult to reasonably support.
>>
>> Second, the old np.ma.MaskedArray avoids frequent clobbering by making
>> lots of copies. Therefore, in most cases you will not lose any
>> performance in my new MaskedArray relative to the old one by making an
>> explicit copy yourself. I.e., is it problematic to have to do
>>
>>  >>> result = MaskedArray(data.copy(), trial_mask).sum()
>>
>> instead of
>>
>>  >>> marr.mask = trial_mask
>>  >>> result = marr.sum()
>>
>> since they have similar performance?
>>
>> Third, in the old np.ma.MaskedArray masked positions are very often
>> "effectively" clobbered, in the sense that they are not computed. For
>> example, if you do "c = a+b", and then change the mask of c, the values
>> at masked position of the result of (a+b) do not correspond to the sum
>> of the masked values in a and b. Thus, by "unmasking" c you are exposing
>> nonsense values, which to me seems likely to cause heisenbugs.

Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Marten van Kerkwijk
Hi Tom,


>> I think a sensible alternative mental model for the MaskedArray class is
>> that all it does is forward any operations to the data it holds and
>> separately propagate a mask,
>>
>
> I'm generally on-board with that mental picture, and agree that the
> use-case described by Ben (different layers of satellite imagery) is
> important.  Same thing happens in astronomy data, e.g. you have a CCD image
> of the sky and there are cosmic rays that contaminate the image.  Those are
> not garbage data, just pixels that one wants to ignore in some, but not
> all, contexts.
>
> However, it's worth noting that one cannot blindly forward any operations
> to the data it holds since the operation may be illegal on that data.  The
> simplest example is dividing `a / b` where  `b` has data values of 0 but
> they are masked.  That operation should succeed with no exception, and here
> the resultant value under the mask is genuinely garbage.
>

Even in the present implementation, the operation is just forwarded, with
numpy errstate set to ignore all errors. And then after the fact some
half-hearted remediation is done.
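In plain NumPy terms, that forwarding amounts to something like (illustrative values):

```python
import numpy as np

a = np.array([1.0, 4.0])
b = np.array([0.0, 2.0])
mask = np.array([True, False])  # the zero divisor is masked

with np.errstate(all='ignore'):  # forward the op, suppressing warnings
    c = a / b                    # c[0] is inf: garbage under the mask

print(c[1])  # 2.0 -- the unmasked element is fine
```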


> The current MaskedArray seems a bit inconsistent in dealing with invalid
> calculations.  Dividing by 0 (if masked) is no problem and returns the
> numerator.  Taking the log of a masked 0 gives the usual divide by zero
> RuntimeWarning and puts a 1.0 under the mask of the output.
>
> Perhaps the expression should not even be evaluated on elements where the
> output mask is True, and all the masked output data values should be set to
> a predictable value (e.g. zero for numerical, zero-length string for
> string, or maybe a default fill value).  That at least provides consistent
> and predictable behavior that is simple to explain.  Otherwise the story is
> that the data under the mask *might* be OK, unless for a particular element
> the computation was invalid in which case it is filled with some arbitrary
> value.  I think that is actually an error-prone behavior that should be
> avoided.
>

I think I agree with Allan here, that after a computation, one generally
simply cannot safely assume anything for masked elements.

But it is reasonable for subclasses to define what they want to do
"post-operation"; e.g., for numerical arrays, it might make generally make
sense to do
```
notok = ~np.isfinite(result)
mask |= notok
```
and one could then also do
```
result[notok] = fill_value
```

But I think one might want to leave that to the user.
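A runnable version of the snippet, with an illustrative fill_value:

```python
import numpy as np

data = np.array([1.0, 0.0, 4.0])
mask = np.array([False, True, False])
fill_value = 0.0

with np.errstate(all='ignore'):
    result = np.log(data)  # log(0) -> -inf at the masked position

notok = ~np.isfinite(result)
mask = mask | notok
result[notok] = fill_value  # masked slot now holds a predictable value

print(result[1], mask.tolist())  # 0.0 [False, True, False]
```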

All the best,

Marten
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Stephan Hoyer
On Sun, Jun 23, 2019 at 4:07 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> - If reductions/aggregations default to skipping missing elements, how is
>> it possible to express "NA propagating" versions, which are also useful,
>> if slightly less common?
>>
>
> I have been playing with using a new `Mask(np.ndarray)` class for the
> mask, which does the actual mask propagation (i.e., all single-operand
> ufuncs just copy the mask, binary operations do `logical_or` and reductions
> do `np.logical_and.reduce`). This way the `Masked` class itself can generally
> apply a given operation on the data and the masks separately and then
> combine the two results (reductions are the exception in that `where` has
> to be set). Your particular example here could be solved with a different
> `Mask` class, for which reductions do `np.logical_or.reduce`.
>

I think it would be much better to use duck-typing for the mask as well, if
possible, rather than a NumPy array subclass. This would facilitate using
alternative mask implementations, e.g., distributed masks, sparse masks,
bit-array masks, etc.

Are there use-cases for propagating masks separately from data? If not, it
might make sense to only define mask operations along with data, which
could be much simpler.


>> We may want to add a standard "skipna" argument on NumPy aggregations,
>> solely for the benefit of duck arrays (and dtypes with missing values). But
>> that could also be a source of confusion, especially if skipna=True refers
>> only "true NA" values, not including NaN, which is used as an alias for NA
>> in pandas and elsewhere.
>>
>
> It does seem `where` should suffice, no? If one wants to be super-fancy,
> we could allow it to be a callable, which, if a ufunc, gets used inside the
> loop (`where=np.isfinite` would be particularly useful).
>

Let me try to make the API issue more concrete. Suppose we have a
MaskedArray with values [1, 2, NA]. How do I get:
1. The sum ignoring masked values, i.e., 3.
2. The sum that is tainted by masked values, i.e., NA.

Here's how this works with existing array libraries:
- With base NumPy using NaN as a sentinel value for NA, you can get (1)
with np.sum and (2) with np.nansum.
- With pandas and xarray, the default behavior is (1) and to get (2) you
need to write array.sum(skipna=False).
- With NumPy's current MaskedArray, it appears that you can only get (1).
Maybe there isn't as strong a need for (2) as I thought?

Your proposal would be something like np.sum(array,
where=np.ones_like(array))? This seems rather verbose for a common
operation. Perhaps np.sum(array, where=True) would work, making use of
broadcasting? (I haven't actually checked whether this is well-defined yet.)
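For what it's worth, a quick check against current NumPy (1.17+, where reductions gained the `where` argument) suggests the scalar form does broadcast:

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])
print(np.sum(arr, where=True))                 # 6.0 -- scalar True broadcasts
print(np.sum(arr, where=[True, False, True]))  # 4.0
```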


[Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Stewart Clelland
Hi All,

Based on discussion with Marten on github, I have a couple of
suggestions on syntax improvements on array transpose operations.

First, introducing a shorthand for the Hermitian Transpose operator. I
thought "A.HT" might be a viable candidate.

Second, adding an array method that operates like a normal transpose.
To my understanding,
"A.tranpose()" currently inverts the usual order of all dimensions. This
may be useful in some applications involving tensors, but is not what I
would usually assume a transpose on a multi-dimensional array would entail.
I suggest a syntax of "A.MT" to indicate a transpose of the last two
dimensions by default, maybe with optional arguments (i,j) to indicate
which two dimensions to transpose.
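For concreteness, the proposed `A.MT` semantics can be expressed today with `np.swapaxes` (the helper name is made up):

```python
import numpy as np

def mt(a, i=-2, j=-1):
    # Proposed "matrix transpose": swap the last two axes by default,
    # or an explicit pair of axes (i, j).
    return np.swapaxes(a, i, j)

a = np.ones((2, 3, 4))
print(mt(a).shape)        # (2, 4, 3)
print(mt(a, 0, 1).shape)  # (3, 2, 4)
```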

I'm new to this mailing list format, hopefully I'm doing this right :)

Thanks,
Stew


Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Eric Wieser
This might be contentious, but I wonder if, with a long enough deprecation
cycle, we can change the meaning of .T. That would look like:

* Emit a future warning on `more_than_2d.T` with a message like "in future
.T will transpose just the last two dimensions, not all dimensions. Use
arr.transpose() if transposing all {n} dimensions is deliberate"
* Wait 5 releases or so, see how many matches Google / GitHub has for this
warning.
* If the impact is minimal, change .T
* If the impact is large, change to a deprecation warning

An argument for this approach: a good amount of code I've seen in the wild
already assumes .T is a 2d transpose, and as a result does not work
correctly when called with stacks of arrays. Changing .T might fix this
broken code automatically.
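Concretely, the behavior the warning would target:

```python
import numpy as np

a = np.ones((2, 3, 4))
print(a.T.shape)                   # (4, 3, 2) -- .T reverses *all* axes today
print(a.transpose(0, 2, 1).shape)  # (2, 4, 3) -- the last-two-axes variant
```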

If the change would be too intrusive, then keeping the deprecation warning
at least prevents new users deliberately using .T for >2d transposes, which
is possibly valuable for readers.

Eric


On Sun, Jun 23, 2019, 12:05 Stewart Clelland 
wrote:

> Hi All,
>
> Based on discussion with Marten on github, I have a couple of
> suggestions on syntax improvements on array transpose operations.
>
> First, introducing a shorthand for the Hermitian Transpose operator. I
> thought "A.HT" might be a viable candidate.
>
> Second, adding an array method that operates like a normal transpose.
> To my understanding,
> "A.tranpose()" currently inverts the usual order of all dimensions. This
> may be useful in some applications involving tensors, but is not what I
> would usually assume a transpose on a multi-dimensional array would entail.
> I suggest a syntax of "A.MT" to indicate a transpose of the last two
> dimensions by default, maybe with optional arguments (i,j) to indicate
> which two dimensions to transpose.
>
> I'm new to this mailing list format, hopefully I'm doing this right :)
>
> Thanks,
> Stew


Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Hameer Abbasi
+1 for this. I have often seen (and sometimes written) code that does this 
automatically, and it is a common mistake.

However, we will need some way to filter for intent, as the people who write 
this code are the ones who didn’t read docs on it at the time, and so there 
might be a fair amount of noise even if it fixes their code.

I also agree that a transpose of an array with ndim > 2 doesn’t make sense 
without specifying the order, at least for the applications I have seen so far.



Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Sebastian Berg
On Sun, 2019-06-23 at 19:51 +, Hameer Abbasi wrote:
> +1 for this. I have often seen (and sometimes written) code that does
> this automatically, and it is a common mistake.

Yeah, likely worth a shot. I doubt there are many uses for the
n-dimensional axis transpose, so maybe a FutureWarning approach can work.
If not, I suppose the solution is a deprecation for ndim != 2.

Another point about the `.T` is the 1-dimensional case, which commonly
causes confusion. If we do something here, should think about that as
well.
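The 1-dimensional case in question:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
print(v.T.shape)  # (3,) -- .T is a no-op on 1-d arrays, a common surprise
```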

- Sebastian


> 
> However, we will need some way to filter for intent, as the people
> who write this code are the ones who didn’t read docs on it at the
> time, and so there might be a fair amount of noise even if it fixes
> their code.
> 
> I also agree that a transpose of an array with ndim > 2 doesn’t make
> sense without specifying the order, at least for the applications I
> have seen so far.




Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Marten van Kerkwijk
Hi Stephan,

In slightly changed order:

> Let me try to make the API issue more concrete. Suppose we have a
> MaskedArray with values [1, 2, NA]. How do I get:
> 1. The sum ignoring masked values, i.e., 3.
> 2. The sum that is tainted by masked values, i.e., NA.
>
> Here's how this works with existing array libraries:
> - With base NumPy using NaN as a sentinel value for NA, you can get (1)
> with np.sum and (2) with np.nansum.
> - With pandas and xarray, the default behavior is (1) and to get (2) you
> need to write array.sum(skipna=False).
> - With NumPy's current MaskedArray, it appears that you can only get (1).
> Maybe there isn't as strong a need for (2) as I thought?
>

I think this is all correct.

>
> Your proposal would be something like np.sum(array,
> where=np.ones_like(array))? This seems rather verbose for a common
> operation. Perhaps np.sum(array, where=True) would work, making use of
> broadcasting? (I haven't actually checked whether this is well-defined yet.)
>
I think we'd need to consider separately the operation on the mask and on
the data. In my proposal, the data would always do `np.sum(array,
where=~mask)`, while how the mask would propagate might depend on the mask
itself, i.e., we'd have different mask types for `skipna=True` (default)
and `False` ("contagious") reductions, which differed in doing
`logical_and.reduce` or `logical_or.reduce` on the mask.
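A rough sketch of that split, using plain arrays rather than an actual MaskedArray API: the data reduction always skips masked elements via `where`, while the result's mask depends on which mask type was chosen:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0])
mask = np.array([False, False, True])  # last element masked

# Data part: identical for both mask types, masked elements skipped.
result_data = np.add.reduce(data, where=~mask)  # 1.0 + 2.0

# Mask part: differs per mask type.
skipna_mask = np.logical_and.reduce(mask)     # result valid if any element is
contagious_mask = np.logical_or.reduce(mask)  # result masked if any element is
```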

I have been playing with using a new `Mask(np.ndarray)` class for the mask,
>> which does the actual mask propagation (i.e., all single-operand ufuncs
>> just copy the mask, binary operations do `logical_or` and reductions do
>> `logical_and.reduce`). This way the `Masked` class itself can generally
>> apply a given operation on the data and the masks separately and then
>> combine the two results (reductions are the exception in that `where` has
>> to be set). Your particular example here could be solved with a different
>> `Mask` class, for which reductions do `logical_or.reduce`.
>>
>
> I think it would be much better to use duck-typing for the mask as well,
> if possible, rather than a NumPy array subclass. This would facilitate
> using alternative mask implementations, e.g., distributed masks, sparse
> masks, bit-array masks, etc.
>

Implicitly in the above, I agree with having the mask not necessarily be a
plain ndarray, but something that can determine part of the action. Makes
sense to generalize that to duck arrays for the reasons you give. Indeed,
if we let the mask do the mask propagation as well, it might help make the
implementation substantially easier (e.g., `logical_and.reduce` and
`logical_or.reduce` can be super-fast on a bitmask!).
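A toy version of such a `Mask` subclass (hypothetical; per the point above, the real design might equally be a duck type): it never applies the actual ufunc, it only propagates boolean masks — copy for unary ops, OR for n-ary ops, and a skipna-style AND for reductions:

```python
import numpy as np

class Mask(np.ndarray):
    """Boolean mask that handles its own propagation (toy sketch)."""

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Work on plain bool arrays to avoid re-dispatching to this method.
        masks = [np.asarray(i) for i in inputs if isinstance(i, Mask)]
        if method == 'reduce':
            # skipna-style: result is masked only where *all* inputs were
            return np.logical_and.reduce(masks[0], axis=kwargs.get('axis', 0))
        # unary ufuncs copy the mask; n-ary ufuncs OR the masks together
        out = masks[0].copy()
        for m in masks[1:]:
            out = np.logical_or(out, m)
        return out.view(Mask)

m1 = np.array([True, False, False]).view(Mask)
m2 = np.array([False, False, True]).view(Mask)
# np.add(m1, m2) -> [True, False, True]; np.add.reduce(m1) -> False
```

Note the ufunc itself is ignored entirely; only `method` matters, which is what makes the propagation cheap on alternative storage such as bitmasks.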


> Are there use-cases for propagating masks separately from data? If not, it
> might make sense to only define mask operations along with data, which
> could be much simpler.
>

I had only thought about separating out the concern of mask propagation
from the "MaskedArray" class to the mask proper, but it might indeed make
things easier if the mask also did any required preparation for passing
things on to the data (such as adjusting the "where" argument in a
reduction). I also like that this way the mask can determine even before
the data what functionality is available (i.e., it could be the place from
which to return `NotImplemented` for a ufunc.at call with a masked index
argument).

It may be good to collect a few more test cases... E.g., I'd like to mask
some of the astropy classes that are only very partial duck arrays, in that
they cover only the shape aspect, and which do have some operators and for
which it would be nice not to feel forced to use __array_ufunc__.

All the best,

Marten


Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Andras Deak
On Sun, Jun 23, 2019 at 10:37 PM Sebastian Berg
 wrote:
> Yeah, likely worth a shot. I doubt there are many uses for the
> n-dimensional axis transpose, so maybe a futurewarning approach can work.
> If not, I suppose the solution is deprecation for ndim != 2.

Any chance that the n-dimensional transpose is being used in code
interfacing fortran/matlab and python? One thing the current
multidimensional transpose is good for is to switch between row-major
and column-major order. I don't know, however, whether this switch
actually has to be done often in code, in practice.

András


Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Marten van Kerkwijk
Hi All,

I'd love to have `.T` mean the right thing, and am happy that people are
suggesting it after I told Stewart this was likely off-limits (which, in
fairness, did seem to be the conclusion when we visited this before...).
But is there something we can do to make it possible to use it already but
ensure that code on previous numpy versions breaks? (Or works, but that
seems impossible...)

For instance, in python2, one had `from __future__ import division (etc.);
could we have, e.g., a `from numpy.__future__ import matrix_transpose`,
which, when imported, implied that `.T` just did the right thing without
any warning? (Obviously, since that __future__.matrix_transpose wouldn't
exist on older versions of numpy, it would correctly break the code when
used with those.)

Also, a bit more towards the original request in the PR of a hermitian
transpose, if we're trying to go for `.T` eventually having the obvious
meaning, should we directly move towards also having `.H` as a short-cut
for `.T.conj()`? We could even expose that only with the above future
import - otherwise, the risk of abuse of `.T` would only grow...

Finally, on the meaning of `.T` for 1-D arrays, the sensible choices would
seem to be (1) error; or (2) change shape to `(n, 1)`. Since while writing
this sentence I changed my preference twice, I guess I should go for
erroring (I think we need a separate solution for easily making stacks of
row/column vectors).

All the best,

Marten

On Sun, Jun 23, 2019 at 4:37 PM Sebastian Berg 
wrote:

> On Sun, 2019-06-23 at 19:51 +, Hameer Abbasi wrote:
> > +1 for this. I have often seen (and sometimes written) code that does
> > this automatically, and it is a common mistake.
>
> Yeah, likely worth a shot. I doubt there are many uses for the
> n-dimensional axis transpose, so maybe a futurewarning approach can work.
> If not, I suppose the solution is deprecation for ndim != 2.
>
> Another point about the `.T` is the 1-dimensional case, which commonly
> causes confusion. If we do something here, should think about that as
> well.
>
> - Sebastian
>
>
> >
> > However, we will need some way to filter for intent, as the people
> > who write this code are the ones who didn’t read docs on it at the
> > time, and so there might be a fair amount of noise even if it fixes
> > their code.
> >
> > I also agree that a transpose of an array with ndim > 2 doesn’t make
> > sense without specifying the order, at least for the applications I
> > have seen so far.
> >
> > Get Outlook for iOS
> >
> > From: NumPy-Discussion <
> > numpy-discussion-bounces+einstein.edison=gmail@python.org> on
> > behalf of Eric Wieser 
> > Sent: Sunday, June 23, 2019 9:24 PM
> > To: Discussion of Numerical Python
> > Subject: Re: [Numpy-discussion] Syntax Improvement for Array
> > Transpose
> >
> > This might be contentious, but I wonder if, with a long enough
> > deprecation cycle, we can change the meaning of .T. That would look
> > like:
> >
> > * Emit a future warning on `more_than_2d.T` with a message like "in
> > future .T will transpose just the last two dimensions, not all
> > dimensions. Use arr.transpose() if transposing all {n} dimensions is
> > deliberate"
> > * Wait 5 releases or so, see how many matches Google / GitHub has for
> > this warning.
> > * If the impact is minimal, change .T
> > * If the impact is large, change to a deprecation warning
> >
> > An argument for this approach: a good amount of code I've seen in the
> > wild already assumes T is a 2d transpose, and as a result does not
> > work correctly when called with stacks of arrays. Changing T might
> > fix this broken code automatically.
> >
> > If the change would be too intrusive, then keeping the deprecation
> > warning at least prevents new users deliberately using .T for >2d
> > transposes, which is possibly valuable for readers.
> >
> > Eric
> >
> >
> > On Sun, Jun 23, 2019, 12:05 Stewart Clelland <
> > stewartclell...@gmail.com> wrote:
> > > Hi All,
> > >
> > > Based on discussion with Marten on github, I have a couple of
> > > suggestions on syntax improvements on array transpose operations.
> > >
> > > First, introducing a shorthand for the Hermitian Transpose
> > > operator. I thought "A.HT" might be a viable candidate.
> > >
> > > Second, the adding an array method that operates like a normal
> > > transpose. To my understanding,
> > > "A.transpose()" currently inverts the usual order of all dimensions.
> > > This may be useful in some applications involving tensors, but is
> > > not what I would usually assume a transpose on a multi-dimensional
> > > array would entail. I suggest a syntax of "A.MT" to indicate a
> > > transpose of the last two dimensions by default, maybe with
> > > optional arguments (i,j) to indicate which two dimensions to
> > > transpose.
> > >
> > > I'm new to this mailing list format, hopefully I'm doing this right
> > > :)
> > >
> > > Thanks,
> > > Stew

Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Sebastian Berg
On Sun, 2019-06-23 at 23:03 +0200, Andras Deak wrote:
> On Sun, Jun 23, 2019 at 10:37 PM Sebastian Berg
>  wrote:
> > Yeah, likely worth a shot. I doubt there are many uses for the
> > n-dimensional axis transpose, so maybe a futurewarning approach can
> > work. If not, I suppose the solution is deprecation for ndim != 2.
> 
> Any chance that the n-dimensional transpose is being used in code
> interfacing fortran/matlab and python? One thing the current
> multidimensional transpose is good for is to switch between row-major
> and column-major order. I don't know, however, whether this switch
> actually has to be done often in code, in practice.
> 

I suppose there is a chance for that, to fix the order of returned
arrays (for input arrays you probably need to fix the memory order, so
that `copy(..., order="F")` or `np.ensure` is more likely what you
want).

Those users should be fine to switch over to `arr.transpose()`. The
question is mostly if it hits so much code that it is painful.

- Sebastian



> András




Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Sebastian Berg
On Sun, 2019-06-23 at 17:12 -0400, Marten van Kerkwijk wrote:
> Hi All,
> 
> I'd love to have `.T` mean the right thing, and am happy that people
> are suggesting it after I told Stewart this was likely off-limits
> (which, in fairness, did seem to be the conclusion when we visited
> this before...). But is there something we can do to make it possible
> to use it already but ensure that code on previous numpy versions
> breaks? (Or works, but that seems impossible...)
> 
> For instance, in python2, one had `from __future__ import division
> (etc.); could we have, e.g., a `from numpy.__future__ import
> matrix_transpose`, which, when imported, implied that `.T` just did
> the right thing without any warning? (Obviously, since that
> __future__.matrix_transpose wouldn't exist on older versions of
> numpy, it would correctly break the code when used with those.)
> 

If I remember correctly, this is actually possible but hacky. So it
would probably be nicer to not go there. But yes, you are right, that
would mean that we practically limit `.T` to 2-D arrays for at least 2
years.

> Also, a bit more towards the original request in the PR of a
> hermitian transpose, if we're trying to go for `.T` eventually having
> the obvious meaning, should we directly move towards also having `.H`
> as a short-cut for `.T.conj()`? We could even expose that only with
> the above future import - otherwise, the risk of abuse of `.T` would
> only grow...

This opens the general question of how many and which attributes we
actually want on ndarray. My first gut reaction is that I am -0 on it,
but OTOH, for some math it is very nice and not a huge amount of
clutter...


> 
> Finally, on the meaning of `.T` for 1-D arrays, the sensible choices
> would seem to be (1) error; or (2) change shape to `(n, 1)`. Since while
> writing this sentence I changed my preference twice, I guess I should
> go for erroring (I think we need a separate solution for easily
> making stacks of row/column vectors).

Probably an error is good, which is nice, because we can just tag on a
warning and not worry about it for a while ;).

> 
> All the best,
> 
> Marten
> 
> On Sun, Jun 23, 2019 at 4:37 PM Sebastian Berg <
> sebast...@sipsolutions.net> wrote:
> > On Sun, 2019-06-23 at 19:51 +, Hameer Abbasi wrote:
> > > +1 for this. I have often seen (and sometimes written) code that
> > does
> > > this automatically, and it is a common mistake.
> > 
> > Yeah, likely worth a shot. I doubt there are many uses for the
> > n-dimensional axis transpose, so maybe a futurewarning approach can
> > work. If not, I suppose the solution is deprecation for ndim != 2.
> > 
> > Another point about the `.T` is the 1-dimensional case, which
> > commonly
> > causes confusion. If we do something here, should think about that
> > as
> > well.
> > 
> > - Sebastian
> > 
> > 
> > > 
> > > However, we will need some way to filter for intent, as the
> > people
> > > who write this code are the ones who didn’t read docs on it at
> > the
> > > time, and so there might be a fair amount of noise even if it
> > fixes
> > > their code.
> > > 
> > > I also agree that a transpose of an array with ndim > 2 doesn’t
> > make
> > > sense without specifying the order, at least for the applications
> > I
> > > have seen so far.
> > > 
> > > Get Outlook for iOS
> > >  
> > > From: NumPy-Discussion <
> > > numpy-discussion-bounces+einstein.edison=gmail@python.org> on
> > > behalf of Eric Wieser 
> > > Sent: Sunday, June 23, 2019 9:24 PM
> > > To: Discussion of Numerical Python
> > > Subject: Re: [Numpy-discussion] Syntax Improvement for Array
> > > Transpose
> > >  
> > > This might be contentious, but I wonder if, with a long enough
> > > deprecation cycle, we can change the meaning of .T. That would
> > look
> > > like:
> > > 
> > > * Emit a future warning on `more_than_2d.T` with a message like
> > "in
> > > future .T will transpose just the last two dimensions, not all
> > > dimensions. Use arr.transpose() if transposing all {n} dimensions
> > is
> > > deliberate"
> > > * Wait 5 releases or so, see how many matches Google / GitHub has
> > for
> > > this warning.
> > > * If the impact is minimal, change .T
> > > * If the impact is large, change to a deprecation warning
> > > 
> > > An argument for this approach: a good amount of code I've seen in
> > the
> > > wild already assumes T is a 2d transpose, and as a result does
> > not
> > > work correctly when called with stacks of arrays. Changing T
> > might
> > > fix this broken code automatically.
> > > 
> > > If the change would be too intrusive, then keeping the
> > deprecation
> > > warning at least prevents new users deliberately using .T for >2d
> > > transposes, which is possibly valuable for readers.
> > > 
> > > Eric
> > > 
> > > 
> > > On Sun, Jun 23, 2019, 12:05 Stewart Clelland <
> > > stewartclell...@gmail.com> wrote:
> > > > Hi All,
> > > > 
> > > > Based on discussion with Marten on github, I have a couple of
> > > > suggestions 

Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Eric Wieser
If I remember correctly, [numpy.future imports are] actually possible
but hacky. So it would probably be nicer to not go there.

There was some discussion of this at
https://stackoverflow.com/q/29905278/102441.
I agree with the conclusion we should not go there - in particular,
note that every builtin __future__ feature has been an
interpreter-level change, not an object-level change.
from __future__ import division changes the meaning of /, not of int.__div__.
Framing the numpy change this way would mean rewriting Attribute(obj,
attr, Load) ast nodes to Call(np._attr_override, obj, attr), which is
obviously not interoperable with any other module wanting to do the same
thing.

This opens other unpleasant cans of worms about “builtin” modules that
perform attribute access:

Should getattr(arr, 'T') change behavior based on the module that calls it?
Should operator.itemgetter('T') change behavior ?

So I do not think we want to go down that road.

On Sun, 23 Jun 2019 at 14:28, Sebastian Berg  wrote:
>
> On Sun, 2019-06-23 at 17:12 -0400, Marten van Kerkwijk wrote:
> > Hi All,
> >
> > I'd love to have `.T` mean the right thing, and am happy that people
> > are suggesting it after I told Stewart this was likely off-limits
> > (which, in fairness, did seem to be the conclusion when we visited
> > this before...). But is there something we can do to make it possible
> > to use it already but ensure that code on previous numpy versions
> > breaks? (Or works, but that seems impossible...)
> >
> > For instance, in python2, one had `from __future__ import division
> > (etc.); could we have, e.g., a `from numpy.__future__ import
> > matrix_transpose`, which, when imported, implied that `.T` just did
> > the right thing without any warning? (Obviously, since that
> > __future__.matrix_transpose wouldn't exist on older versions of
> > numpy, it would correctly break the code when used with those.)
> >
>
> If I remember correctly, this is actually possible but hacky. So it
> would probably be nicer to not go there. But yes, you are right, that
> would mean that we practically limit `.T` to 2-D arrays for at least 2
> years.
>
> > Also, a bit more towards the original request in the PR of a
> > hermitian transpose, if we're trying to go for `.T` eventually having
> > the obvious meaning, should we directly move towards also having `.H`
> > as a short-cut for `.T.conj()`? We could even expose that only with
> > the above future import - otherwise, the risk of abuse of `.T` would
> > only grow...
>
> This opens the general question of how many and which attributes we
> actually want on ndarray. My first gut reaction is that I am -0 on it,
> but OTOH, for some math it is very nice and not a huge amount of
> clutter...
>
>
> >
> > Finally, on the meaning of `.T` for 1-D arrays, the sensible choices
> > would seem to be (1) error; or (2) change shape to `(n, 1)`. Since while
> > writing this sentence I changed my preference twice, I guess I should
> > go for erroring (I think we need a separate solution for easily
> > making stacks of row/column vectors).
>
> Probably an error is good, which is nice, because we can just tag on a
> warning and not worry about it for a while ;).
>
> >
> > All the best,
> >
> > Marten
> >
> > On Sun, Jun 23, 2019 at 4:37 PM Sebastian Berg <
> > sebast...@sipsolutions.net> wrote:
> > > On Sun, 2019-06-23 at 19:51 +, Hameer Abbasi wrote:
> > > > +1 for this. I have often seen (and sometimes written) code that
> > > does
> > > > this automatically, and it is a common mistake.
> > >
> > > Yeah, likely worth a shot. I doubt there are many uses for the
> > > n-dimensional axis transpose, so maybe a futurewarning approach can
> > > work. If not, I suppose the solution is deprecation for ndim != 2.
> > >
> > > Another point about the `.T` is the 1-dimensional case, which
> > > commonly
> > > causes confusion. If we do something here, should think about that
> > > as
> > > well.
> > >
> > > - Sebastian
> > >
> > >
> > > >
> > > > However, we will need some way to filter for intent, as the
> > > people
> > > > who write this code are the ones who didn’t read docs on it at
> > > the
> > > > time, and so there might be a fair amount of noise even if it
> > > fixes
> > > > their code.
> > > >
> > > > I also agree that a transpose of an array with ndim > 2 doesn’t
> > > make
> > > > sense without specifying the order, at least for the applications
> > > I
> > > > have seen so far.
> > > >
> > > > Get Outlook for iOS
> > > >
> > > > From: NumPy-Discussion <
> > > > numpy-discussion-bounces+einstein.edison=gmail@python.org> on
> > > > behalf of Eric Wieser 
> > > > Sent: Sunday, June 23, 2019 9:24 PM
> > > > To: Discussion of Numerical Python
> > > > Subject: Re: [Numpy-discussion] Syntax Improvement for Array
> > > > Transpose
> > > >
> > > > This might be contentious, but I wonder if, with a long enough
> > > > deprecation cycle, we can change the meaning of .T. That would
> > > look
> > > > like:
> > > >

Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Stephan Hoyer
On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <
m.h.vankerkw...@gmail.com> wrote:

> Your proposal would be something like np.sum(array,
>> where=np.ones_like(array))? This seems rather verbose for a common
>> operation. Perhaps np.sum(array, where=True) would work, making use of
>> broadcasting? (I haven't actually checked whether this is well-defined yet.)
>>
> I think we'd need to consider separately the operation on the mask and on
> the data. In my proposal, the data would always do `np.sum(array,
> where=~mask)`, while how the mask would propagate might depend on the mask
> itself, i.e., we'd have different mask types for `skipna=True` (default)
> and `False` ("contagious") reductions, which differed in doing
> `logical_and.reduce` or `logical_or.reduce` on the mask.
>

OK, I think I finally understand what you're getting at. So suppose this
is how we implement it internally. Would we really insist on a user
creating a new MaskedArray with a new mask object, e.g., with a GreedyMask?
We could add sugar for this, but certainly array.greedy_masked().sum() is
significantly less clear than array.sum(skipna=False).

I'm also a little concerned about a proliferation of MaskedArray/Mask
types. New types are significantly harder to understand than new functions
(or new arguments on existing functions). I don't know if we have enough
distinct use cases for this many types.

Are there use-cases for propagating masks separately from data? If not, it
>> might make sense to only define mask operations along with data, which
>> could be much simpler.
>>
>
> I had only thought about separating out the concern of mask propagation
> from the "MaskedArray" class to the mask proper, but it might indeed make
> things easier if the mask also did any required preparation for passing
> things on to the data (such as adjusting the "where" argument in a
> reduction). I also like that this way the mask can determine even before
> the data what functionality is available (i.e., it could be the place from
> which to return `NotImplemented` for a ufunc.at call with a masked index
> argument).
>

You're going to have to come up with something more compelling than
"separation of concerns" to convince me that this extra Mask abstraction is
worthwhile. On its own, I think a separate Mask class would only obfuscate
MaskedArray functions.

For example, compare these two implementations of add:

def add1(x, y):
    return MaskedArray(x.data + y.data, x.mask | y.mask)

def add2(x, y):
    return MaskedArray(x.data + y.data, x.mask + y.mask)

The second version requires that you *also* know how Mask classes work, and
how they implement +. So now you need to look in at least twice as many
places to understand add() for MaskedArray objects.


Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Eric Wieser
I think we’d need to consider separately the operation on the mask and on
the data. In my proposal, the data would always do np.sum(array,
where=~mask), while how the mask would propagate might depend on the mask
itself,

I quite like this idea, and I think Stephan’s strawman design is actually
plausible, where MaskedArray.mask is either an InvalidMask or a IgnoreMask
instance to pick between the different propagation types. Both classes
could simply have an underlying ._array attribute pointing to a duck-array
of some kind that backs their boolean data.
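A minimal sketch of that strawman (all class names hypothetical): both mask types share element-wise propagation and an underlying `._array` that could be any boolean duck-array, and differ only in how they reduce:

```python
import numpy as np

class BaseMask:
    """Shared behaviour; `_array` can back onto any boolean duck-array."""

    def __init__(self, array):
        self._array = np.asarray(array, dtype=bool)

    def __or__(self, other):
        # element-wise propagation is identical for both mask types
        return type(self)(self._array | other._array)

class IgnoreMask(BaseMask):
    def reduce(self, axis=None):
        # skipped elements never taint the result
        return np.logical_and.reduce(self._array, axis=axis)

class InvalidMask(BaseMask):
    def reduce(self, axis=None):
        # a single invalid element invalidates the whole result
        return np.logical_or.reduce(self._array, axis=axis)
```

`MaskedArray.mask` would then hold one of these instances, and reductions on the data would consult `mask.reduce(...)` for the result's mask.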

The second version requires that you *also* know how Mask classes work, and
how they implement +

I remain unconvinced that Mask classes should behave differently on
different ufuncs. I don’t think np.minimum(ignore_na, b) is any different
to np.add(ignore_na, b) - either both should produce b, or both should
produce ignore_na. I would lean towards producing ignore_na, and
propagation behavior differing between “ignore” and “invalid” only for
reduce / accumulate operations, where the concept of skipping an
application is well-defined.

Some possible follow-up questions that having two distinct masked types
raise:

   - what if I want my data to support both invalid and skip fields at the
   same time? sum([invalid, skip, 1]) == invalid
   - is there a use case for more than these two types of mask?
   invalid_due_to_reason_A, invalid_due_to_reason_B would be interesting
   things to track through a calculation, possibly a dictionary of named masks.

Eric

On Sun, 23 Jun 2019 at 15:28, Stephan Hoyer  wrote:

> On Sun, Jun 23, 2019 at 11:55 PM Marten van Kerkwijk <
> m.h.vankerkw...@gmail.com> wrote:
>
>> Your proposal would be something like np.sum(array,
>>> where=np.ones_like(array))? This seems rather verbose for a common
>>> operation. Perhaps np.sum(array, where=True) would work, making use of
>>> broadcasting? (I haven't actually checked whether this is well-defined yet.)
>>>
>> I think we'd need to consider separately the operation on the mask and
>> on the data. In my proposal, the data would always do `np.sum(array,
>> where=~mask)`, while how the mask would propagate might depend on the mask
>> itself, i.e., we'd have different mask types for `skipna=True` (default)
>> and `False` ("contagious") reductions, which differed in doing
>> `logical_and.reduce` or `logical_or.reduce` on the mask.
>>
>
> OK, I think I finally understand what you're getting at. So suppose this
> is how we implement it internally. Would we really insist on a user
> creating a new MaskedArray with a new mask object, e.g., with a GreedyMask?
> We could add sugar for this, but certainly array.greedy_masked().sum() is
> significantly less clear than array.sum(skipna=False).
>
> I'm also a little concerned about a proliferation of MaskedArray/Mask
> types. New types are significantly harder to understand than new functions
> (or new arguments on existing functions). I don't know if we have enough
> distinct use cases for this many types.
>
> Are there use-cases for propagating masks separately from data? If not, it
>>> might make sense to only define mask operations along with data, which
>>> could be much simpler.
>>>
>>
>> I had only thought about separating out the concern of mask propagation
>> from the "MaskedArray" class to the mask proper, but it might indeed make
>> things easier if the mask also did any required preparation for passing
>> things on to the data (such as adjusting the "where" argument in a
>> reduction). I also like that this way the mask can determine even before
>> the data what functionality is available (i.e., it could be the place from
>> which to return `NotImplemented` for a ufunc.at call with a masked index
>> argument).
>>
>
> You're going to have to come up with something more compelling than
> "separation of concerns" to convince me that this extra Mask abstraction is
> worthwhile. On its own, I think a separate Mask class would only obfuscate
> MaskedArray functions.
>
> For example, compare these two implementations of add:
>
> def add1(x, y):
>     return MaskedArray(x.data + y.data, x.mask | y.mask)
>
> def add2(x, y):
>     return MaskedArray(x.data + y.data, x.mask + y.mask)
>
> The second version requires that you *also* know how Mask classes work,
> and how they implement +. So now you need to look in at least twice as many
> places to understand add() for MaskedArray objects.
>


Re: [Numpy-discussion] Syntax Improvement for Array Transpose

2019-06-23 Thread Marten van Kerkwijk
I had not looked at any implementation (only remembered the nice idea of
"importing from the future"), and looking at the links Eric shared, it
seems that the only way this would work is, effectively, pre-compilation
doing a `.replace('.T', '._T_from_the_future')`, where you'd be
hoping that there never is any other meaning for a `.T` attribute, for any
class, since it is impossible to be sure a given variable is an ndarray.
(Actually, a lot less implausible than for the case of numpy indexing
discussed in the link...)

Anyway, what I had in mind was something along the lines of, inside the `.T`
code, there being a check on whether a particular future item was present
in the environment. But thinking more, I can see that it is not trivial to
learn something about the environment in which the code that called
you was written.

So, it seems there is no (simple) way to tell numpy that inside a given
module you want `.T` to have the new behaviour, but still to warn if
outside the module it is used in the old way (when risky)?

-- Marten

p.s. I'm somewhat loath to add new properties to ndarray, but `.T` and `.H`
have such obvious and clear meaning to anyone dealing with (complex)
matrices that I think it is worth it. See
https://mail.python.org/pipermail/numpy-discussion/2019-June/079584.html
for a list of options of attributes that we might deprecate "in exchange"...

All the best,

Marten


Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Marten van Kerkwijk
Hi Stephan,

Eric perhaps explained my concept better than I could!

I do agree that, as written, your example would be clearer, but Allan's
code and the current MaskedArray code do not bear much resemblance to
it, and mine even less, as they deal with operators as whole groups.

For mine, it may be useful to note that this quite possibly crazy idea came
from considering how one would quite logically implement all shape
operations, which affect the data and mask the same way, so one soon writes
down something where (as in my ShapedLikeNDArray mixin;
https://github.com/astropy/astropy/blob/master/astropy/utils/misc.py#L858)
all methods pass on to `return self._apply(method, *args, **kwargs)`, with
`_apply` looking like:
```
def _apply(self, method, *args, **kwargs):
    # For use with NDArrayShapeMethods.
    data = getattr(self._data, method)(*args, **kwargs)
    mask = getattr(self._mask, method)(*args, **kwargs)

    return self.__class__(data, mask=mask)
```
(Or the same is auto-generated for all those methods.)
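To see that pattern end-to-end, here is a minimal stand-in class (names assumed for illustration, not the actual astropy implementation) where a shape method simply forwards through `_apply`, hitting data and mask alike:

```python
import numpy as np

class TinyMasked:
    """Toy masked container: shape operations apply to data and mask alike."""

    def __init__(self, data, mask):
        self._data = np.asarray(data)
        self._mask = np.asarray(mask, dtype=bool)

    def _apply(self, method, *args, **kwargs):
        data = getattr(self._data, method)(*args, **kwargs)
        mask = getattr(self._mask, method)(*args, **kwargs)
        return self.__class__(data, mask=mask)

    def reshape(self, *shape):
        return self._apply('reshape', *shape)

m = TinyMasked([1, 2, 3, 4], [False, True, False, False]).reshape(2, 2)
# m._data and m._mask both now have shape (2, 2)
```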

Now considering this, for `__array_ufunc__`, one might similarly do
(__call__ only, ignoring outputs)
```
def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
    # Sketch only: `_data_mask`, `mask_kwargs`, `out`, and
    # `_masked_result` are assumed helpers; outputs are ignored here.
    data = []
    masks = []
    for input_ in inputs:
        d, m = self._data_mask(input_)
        data.append(d)
        masks.append(m)
    result_mask = getattr(ufunc, method)(*masks, **mask_kwargs)
    if result_mask is NotImplemented:
        return NotImplemented

    result_data = getattr(ufunc, method)(*data, **kwargs)
    return self._masked_result(result_data, result_mask, out)
```
The beauty here is that the Masked class itself needs to know very little
about how to do masking, or how to implement it; it is almost like a
collection of arrays: the only thing that separates `__array_ufunc__` from
`_apply` above is that, since other objects are involved, the data and mask
have to be separated out for those. (And perhaps for people trying to
change it, it actually helps that the special-casing of single-, double-,
and triple-input ufuncs can all be done inside the mask class, and
adjustments can be made there without having to understand the machinery
for getting masks out of data, etc.?)

But what brought me to this was not __array_ufunc__ itself, but rather the
next step: that I really would like to have masked versions of classes that
are array-like in shape but do not generally work with numpy ufuncs or
functions, and where I prefer not to force the use of numpy ufuncs. So,
inside the masked class I have to override the operators. Obviously, there
again I could do it with functions like you describe, but I can also just
split off data and masks as above, and just call the same operator on both.
Then, the code could basically use that of `__array_ufunc__` above when
called as `self.__array_ufunc__(operator, '__add__', self, other)`.
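A hedged sketch of that operator-overriding approach, with all helper names hypothetical: the wrapped data only needs to support the plain operator, not the numpy ufunc machinery.

```python
import operator
import numpy as np

class MaskedOps:
    """Illustrative sketch: split data and mask, apply the plain operator
    to the data, OR the masks; no ufunc support required of the data."""

    def __init__(self, data, mask):
        self._data = data                # any operator-supporting object
        self._mask = np.asarray(mask, bool)

    @classmethod
    def _split(cls, x):
        if isinstance(x, cls):
            return x._data, x._mask
        return x, False                  # unmasked operands add no mask

    def _binary_op(self, op, other):
        d1, m1 = self._split(self)
        d2, m2 = self._split(other)
        return self.__class__(op(d1, d2), np.logical_or(m1, m2))

    def __add__(self, other):
        return self._binary_op(operator.add, other)

    def __mul__(self, other):
        return self._binary_op(operator.mul, other)

x = MaskedOps(np.array([1, 2, 3]), [False, True, False])
y = x + np.array([10, 20, 30])
print(y._data, y._mask)  # [11 22 33] [False  True False]
```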

I think (but am not sure, not actually having tried it yet) that this
would also make it easier to mostly auto-generate masked classes: methods
supported by both the mask and the data are relatively easy; those specific
to the data need some hints from the data class (e.g., `.unit` on a
`Quantity` is independent of the mask).

Anyway, just to be clear, I am *not* sure that this is the right way
forward. I think the more important point really was your suggestion to
take mask duck types into account a priori (for bit-packed ones, etc.). But it seems
worth thinking it through now, so we don't repeat the things that ended up
making `MaskedArray` quite annoying.

All the best,

Marten
___
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] new MaskedArray class

2019-06-23 Thread Marten van Kerkwijk
Hi Eric,

On your other points:

I remain unconvinced that Mask classes should behave differently on
> different ufuncs. I don’t think np.minimum(ignore_na, b) is any different
> to np.add(ignore_na, b) - either both should produce b, or both should
> produce ignore_na. I would lean towards producing ignore_na, and
> propagation behavior differing between “ignore” and “invalid” only for
> reduce / accumulate operations, where the concept of skipping an
> application is well-defined.
>
I think I mostly agree - this is really about reductions. And the fact
that there are apparently only two choices weakens the case for pushing the
logic into the mask class itself.

But the one case that still tempts me to break with the strict rule for
ufunc.__call__ is `fmin, fmax` vs `minimum, maximum`... What do you think?
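For reference, the reason `fmin`/`fmax` are tempting as a special case is that, for NaN (the de facto float missing marker), they already behave like "skip" versions of `minimum`/`maximum` element-wise:

```python
import numpy as np

a = np.array([np.nan, 1.0, 3.0])
b = np.array([2.0, np.nan, 2.0])

# minimum/maximum propagate NaN; fmin/fmax skip it where possible.
print(np.minimum(a, b))  # [nan nan  2.]
print(np.fmin(a, b))     # [2. 1. 2.]
```

So a strict "ufunc.__call__ never skips" rule would make a masked `minimum` and a masked `fmin` behave identically, losing that distinction.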


> Some possible follow-up questions that having two distinct masked types
> raise:
>
>- what if I want my data to support both invalid and skip fields at
>the same time? sum([invalid, skip, 1]) == invalid
>
> Have a triple-valued mask? Would be easy to implement if all the logic is
in the mask...

(Indeed, for all I care it could implement weighting! That would actually
care about the actual operation, so would be a real example. Though of
course it also does need access to the actual data, so perhaps best not to
go there...)
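A minimal sketch of what a triple-valued mask could look like, assuming small integers for the states and "most severe wins" as the combination rule (the state names and values are made up for illustration):

```python
import numpy as np

# Hypothetical three-state mask: 0 = valid, 1 = skip, 2 = invalid.
GOOD, SKIP, INVALID = 0, 1, 2

def combine_masks(*masks):
    # "Most severe wins": invalid dominates skip, skip dominates valid.
    return np.maximum.reduce([np.asarray(m) for m in masks])

m1 = np.array([GOOD, SKIP, INVALID, GOOD])
m2 = np.array([SKIP, GOOD, GOOD, GOOD])
print(combine_masks(m1, m2))  # [1 1 2 0]
```

A reduction would then skip the `SKIP` elements but return `INVALID` whenever any `INVALID` element is involved, giving `sum([invalid, skip, 1]) == invalid` as above.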

>
>- is there a use case for more than these two types of mask?
>invalid_due_to_reason_A, invalid_due_to_reason_B would be interesting
>things to track through a calculation, possibly a dictionary of named 
> masks.
>
> For astropy's NDData, there has been quite a bit of discussion of a
`Flags` object, which works exactly as you describe, an OR together of
different reasons for why data is invalid (HST uses this, though the
discussion was for the Large Synoptic Survey Telescope data pipeline). Those
flags are propagated like masks.
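The bit-flag idea can be sketched in a few lines; the flag names and values here are invented for illustration, not any actual pipeline's:

```python
import numpy as np

# Each bit records one reason a datum is suspect (illustrative names).
SATURATED  = 1 << 0
COSMIC_RAY = 1 << 1
BAD_PIXEL  = 1 << 2

f1 = np.array([0, SATURATED, 0, BAD_PIXEL], dtype=np.uint8)
f2 = np.array([COSMIC_RAY, 0, 0, SATURATED], dtype=np.uint8)

# A binary operation on the data would OR the flags, just like a mask:
combined = np.bitwise_or(f1, f2)
print(combined)       # [2 1 0 5]
# A plain boolean "anything wrong?" mask falls out for free:
print(combined != 0)  # [ True  True False  True]
```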

I think in most cases, these examples would not require more than allowing
the mask to be a duck type. Though perhaps for some reductions, it might
matter what the reduction of the data is actually doing (e.g.,
`np.minimum.reduce` might need different rules than `np.add.reduce`). And,
of course, one can argue that for such a case it might be best to subclass
MaskedArray itself, and do the logic there.

All the best,

Marten

p.s. For accumulations, I'm still not sure I find them well-defined. I
could see that `np.add.accumulate([0, 1, 1, --, 3])` could lead to `[0, 1, 2,
5]`, i.e., a shorter sequence, but this doesn't work on arrays where
different rows can have different numbers of masked elements. It then
perhaps suggests `[0, 1, 2, --, 5]` is OK, but the annoyance I have is that
there is nothing that tells me what the underlying data should be, i.e.,
this is truly different from having a `where` keyword in `np.add.reduce`.
But perhaps it is that I do not see much use for accumulation beyond
changing a histogram into its cumulative version - for which masked
elements really make no sense - one somehow has to interpolate over, not
just discount the masked one.
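For what it's worth, the `[0, 1, 2, --, 5]` option can be illustrated with a `where`-style implementation that substitutes the additive identity under the mask; the masked output position then holds whatever that substitution produced, which is exactly why the data under the mask should be treated as undefined:

```python
import numpy as np

# Accumulating [0, 1, 1, --, 3]: the value under the mask (99 here)
# must not influence the result.
data = np.array([0, 1, 1, 99, 3])
mask = np.array([False, False, False, True, False])

# One possible implementation: substitute the identity, then accumulate.
result = np.add.accumulate(np.where(mask, 0, data))
print(result)         # [0 1 2 2 5]
print(result[~mask])  # [0 1 2 5] -- the unmasked partial sums
```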