On Tue, Aug 21, 2018 at 12:21 AM Nathaniel Smith <n...@pobox.com> wrote:

> On Wed, Aug 15, 2018 at 9:45 AM, Stephan Hoyer <sho...@gmail.com> wrote:
> > This avoids a classic subclassing problem that has plagued NumPy for
> years,
> > where overriding the behavior of method A causes apparently unrelated
> method
> > B to break, because it relied on method A internally. In NumPy, this
> > constrained our implementation of np.median(), because it needed to call
> > np.mean() in order for subclasses implementing units to work properly.
>
> I don't think I follow... if B uses A internally, then overriding A
> shouldn't cause B to break, unless the overridden A is buggy.
>

Let me try another example, with arrays with units. My understanding of the
contract provided by unit implementations is that their behavior should never
deviate from NumPy's unless an operation raises an error. (This is more
explicit for arrays with units because they raise errors for operations
with incompatible units, but practically speaking almost all duck arrays
will have at least some unsupported operations in NumPy's giant API.)
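To make that contract concrete, here's a rough sketch of the kind of thing I
mean (the UnitArray class is hypothetical, loosely in the spirit of
astropy.units or pint):

    import numpy as np

    class UnitArray:
        """Hypothetical array-with-units: behaves like an ndarray,
        except that operations with incompatible units raise."""

        def __init__(self, data, unit):
            self.data = np.asarray(data)
            self.unit = unit

        def __add__(self, other):
            # The only allowed deviation from plain-ndarray semantics
            # is to raise on incompatible units.
            if not isinstance(other, UnitArray) or other.unit != self.unit:
                raise TypeError(
                    f"cannot add {self.unit!r} array and {other!r}")
            return UnitArray(self.data + other.data, self.unit)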

It is quite possible that NumPy functions could be (re)written in a way
that is incompatible with some unit implementations but is perfectly valid
for "full" duck arrays. We actually see this even within NumPy already --
for example, see this recent PR adding support for the datetime64 dtype to
percentile:
https://github.com/numpy/numpy/pull/11627

A lesser version of this is when changes in NumPy cause performance issues for
users of duck arrays, which is basically inevitable if we share
implementations.

I don't think it's possible to anticipate all of these cases, and I don't
want NumPy to be unduly constrained in its internal design. I want our user
support answer to be simple: if you care about performance for a particular
array operation on your type of arrays, you should implement it yourself
(i.e., with __array_function__).
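As a rough sketch of what I have in mind (the MyDuckArray class and the
implements() registry are hypothetical, following the pattern from NEP 18):

    import numpy as np

    HANDLED_FUNCTIONS = {}

    def implements(numpy_function):
        """Register a custom implementation of a NumPy function."""
        def decorator(func):
            HANDLED_FUNCTIONS[numpy_function] = func
            return func
        return decorator

    class MyDuckArray:
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_function__(self, func, types, args, kwargs):
            if func not in HANDLED_FUNCTIONS:
                # Let NumPy raise TypeError rather than silently
                # falling back to a possibly-wrong default.
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

    @implements(np.sum)
    def _sum(arr, axis=None):
        # The project's own (presumably faster) reduction; NumPy's
        # internal choices never leak into this code path.
        return MyDuckArray(arr.data.sum(axis=axis))

With this, np.sum(MyDuckArray([1, 2, 3])) dispatches to _sum(), and every
other NumPy function raises TypeError until the project decides to support it.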

This definitely doesn't preclude the careful, systematic overriding
approach. But I think we'll almost always want NumPy's external API to be
overridable.

And when we fix a bug in row_stack, this means we also have to fix it
> in all the copy-paste versions, which won't happen, so np.row_stack
> has different semantics on different objects, even if they started out
> matching. The NDArrayOperatorsMixin reduces the number of duplicate
> copies of the same code that need to be updated, but 2 copies is still
> a lot worse than 1 copy :-).
>

I see your point, but in all seriousness, if we encounter a bug in np.row_stack
at this point we might just call it a feature instead.


> > 1. The details of how NumPy implements a high-level function in terms of
> overloaded functions now becomes an implicit part of NumPy’s public API.
> For example, refactoring stack to use np.block() instead of
> np.concatenate() internally would now become a breaking change.
>
> The way I'm imagining this would work is, we guarantee not to take a
> function that used to be implemented in terms of overridable
> operations, and refactor it so it's implemented in terms of
> overridable operations. So long as people have correct implementations
> of __array_concatenate__ and __array_block__, they shouldn't care
> which one we use. In the interim period where we have
> __array_concatenate__ but there's no such thing as __array_block__,
> then that refactoring would indeed break things, so we shouldn't do
> that :-). But we could fix that by adding __array_block__.
>

""we guarantee not to take a function that used to be implemented in terms
of overridable operations, and refactor it so it's implemented in terms of
overridable operations"
Did you miss a "not" in here somewhere, e.g., "refactor it so it's NOT
implemented"?

If we ever tried to do something like this, I'm pretty sure that it just
wouldn't happen -- unless we also change NumPy's extremely conservative
approach to breaking third-party code. np.block() is much more complex to
implement than np.concatenate(), and users would resist being forced to
handle that complexity if they don't need it. (Example: TensorFlow has a
concatenate function, but not block.)
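To make the complexity gap concrete at the call-site level (plain NumPy here,
nothing hypothetical):

    import numpy as np

    a, b = np.ones((2, 2)), np.zeros((2, 2))

    # concatenate: one flat join along a single axis
    np.concatenate([a, b], axis=1)   # shape (2, 4)

    # block: assembles arbitrarily nested lists of blocks, so any
    # reimplementation has to handle recursion over the nesting,
    # shape compatibility checks, etc.
    np.block([[a, b],
              [b, a]])               # shape (4, 4)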


> > 2. Array libraries may prefer to implement high level functions
> differently than NumPy. For example, a library might prefer to implement a
> fundamental operations like mean() directly rather than relying on sum()
> followed by division. More generally, it’s not clear yet what exactly
> qualifies as core functionality, and figuring this out could be a large
> project.
>
> True. And this is a very general problem... for example, the
> appropriate way to implement logistic regression is very different
> in-core versus out-of-core. You're never going to be able to take code
> written for ndarray, drop in an arbitrary new array object, and get
> optimal results in all cases -- that's just way too ambitious to hope
> for. There will be cases where reducing to operations like sum() and
> division is fine. There will be cases where you have a high-level
> operation like logistic regression, where reducing to sum() and
> division doesn't work, but reducing to slightly-higher-level
> operations like np.mean also doesn't work, because you need to redo
> the whole high-level operation. And then there will be cases where
> sum() and division are too low-level, but mean() is high-level enough
> to make the critical difference. It's that last one where it's
> important to be able to override mean() directly. Are there a lot of
> cases like this?
>

mean() is not entirely hypothetical. TensorFlow and Eigen actually do
implement mean separately from sum, though to be honest it's not entirely
clear to me why:
https://github.com/tensorflow/tensorflow/blob/1c1dad105a57bb13711492a8ba5ab9d10c91b5df/tensorflow/core/kernels/reduction_ops_mean.cc
https://eigen.tuxfamily.org/dox/unsupported/TensorFunctors_8h_source.html

I do think this probably will come up with some frequency for other
operations, but the bigger answer here really is consistency -- it allows
projects and their users to have very clearly defined dependencies on
NumPy's API. They don't need to worry about any implementation details from
NumPy leaking into their override of a function.
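For instance, an out-of-core duck array might want to own mean() outright with
a chunked running mean, rather than inheriting whatever sum-then-divide NumPy
happens to use internally. A rough sketch (the chunked_mean helper is
hypothetical):

    import numpy as np

    def chunked_mean(chunks):
        """Running mean over an iterable of ndarray chunks, without
        ever materializing one giant sum over the whole array."""
        count, mean = 0, 0.0
        for chunk in chunks:
            chunk = np.asarray(chunk, dtype=float)
            n = chunk.size
            count += n
            # Incremental update: nudge the running mean toward this
            # chunk's mean, weighted by the chunk's share of the total.
            mean += (chunk.mean() - mean) * (n / count)
        return mean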


> > 3. We don’t yet have an overloading system for attributes and methods on
> array objects, e.g., for accessing .dtype and .shape. This should be the
> subject of a future NEP, but until then we should be reluctant to rely on
> these properties.
>
> This one I don't understand. If you have a duck-array object, and you
> want to access its .dtype or .shape attributes, you just... write
> myobj.dtype or myobj.shape? That doesn't need a NEP though so I must
> be missing something :-).
>

We don't have np.asduckarray() yet or whatever we'll end up calling our
proposed casting function from NEP 22, so we don't have a fully fleshed out
mechanism for NumPy to declare "this object needs to support .shape and
.dtype, or I'm going to cast it into something that does".
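Purely as an illustration of the kind of declaration that's missing
(np.asduckarray doesn't exist yet; this helper and its checks are
hypothetical):

    import numpy as np

    def asduckarray(obj):
        """Hypothetical coercion function: accept objects that already
        quack like an ndarray, otherwise fall back to np.asarray()."""
        if hasattr(obj, "shape") and hasattr(obj, "dtype"):
            # Trust the duck typing and pass the object through.
            return obj
        return np.asarray(obj)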

More comments on the environment variable and the interface to come in my
next email...

Cheers,
Stephan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion
