On Sat, Mar 10, 2018 at 4:27 AM, Matthew Rocklin <mrock...@gmail.com> wrote:
> I'm very glad to see this discussion.
>
> I think that coming up with a single definition of array-like may be
> difficult, and that we might end up wanting to embrace duck typing instead.
>
> It seems to me that different array-like classes will implement different
> mixtures of features.  It may be difficult to pin down a single definition
> that includes anything except for the most basic attributes (shape and
> dtype?).  Consider two extreme cases of restrictive functionality:
>
> - LinearOperators (support dot in a numpy-like way)
> - Storage objects like h5py (support getitem in a numpy-like way)
>
> I can imagine authors of both groups saying that they should qualify as
> array-like because downstream projects that consume them should not convert
> them to numpy arrays in important contexts.

I think this is an important point -- there are a lot of subtleties in
the interfaces that different objects might want to provide. Some
interesting ones that haven't been mentioned:

- a "duck array" that has everything except fancy indexing
- xarray's arrays are just like numpy arrays in most ways, but they
have incompatible broadcasting semantics
- immutable vs. mutable arrays

When faced with this kind of situation, it's always tempting to try
to write down some classification system that captures every possible
configuration of interesting behavior. In fact, this is one of the
most classic nerd snipes; it's been catching people for literally
thousands of years [1]. Most of these attempts fail, though :-).

So let's back up -- I probably erred in not making this more clear in
the NEP, but I actually have a fairly concrete use case in mind here.
What happened is, I started working on a NEP for
__array_concatenate__, and my thought pattern went as follows:

1) Cool, this should work for np.concatenate.
2) But what about all the other variants, like np.row_stack? We
don't want __array_row_stack__; we want to express row_stack in terms
of concatenate.
3) Ok, what's row_stack? It's:
  np.concatenate([np.atleast_2d(arr) for arr in arrs], axis=0)
4) So I need to make atleast_2d work on duck arrays. What's
atleast_2d? It's: asarray + some shape checks and indexing with
newaxis
5) Okay, so I need something atleast_2d can call instead of asarray
[2]. (A rough sketch of how these pieces fit together is below.)
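
To make that concrete, here's a minimal sketch. asduckarray is a
hypothetical name for the coercion hook from step 5, and the hasattr
check inside it is just a placeholder (the real dispatch rule is
exactly what's under discussion); the rest mirrors what np.atleast_2d
and np.row_stack do today:

    import numpy as np

    def asduckarray(a):
        # Hypothetical coercion hook: pass duck arrays through
        # untouched, coerce everything else. The check below is a
        # placeholder for whatever mechanism we settle on.
        if hasattr(a, "__array_ufunc__") and not isinstance(a, np.ndarray):
            return a
        return np.asarray(a)

    def atleast_2d(a):
        # Duck-friendly atleast_2d: coerce, then do the same shape
        # checks and newaxis indexing as the current implementation.
        a = asduckarray(a)
        if a.ndim == 0:
            return a.reshape(1, 1)
        elif a.ndim == 1:
            return a[np.newaxis, :]
        return a

    def row_stack(arrs):
        # row_stack expressed in terms of concatenate (step 3).
        return np.concatenate([atleast_2d(a) for a in arrs], axis=0)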

And this kind of pattern shows up everywhere inside numpy: it's the
first thing lots of functions in np.linalg do, because they do some
futzing with dtypes and shape before delegating to ufuncs; it's the
first thing the mean() function does, because it needs to check
arr.dtype before proceeding; etc. etc.
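
For instance, here's a sketch of what that first step looks like in
mean(), reusing the hypothetical asduckarray from above (the
int/bool-to-float64 dtype rule is what numpy's mean actually applies;
the rest is simplified):

    def mean(a):
        a = asduckarray(a)
        # The dtype check that currently forces coercion up front:
        # integer and bool inputs are accumulated in float64.
        if issubclass(a.dtype.type, (np.integer, np.bool_)):
            dtype = np.float64
        else:
            dtype = a.dtype
        # The real work is delegated to ufunc machinery, which duck
        # arrays can already hook via __array_ufunc__.
        return np.add.reduce(a, axis=None, dtype=dtype) / a.size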

So, we need something we can use in these functions as a first step
towards unlocking the use of duck arrays in general. But we can't
realistically go through each of these functions, make an exact list
of all the operations/attributes it cares about, and then come up with
exactly the right type constraint for it to impose at the top. And
these functions aren't generally going to work on LinearOperators or
h5py datasets anyway.

We also don't want to go through every function in numpy and add new
arguments to control this coercion behavior.

What we can do, at least to start, is to have a mechanism that passes
through objects that aspire to be "complete" duck arrays, like dask
arrays or sparse arrays or astropy's unit arrays, and then if it turns
out that in practice people find uses for finer-grained distinctions,
we can iteratively add those as a second pass. Notice that if a
function starts out requiring a "complete" duck array, and then later
relaxes that to accept "partial" duck arrays, that's actually
increasing the domain of objects that it can act on, so it's a
backwards-compatible change that we can do later.
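
Sticking with the ABC approach from the NEP, the pass-through could
be spelled roughly like this (names illustrative; this would replace
the placeholder check in the asduckarray sketch above):

    import abc
    import numpy as np

    class DuckArray(abc.ABC):
        # Hypothetical ABC that "complete" duck array types opt in
        # to, e.g. DuckArray.register(dask.array.Array).
        # Finer-grained ABCs for "partial" duck arrays could be
        # layered on later without breaking existing callers.
        pass

    def asduckarray(a):
        # Pass registered duck arrays through untouched; coerce
        # everything else with the usual asarray rules.
        if isinstance(a, DuckArray):
            return a
        return np.asarray(a)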

So I think we should start out with a concept of "duck array" that's
fairly strong but a bit vague on the exact details (e.g.,
dask.array.Array is currently missing some weird things like arr.ptp()
and arr.tolist(), I guess because no one has ever noticed or cared?).

------------

Thinking things through like this, I also realized that this proposal
jumps through hoops to avoid changing np.asarray itself, because I was
nervous about changing the rule that its output is always an
ndarray... but actually, this is currently the rule for most functions
in numpy, and the whole point of this proposal is to relax that rule
for most functions, in cases where the user is explicitly passing in a
duck-array object. So maybe I'm being overparanoid? I'm genuinely
unsure here.

Instead of messing about with ABCs, an alternative mechanism would be
to add a new method __arrayish__ (hat tip to Tom Caswell for the name
:-)) that essentially acts as an override for Python-level calls to
np.array / np.asarray, in much the same way that __array_ufunc__
overrides ufuncs, etc. (C-level calls to PyArray_FromAny and similar
would of course continue to return ndarray objects, and I assume we'd
add some argument like require_ndarray= that you could pass to
explicitly indicate whether you need C-level compatibility.)
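
In sketch form, the Python-level lookup might be something like this
(the protocol name and the require_ndarray= flag are of course
speculative):

    import numpy as np

    def asarray(a, require_ndarray=False):
        # Speculative: consult __arrayish__ the way ufuncs consult
        # __array_ufunc__. Callers that need C-level compatibility
        # would pass require_ndarray=True and always get an ndarray.
        cls = type(a)
        if not require_ndarray and hasattr(cls, "__arrayish__"):
            return cls.__arrayish__(a)
        return np.asarray(a)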

This would also allow objects like h5py datasets to *produce* an
arrayish object on demand, even if they aren't one themselves. (E.g.,
imagine some hdf5-like storage that holds sparse arrays instead of
regular arrays.)
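
For example, something like this made-up storage class, where sparse
is pydata/sparse (a "complete" duck array type):

    import sparse

    class SparseDataset:
        # Made-up hdf5-like storage holding a sparse array: not
        # arrayish itself, but able to produce an arrayish object on
        # demand when asarray-style coercion asks for one.
        def __init__(self, coords, data, shape):
            self._coords, self._data, self._shape = coords, data, shape

        def __arrayish__(self):
            return sparse.COO(self._coords, self._data,
                              shape=self._shape)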

I'm thinking I may write this option up as a second NEP, to compete
with my first one.

-n

[1] See: 
https://www.wiley.com/en-us/The+Search+for+the+Perfect+Language-p-9780631205104
[2] Actually atleast_2d calls asanyarray, not asarray, but that's just
a detail; the way to solve this problem for asanyarray is to first
solve it for asarray.

-- 
Nathaniel J. Smith -- https://vorpus.org