On Sat, Mar 10, 2018 at 4:27 AM, Matthew Rocklin <mrock...@gmail.com> wrote:
> I'm very glad to see this discussion.
>
> I think that coming up with a single definition of array-like may be
> difficult, and that we might end up wanting to embrace duck typing instead.
>
> It seems to me that different array-like classes will implement different
> mixtures of features. It may be difficult to pin down a single definition
> that includes anything except for the most basic attributes (shape and
> dtype?). Consider two extreme cases of restrictive functionality:
>
> LinearOperators (support dot in a numpy-like way)
> Storage objects like h5py (support getitem in a numpy-like way)
>
> I can imagine authors of both groups saying that they should qualify as
> array-like because downstream projects that consume them should not convert
> them to numpy arrays in important contexts.

I think this is an important point -- there are a lot of subtleties in the interfaces that different objects might want to provide. Some interesting ones that haven't been mentioned:

- a "duck array" that has everything except fancy indexing
- xarray's arrays, which are just like numpy arrays in most ways but have incompatible broadcasting semantics
- immutable vs. mutable arrays

When faced with this kind of situation, it's always tempting to try to write down some classification system that captures every possible configuration of interesting behavior. In fact, this is one of the most classic nerd snipes; it's been catching people for literally thousands of years [1]. Most of these attempts fail, though :-).

So let's back up -- I probably erred in not making this clearer in the NEP, but I actually have a fairly concrete use case in mind here. What happened is, I started working on a NEP for __array_concatenate__, and my thought process went as follows:

1) Cool, this should work for np.concatenate.

2) But what about all the other variants, like np.row_stack? We don't want __array_row_stack__; we want to express row_stack in terms of concatenate.

3) OK, what's row_stack? It's:
   np.concatenate([np.atleast_2d(arr) for arr in arrs], axis=0)

4) So I need to make atleast_2d work on duck arrays. What's atleast_2d? It's: asarray + some shape checks and indexing with newaxis.

5) OK, so I need something atleast_2d can call instead of asarray [2].
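To make step 5 concrete, here's a rough sketch. The names are made up for illustration -- asduckarray and the __duck_array__ opt-in marker aren't real numpy APIs, and the pass-through rule is just a placeholder -- but the atleast_2d body mirrors what numpy actually does today:

    import numpy as np

    def asduckarray(a):
        # Hypothetical coercion hook: pass through objects that claim to
        # be "complete" duck arrays; fall back to ordinary coercion for
        # everything else. (__duck_array__ is a made-up marker attribute,
        # not a real protocol.)
        if hasattr(type(a), "__duck_array__"):
            return a
        return np.asarray(a)

    def atleast_2d(*arys):
        # Same logic as numpy's atleast_2d, with only the coercion step
        # swapped out.
        res = []
        for ary in arys:
            ary = asduckarray(ary)
            if ary.ndim == 0:
                result = ary.reshape(1, 1)
            elif ary.ndim == 1:
                result = ary[np.newaxis, :]
            else:
                result = ary
            res.append(result)
        return res[0] if len(res) == 1 else res

    def row_stack(arrs):
        # row_stack expressed in terms of concatenate, per step 3.
        return np.concatenate([atleast_2d(arr) for arr in arrs], axis=0)

The point being: only the first line of atleast_2d has to change -- the shape checks and newaxis indexing already work on anything sufficiently ndarray-like.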
And this kind of pattern shows up everywhere inside numpy: it's the first thing inside lots of functions in np.linalg, because they do some futzing with dtypes and shapes before delegating to ufuncs; it's the first thing mean() does, because it needs to check arr.dtype before proceeding; etc., etc.

So we need something we can use in these functions as a first step towards unlocking the use of duck arrays in general. But we can't realistically go through each of these functions, make an exact list of all the operations/attributes it cares about, and then come up with exactly the right type constraint for it to impose at the top -- and these functions aren't generally going to work on LinearOperators or h5py datasets anyway. We also don't want to go through every function in numpy and add new arguments to control this coercion behavior.

What we can do, at least to start, is to have a mechanism that passes through objects that aspire to be "complete" duck arrays, like dask arrays or sparse arrays or astropy's unit arrays. Then, if it turns out that in practice people find uses for finer-grained distinctions, we can iteratively add those as a second pass. Notice that if a function starts out requiring a "complete" duck array and later relaxes that to accept "partial" duck arrays, that increases the domain of objects it can act on, so it's a backwards-compatible change we can make later.

So I think we should start out with a concept of "duck array" that's fairly strong but a bit vague on the exact details. (E.g., dask.array.Array is currently missing some weird things like arr.ptp() and arr.tolist(), I guess because no one has ever noticed or cared?)

------------

Thinking things through like this, I also realized that this proposal jumps through hoops to avoid changing np.asarray itself, because I was nervous about changing the rule that its output is always an ndarray... but actually, that's currently the rule for most functions in numpy, and the whole point of this proposal is to relax that rule for most functions, in cases where the user is explicitly passing in a duck-array object. So maybe I'm being overparanoid? I'm genuinely unsure here.

Instead of messing about with ABCs, an alternative mechanism would be to add a new method __arrayish__ (hat tip to Tom Caswell for the name :-)) that essentially acts as an override for Python-level calls to np.array / np.asarray, in much the same way that __array_ufunc__ overrides ufuncs, etc. (C-level calls to PyArray_FromAny and similar would of course continue to return ndarray objects, and I assume we'd add some argument like require_ndarray= that you could pass to explicitly indicate whether you needed C-level compatibility.)
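For concreteness, here's roughly what that dispatch might look like. This is a sketch of a protocol that doesn't exist yet, so everything is up for debate: the asarrayish wrapper name and MyDuckArray class are placeholders, and the exact signature and require_ndarray= handling are just one possible spelling. (__array__ is the one real protocol used here.)

    import numpy as np

    def asarrayish(a, dtype=None, require_ndarray=False):
        # Sketch of a Python-level asarray replacement: if the object
        # opts in via __arrayish__ and the caller doesn't insist on a
        # real ndarray, let it provide its own arrayish representation.
        if not require_ndarray and hasattr(type(a), "__arrayish__"):
            return a.__arrayish__(dtype=dtype)
        # Otherwise fall back to the usual coercion path.
        return np.asarray(a, dtype=dtype)

    class MyDuckArray:
        def __init__(self, data):
            self._data = np.asarray(data)

        def __arrayish__(self, dtype=None):
            # A "complete" duck array can just return itself.
            if dtype is None or dtype == self._data.dtype:
                return self
            return MyDuckArray(self._data.astype(dtype))

        def __array__(self, dtype=None):
            # Existing protocol: used when an actual ndarray is required.
            return self._data if dtype is None else self._data.astype(dtype)

    # asarrayish(MyDuckArray([1, 2, 3]))                       -> MyDuckArray
    # asarrayish(MyDuckArray([1, 2, 3]), require_ndarray=True) -> ndarray

Returning self is just the simplest case, of course; the interesting thing about the hook is that it's a chance to return *any* arrayish object.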
This would also allow objects like h5py datasets to *produce* an arrayish object on demand, even if they aren't one themselves. (E.g., imagine some hdf5-like storage that holds sparse arrays instead of regular arrays.) I'm thinking I may write this option up as a second NEP, to compete with my first one.

-n

[1] See: https://www.wiley.com/en-us/The+Search+for+the+Perfect+Language-p-9780631205104

[2] Actually, atleast_2d calls asanyarray, not asarray, but that's just a detail; the way to solve this problem for asanyarray is to first solve it for asarray.

--
Nathaniel J. Smith -- https://vorpus.org