Ping to finish up this discussion so we can come to a conclusion. I'm in
favor of the NEP, as I don't see it as incompatible with Nathaniel's
concerns. However, we might want to be selective about which functions we
expose via the `__array_function__` method.
On Wed, Aug 15, 2018 at 10:45 AM, Stephan Hoyer <sho...@gmail.com> wrote:

> Nathaniel,
>
> Thanks for raising these thoughtful concerns. Your independent review of
> this proposal is greatly appreciated!
>
> See my responses inline below:
>
> On Mon, Aug 13, 2018 at 2:44 AM Nathaniel Smith <n...@pobox.com> wrote:
>
>> The other approach would be to incrementally add clean, well-defined
>> dunder methods like __array_ufunc__, __array_concatenate__, etc. This
>> way we end up putting some thought into each interface, making sure
>> that it's something we can support, protecting downstream libraries
>> from unnecessary complexity (e.g. they can implement
>> __array_concatenate__ instead of hstack, vstack, row_stack,
>> column_stack, ...), or avoiding adding new APIs entirely (e.g., by
>> converting existing functions into ufuncs so __array_ufunc__ starts
>> automagically working). And in the end we get a clean list of dunder
>> methods that new array container implementations have to define. It's
>> plausible to imagine a generic test suite for array containers. (I
>> suspect that every library that tries to implement __array_function__
>> will end up with accidental behavioral differences, just because the
>> numpy API is so vast and contains so many corner cases.) So the
>> clean-well-defined-dunders approach has lots of upsides. The big
>> downside is that this is a much longer road to go down.
>
> RE: accidental differences in behavior:
>
> I actually think that the __array_function__ approach is *less* prone to
> accidental differences in behavior, because we require implementing every
> function directly (or it raises an error).
>
> This avoids a classic subclassing problem that has plagued NumPy for
> years, where overriding the behavior of method A causes apparently
> unrelated method B to break, because it relied on method A internally.
> In NumPy, this constrained our implementation of np.median(), because it
> needed to call np.mean() in order for subclasses implementing units to
> work properly.
>
> There will certainly be accidental differences in behavior for
> third-party code that *uses* NumPy, but this is basically inevitable for
> any proposal to allow NumPy's public API to be overloaded. It's also
> avoided by default by third-party libraries that follow the current best
> practice of casting all input arrays with np.asarray().
>
> --------------
>
> RE: a hypothetical simplified interface:
>
> The need to implement everything you want to use in NumPy's public API
> could certainly be onerous, but on the other hand there is a long list
> of projects that have already done this today -- and these are the
> projects that most need __array_function__.
>
> I'm sure there are cases where simplification would be warranted, but in
> particular I don't think __array_concatenate__ has significant
> advantages over simply implementing __array_function__ for
> np.concatenate. It's a slightly different way of spelling it, but it
> basically does the same thing. The level of complexity needed to
> implement hstack, vstack, row_stack and column_stack in terms of
> np.concatenate is pretty minimal. __array_function__ implementors could
> easily copy and paste code from NumPy or use a third-party helper
> library (like NDArrayOperatorsMixin) that provides such implementations.
>
> I also have other concerns about the "simplified API" approach beyond
> the difficulty of figuring it out, but those are already mentioned in
> the NEP:
> http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementations-in-terms-of-a-limited-core-api
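To make the hstack point concrete, here is a rough sketch, using a
hypothetical MyArray container and an `implements` registration helper
(illustrative names, not fixed by the NEP), of how the stacking functions
reduce to a single np.concatenate override:

    import numpy as np

    HANDLED_FUNCTIONS = {}

    def implements(numpy_function):
        # Register a MyArray implementation of a NumPy function.
        def decorator(func):
            HANDLED_FUNCTIONS[numpy_function] = func
            return func
        return decorator

    class MyArray:
        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_function__(self, func, types, args, kwargs):
            if func not in HANDLED_FUNCTIONS:
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

    @implements(np.concatenate)
    def concatenate(arrays, axis=0):
        # Assumes every element is a MyArray; see the 'types' discussion
        # below for how mixed inputs should be handled.
        return MyArray(np.concatenate([a.data for a in arrays], axis=axis))

    @implements(np.hstack)
    def hstack(arrays):
        # hstack is a one-line wrapper: concatenate along the first axis
        # for 1-d inputs and along the second axis otherwise.
        axis = 0 if arrays[0].data.ndim == 1 else 1
        return concatenate(arrays, axis=axis)

vstack, row_stack and column_stack would be similarly thin wrappers.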
>> But... this is wishful thinking. No matter what the NEP says, I simply
>> don't believe that we'll actually go break dask, sparse arrays,
>> xarray, and sklearn in a numpy point release. Or any numpy release.
>> Nor should we. If we're serious about keeping this experimental – and
>> I think that's an excellent idea for now! – then IMO we need to do
>> something more to avoid getting trapped by backwards compatibility.
>
> I agree, but to be clear, development for dask, sparse and xarray (and
> even broadly supported machine learning libraries like TensorFlow) still
> happens at a much faster pace than is currently the case for "core"
> projects in the SciPy stack like NumPy. It would not be a big deal to
> encounter breaking changes in a "major" NumPy release (i.e., 1.X ->
> 1.(X+1)).
>
> (Side note: sklearn doesn't directly implement any array types, so I
> don't think it would make use of __array_function__ in any way, except
> possibly to implement overloadable functions.)

Here is Travis Oliphant's talk at PyBay
<https://speakerdeck.com/teoliphant/ml-in-python>, where he talks about
the proliferation of arrays and interfaces in the ML/AI ecosystem, among
other things. I think that we should definitely try to get NumPy out
there as an option in the near future.

>> My suggestion: at numpy import time, check for an envvar, like say
>> NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1. If it's not set, then all the
>> __array_function__ dispatches turn into no-ops. This lets interested
>> downstream libraries and users try this out, but makes sure that we
>> won't have a hundred thousand end users depending on it without
>> realizing.
>>
>> - makes it easy for end-users to check how much overhead this adds (by
>>   running their code with it enabled vs disabled)
>> - if/when we decide to commit to supporting it for real, we just
>>   remove the envvar.
>
> I'm slightly concerned that the cost of reading an environment variable
> with os.environ could exaggerate the performance cost of
> __array_function__. It takes about 1 microsecond to read an environment
> variable on my laptop, which is comparable to the full overhead of
> __array_function__. So we may want to switch to an explicit Python API
> instead, e.g., np.enable_experimental_array_function().
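One way to keep the envvar approach while avoiding a ~1 µs os.environ
read on every call would be to resolve the flag once at numpy import
time, so the disabled path is a true no-op. A rough sketch
(array_function_dispatch is an assumed name for the internal wrapping
helper, and the dispatch loop is heavily simplified):

    import os

    # Read the opt-in flag once, at import time, rather than on every
    # function call.
    ENABLE_ARRAY_FUNCTION = bool(
        int(os.environ.get('NUMPY_EXPERIMENTAL_ARRAY_FUNCTION', '0')))

    def array_function_dispatch(dispatcher):
        def decorator(implementation):
            if not ENABLE_ARRAY_FUNCTION:
                # Opt-in not set: return the plain function, so the
                # experimental protocol adds zero per-call overhead.
                return implementation
            def public_api(*args, **kwargs):
                # Simplified: the real protocol would collect the types
                # of all relevant arguments, not dispatch one by one.
                for arg in dispatcher(*args, **kwargs):
                    method = getattr(type(arg), '__array_function__', None)
                    if method is not None:
                        result = method(arg, public_api, (type(arg),),
                                        args, kwargs)
                        if result is not NotImplemented:
                            return result
                return implementation(*args, **kwargs)
            return public_api
        return decorator

This would also answer the benchmarking use case: run the same code with
the variable set and unset and compare.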
> My bigger concern is when/how we decide to graduate __array_function__
> from requiring an explicit opt-in. We don't need to make a final
> decision now, but it would be good to be clear about what specifically
> we are waiting for.
>
> I see three types of likely scenarios for changing __array_function__:
> 1. We decide that overloading the NumPy namespace in general is a bad
>    idea, based on either performance or predictability consequences for
>    third-party libraries. In this case, I imagine we would probably
>    keep __array_function__, but revert to a separate namespace for
>    explicitly overloaded functions, e.g., numpy.api.
> 2. We want to keep __array_function__, but need a breaking change to
>    the interface (and we're really attached to keeping the name
>    __array_function__).
> 3. We decide that specific functions should use a different interface
>    (e.g., switch from __array_function__ to __array_ufunc__).
>
> (1) and (2) are the sort of major concerns that in my mind would
> warrant hiding a feature behind an experimental flag. For the most
> part, I expect (1) could be resolved relatively quickly by running
> benchmark suites after we have a working version of __array_function__.
> To be honest, I don't see either of these rollback scenarios as
> terribly likely, but the downside risk is large enough that we may want
> to protect ourselves for a major release or two (6-12 months).
>
> (3) will be a much longer process, likely to stretch out over years at
> the current pace of NumPy development. I don't think we'll want to keep
> an opt-in flag for this long of a period. Rather, we may want to accept
> a shorter deprecation cycle than usual. In most cases, I suspect we
> could incrementally switch to new overloads while preserving the
> __array_function__ overload for a release or two.
>
>> I don't really understand the 'types' frozenset. The NEP says "it will
>> be used by most __array_function__ methods, which otherwise would need
>> to extract this information themselves"... but they still need to
>> extract the information themselves, because they still have to examine
>> each object and figure out what type it is. And, simply creating a
>> frozenset costs ~0.2 µs on my laptop, which is overhead that we can't
>> possibly optimize later...
>
> The most flexible alternative would be to just say that we provide a
> fixed-length iterable, and return a tuple object. (In my
> microbenchmarks, it's faster to make a tuple than a list or set.) In an
> early draft of the NEP, I proposed exactly this, but the speed
> difference seemed really marginal to me.
>
> I included 'types' in the interface because I really do think it's
> something that almost all __array_function__ implementations should
> use. It preserves a nice separation of concerns between dispatching
> logic and implementations for a new type. At least as long as
> __array_function__ is experimental, I don't think we should be
> encouraging people to write functions that cannot return NotImplemented
> directly and instead rely entirely on the NumPy interface.
>
> Many but not all implementations will need to look at argument types.
> This is only really essential for cases where mixed operations between
> NumPy arrays and another type are allowed. If you only implement the
> NumPy interface for MyArray objects, then in the usual Python style you
> wouldn't need isinstance checks.
>
> It's also important from an ecosystem perspective. If we don't make it
> easy to get type information, my guess is that many __array_function__
> authors wouldn't bother to return NotImplemented for unexpected types,
> which means that __array_function__ will break in weird ways when used
> with objects from unrelated libraries.
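To make the 'types' check concrete, here is what that guard might look
like in the hypothetical MyArray sketch from above (again illustrative,
not prescribed by the NEP):

    import numpy as np

    HANDLED_FUNCTIONS = {}  # filled in via @implements, as sketched above

    class MyArray:
        # Types this container knows how to interoperate with, besides
        # itself.
        _HANDLED_TYPES = (np.ndarray,)

        def __init__(self, data):
            self.data = np.asarray(data)

        def __array_function__(self, func, types, args, kwargs):
            # One cheap check against 'types' up front keeps the
            # dispatch decision separate from the per-function
            # implementations.
            if not all(issubclass(t, (MyArray,) + self._HANDLED_TYPES)
                       for t in types):
                # An unrelated array library is involved: bow out so it
                # (or NumPy) can raise a clean TypeError, instead of our
                # code failing in some arbitrary way.
                return NotImplemented
            if func not in HANDLED_FUNCTIONS:
                return NotImplemented
            return HANDLED_FUNCTIONS[func](*args, **kwargs)

Without 'types', every implementation would have to walk args and kwargs
itself to recover the same information.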
Chuck