Hi all,

Maybe to clarify this at least a little, here are some examples of what
currently happens and of what I could imagine we might move to (all in
terms of output dtype).
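Two helpers expose the machinery behind all of the examples below, so
the results can be checked without building arrays; a small sketch of
the current behaviour:

import numpy as np

# min_scalar_type finds the smallest dtype that can hold a value
# (positive python ints map to unsigned types):
np.min_scalar_type(127)    # uint8
np.min_scalar_type(-1)     # int8
np.min_scalar_type(2**20)  # uint32

# result_type applies the same value-based promotion rules:
np.result_type(np.int8, 127)     # int8  -- the value fits
np.result_type(np.int8, 128)     # int16 -- it no longer fits
np.result_type(np.float32, 12.)  # float32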
float32_arr = np.ones(10, dtype=np.float32)
int8_arr = np.ones(10, dtype=np.int8)
uint8_arr = np.ones(10, dtype=np.uint8)

Current behaviour:
------------------

float32_arr + 12.     # float32
float32_arr + 2**200  # float64 (because np.float32(2**200) == np.inf)

int8_arr + 127    # int8
int8_arr + 128    # int16
int8_arr + 2**20  # int32
uint8_arr + -1    # int16

# But only for arrays that are not 0d:
int8_arr + np.array(1, dtype=np.int32)    # int8
int8_arr + np.array([1], dtype=np.int32)  # int32

# When the actual typing is given, this does not change:
float32_arr + np.float64(12.)                  # float32
float32_arr + np.array(12., dtype=np.float64)  # float32

# Except when the scalar is of a higher kind (inexact, or complex):
int8_arr + np.float16(3)  # float16 (same as the 0d-array behaviour)

# The exact same happens with all ufuncs:
np.add(float32_arr, 1)                                # float32
np.add(float32_arr, np.array(12., dtype=np.float64))  # float32

Keeping value based casting only for python types
-------------------------------------------------

In this case, most examples above stay unchanged, because they use
plain python integers or floats, such as 2, 127, 12., 3, ... without
any type information attached (unlike e.g. `np.float64(12.)`). These
change, for example:

float32_arr + np.float64(12.)                  # float64
float32_arr + np.array(12., dtype=np.float64)  # float64
np.add(float32_arr, np.array(12., dtype=np.float64))  # float64

# A typed scalar such as `np.int32(1)` then behaves just like
# `np.array(1, dtype=np.int32)`:
int8_arr + np.int32(1)      # int32
int8_arr + np.int32(2**20)  # int32

Remove value based casting completely
-------------------------------------

We could simply abolish it completely; a python `1` would then always
behave the same as `np.int_(1)`. The downside of this is that:

int8_arr + 1  # int64 (or int32)

suddenly uses much more memory. Or, we remove it from ufuncs, but not
from operators:

int8_arr + 1  # int8 dtype

but:

np.add(int8_arr, 1)            # int64
# same as:
np.add(int8_arr, np.array(1))  # int64

The main reason why I was wondering about this is that for operators
the logic seems fairly simple, but for general ufuncs it seems more
complex.

Best,

Sebastian

On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> Hi all,
>
> TL;DR:
>
> Value based promotion seems complex both for users and for the
> ufunc-dispatching/promotion logic. Is there any way we can move
> forward here, and if we do, could we just risk some possible (maybe
> non-existing) corner cases breaking early to get on the way?
>
> -----------
>
> Currently when you write code such as:
>
> arr = np.array([1, 43, 23], dtype=np.uint16)
> res = arr + 1
>
> NumPy uses fairly sophisticated logic to decide that `1` can be
> represented as a uint16, and thus for all binary functions (and most
> others as well), the output will have a `res.dtype` of uint16.
>
> Similar logic also exists for floating point types, where a lower
> precision floating point can be used:
>
> arr = np.array([1, 43, 23], dtype=np.float32)
> (arr + np.float64(2.)).dtype  # will be float32
>
> Currently, this value based logic is enforced by checking whether the
> cast is possible: "4" can be cast to int8 or uint8. So the first call
> above will at some point check whether "uint16 + uint16 -> uint16" is
> a valid operation, find that it is, and thus stop searching. (There
> is the additional logic that when both/all operands are scalars, it
> is not applied.)
>
> Note that while this is defined in terms of casting ("1" can safely
> be cast to uint8 even though it may be typed as int64), the logic
> thus affects all promotion rules as well (i.e. what the output dtype
> should be).
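The casting check just described is visible directly through
`np.can_cast`, which applies the same value-inspecting rule when it is
handed a Python scalar (a small sketch of the current behaviour):

np.can_cast(100, np.int8)       # True  -- the value fits in int8
np.can_cast(150, np.int8)       # False -- 150 is out of range for int8
np.can_cast(150, np.uint8)      # True
np.can_cast(np.int64, np.int8)  # False -- dtypes, unlike values, never demote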
> There are 2 main discussion points/issues about it:
>
> 1. Should value based casting/promotion logic exist at all?
>
> Arguably an `np.int32(3)` has type information attached to it, so why
> should we ignore it? It can also be tricky for users, because a small
> change in values can change the result data type.
> Because 0-D arrays and scalars are too close inside numpy (you will
> often not know which one you get), there is not much option but to
> handle them identically. However, it seems pretty odd that:
>
> * `np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)`
> * `np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)`
>
> give a different result.
>
> This is a bit different for python scalars, which do not already have
> a type attached.
>
> 2. Promotion and type resolution in ufuncs:
>
> What is bothering me is that the decision what the output dtypes
> should be currently depends on the values in complicated ways. It
> would be nice if we could decide which type signature to use without
> actually looking at values (or at least only very early on).
>
> One reason here is caching and simplicity. I would like to be able to
> cache which loop should be used for which input. Having value based
> casting in there bloats up the problem. Of course it currently works
> OK, but especially when user dtypes come into play, caching would
> seem like a nice optimization option.
>
> Because `uint8(127)` can also be an `int8`, but `uint8(128)` cannot,
> it is not as simple as finding the "minimal" dtype once and working
> with that. Of course Eric and I discussed this a bit before, and you
> could create an internal "uint7" dtype whose only purpose is to flag
> that a cast to int8 is safe.
>
> I suppose it is possible I am barking up the wrong tree here, and
> this caching/predictability is not vital (or can be solved with such
> an internal dtype easily, although I am not sure that seems elegant).
>
>
> Possible options to move forward
> --------------------------------
>
> I still have to see a bit how tricky things are. But there are a few
> possible options. I would like to move the scalar logic to the
> beginning of ufunc calls:
>
> * The uint7 idea would be one solution.
> * Simply implement something that works for numpy and everything
>   except strange external ufuncs (I can only think of numba as a
>   plausible candidate for creating such).
>
> My current plan is to see where the second option leaves me.
>
> We should also see if we cannot move the whole thing forward, in
> which case the main decision would be where to move it to. My opinion
> is currently that when a type clearly has a dtype associated with it,
> we should always use that dtype in the future. This mostly means that
> numpy dtypes such as `np.int64` will always be treated like an int64,
> and never like a `uint8` just because they happen to be castable to
> that.
>
> For values without a dtype attached (read: python integers and
> floats), I see three options, from more complex to simpler:
>
> 1. Keep the current logic in place as much as possible.
> 2. Only support value based promotion for operators, e.g.:
>    `arr + scalar` may do it, but `np.add(arr, scalar)` will not.
>    The upside is that it limits the complexity to a much simpler
>    problem; the downside is that the ufunc call and the operator
>    then match less clearly.
> 3. Just associate python float with float64 and python integers with
>    long/int64 and force users to always type them explicitly if they
>    need to (sketched just below).
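To make option 3 concrete: the result could then be computed from the
attached dtypes alone. A minimal, hypothetical sketch (the helper name
`fixed_result_dtype` is made up; this is not current NumPy behaviour):

import numpy as np

def fixed_result_dtype(*operands):
    """Option 3: map each python scalar to one fixed dtype, then
    promote on dtypes alone -- values are never inspected."""
    dtypes = []
    for op in operands:
        if isinstance(op, bool):  # check bool first: bool is an int subclass
            dtypes.append(np.dtype(np.bool_))
        elif isinstance(op, int):
            dtypes.append(np.dtype(np.int_))       # python int -> long/int64
        elif isinstance(op, float):
            dtypes.append(np.dtype(np.float64))    # python float -> float64
        elif isinstance(op, complex):
            dtypes.append(np.dtype(np.complex128))
        else:
            dtypes.append(np.asarray(op).dtype)    # arrays/typed scalars keep theirs
    return np.result_type(*dtypes)

fixed_result_dtype(np.ones(10, dtype=np.int8), 1)       # int64, not int8
fixed_result_dtype(np.ones(10, dtype=np.float32), 12.)  # float64, not float32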
> The downside of 1. is that it doesn't help with simplifying the
> current situation all that much, because we still have the special
> casting around...
>
> I have realized that this got much too long, so I hope it makes
> sense. I will continue to dabble along on these things a bit, so if
> nothing else maybe writing it helps me get a bit clearer on things...
>
> Best,
>
> Sebastian