On Tue, 2019-06-18 at 04:28 +0200, Hameer Abbasi wrote:
> On Wed, 2019-06-12 at 12:55 -0500, Sebastian Berg wrote:
> > On Wed, 2019-06-05 at 15:41 -0500, Sebastian Berg wrote:
> > > Hi all,
> > > 
> > > TL;DR:
> > > 
> > > Value based promotion seems complex both for users and
> > > ufunc-dispatching/promotion logic. Is there any way we can move
> > > forward here, and if we do, could we just risk some possible
> > > (maybe not-existing) corner cases to break early to get on the
> > > way?
> > 
> > Hi all,
> > 
> > just to note: I think I will go forward trying to fill the hole in
> > the hierarchy with a non-existing uint7 dtype. That seemed like it
> > may be ugly, but if it does not escalate too much, it is probably
> > fairly straightforward. And it would allow us to simplify
> > dispatching without any logic change at all. After that we could
> > still decide to change the logic.
> 
> Hi Sebastian!
> 
> This seems like the right approach to me as well; I would just add
> one additional comment. Earlier on, you mentioned that a lot of
> "strange" dtypes will pop up when dealing with floats/ints, e.g.
> int15, int31, int63, int52 (for checking double-compat), int23
> (single compat), int10 (half compat), and so on and so forth. The
> lookup table would get tricky to populate by hand --- it might be
> worth it to use the logic I suggested to autogenerate it in some
> way, or to "determine" the temporary underspecified type, as
> Nathaniel proposed in his email to the list. That is, we store the
> number of:
> 
> * flag (0 for numeric, 1 for non-numeric)
> * sign bits (0 for unsigned ints, 1 otherwise)
> * integer/fraction bits (self-explanatory)
> * exponent bits (self-explanatory)
> * log-number of items (0 for real, 1 for complex, 2 for quaternion,
>   etc.) (I propose log because the Cayley-Dickson algebras [1]
>   require a power of two)
> 
> A type is safely castable to another if all of these numbers are
> exceeded or met.
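The schema above can be sketched in a few lines of Python. This is
purely hypothetical code (none of these names are existing NumPy API;
the fraction/exponent bit counts for float32/float64 are the IEEE 754
values), just to make the "all numbers exceeded or met" rule concrete:

```python
from typing import NamedTuple

class TypeSpec(NamedTuple):
    flag: int       # 0 for numeric, 1 for non-numeric
    sign_bits: int  # 0 for unsigned ints, 1 otherwise
    int_bits: int   # integer/fraction (mantissa) bits
    exp_bits: int   # exponent bits
    log_items: int  # 0 real, 1 complex, 2 quaternion, ...

def safe_cast(a: TypeSpec, b: TypeSpec) -> bool:
    # a is safely castable to b iff every count of a is met or
    # exceeded by the corresponding count of b
    return all(x <= y for x, y in zip(a, b))

uint8   = TypeSpec(0, 0, 8, 0, 0)
int16   = TypeSpec(0, 1, 15, 0, 0)
float32 = TypeSpec(0, 1, 23, 8, 0)   # IEEE single: 23 fraction bits
float64 = TypeSpec(0, 1, 52, 11, 0)  # IEEE double: 52 fraction bits

print(safe_cast(uint8, int16))     # True:  8 value bits fit in 15
print(safe_cast(int16, float32))   # True:  15 bits fit in 23
print(safe_cast(float64, float32)) # False: 52 fraction bits do not
```

Note how the "double-compat" int52 and "single-compat" int23 types
mentioned above fall out of this automatically, since an integer type
is safely castable to a float type exactly when its value bits fit in
the float's fraction bits.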
> This would give us a clean way for registering new numeric types,
> while also cleanly hooking into the type system, and solving the
> casting scenario. Of course, I'm not proposing we generate the loops
> for or provide all these types ourselves, but simply that we allow
> people to define dtypes using such a schema. I do worry that we're
> special-casing numbers here, but it is "Num"Py, so I'm also not too
> worried.
> 
> This flexibility would, for example, allow us to easily define a
> bfloat16/bcomplex32 type with all the "can_cast" logic in place,
> even if people have to register their own casts or loops (and just
> to be clear, we error if they are not). It also makes it easy to
> define loops for int128 and so on if they come along.
> 
> The only open question left here is: what to do with a case like
> int64 + uint64? What I propose is that we abandon purity for
> pragmatism here and tell ourselves that losing one sign bit is
> tolerable 90% of the time, and going to floating-point is probably
> worse. It's more of a range-versus-accuracy question, and I would
> argue that people using integers expect exactness. Of course, I
> doubt anyone is actually relying on the fact that adding two
> integers produces floating-point results, and it has been the cause
> of at least one bug, which highlights that integers can be used in
> places where floats cannot. [0]
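The int64 + uint64 precision-loss problem described above is easy to
demonstrate with current NumPy: the promotion goes to float64, which
has only 52 fraction bits, so integers above 2**53 are silently
rounded:

```python
import numpy as np

# int64 + uint64 promotes to float64 ...
print(np.result_type(np.int64, np.uint64))  # float64

# ... which silently loses precision for large values:
big = np.uint64(2**63 - 1)
res = big + np.int64(0)
print(res.dtype)              # float64
print(int(res) == 2**63 - 1)  # False: the value was rounded to 2**63
```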
P.S. Someone collected a list of issues where the automatic float
conversion breaks things; it's old, but it does highlight the
importance of the issue:
[0] https://github.com/numpy/numpy/issues/12525#issuecomment-457727726

Hameer Abbasi

> 
> Hameer Abbasi
> 
> [0] https://github.com/numpy/numpy/issues/9982
> [1] https://en.wikipedia.org/wiki/Cayley%E2%80%93Dickson_construction
> 
> > Best,
> > 
> > Sebastian
> > 
> > 
> > > -----------
> > > 
> > > Currently when you write code such as:
> > > 
> > > arr = np.array([1, 43, 23], dtype=np.uint16)
> > > res = arr + 1
> > > 
> > > Numpy uses fairly sophisticated logic to decide that `1` can be
> > > represented as a uint16, and thus for all unary functions (and
> > > most others as well), the output will have a `res.dtype` of
> > > uint16.
> > > 
> > > Similar logic also exists for floating point types, where a
> > > lower precision floating point can be used:
> > > 
> > > arr = np.array([1, 43, 23], dtype=np.float32)
> > > (arr + np.float64(2.)).dtype  # will be float32
> > > 
> > > Currently, this value based logic is enforced by checking
> > > whether the cast is possible: "4" can be cast to int8 or uint8.
> > > So the first call above will at some point check if
> > > "uint16 + uint16 -> uint16" is a valid operation, find that it
> > > is, and thus stop searching. (There is the additional logic that
> > > when both/all operands are scalars, it is not applied.)
> > > 
> > > Note that this is defined in terms of casting: casting "1" to
> > > uint8 is considered safe even though 1 may be typed as int64.
> > > This logic thus affects all promotion rules as well (i.e. what
> > > the output dtype should be).
> > > 
> > > 
> > > There are 2 main discussion points/issues about it:
> > > 
> > > 1. Should value based casting/promotion logic exist at all?
> > > 
> > > Arguably an `np.int32(3)` has type information attached to it,
> > > so why should we ignore it?
> > > It can also be tricky for users, because a small change in
> > > values can change the result data type.
> > > Because 0-D arrays and scalars are too close inside numpy (you
> > > will often not know which one you get), there is not much option
> > > but to handle them identically. However, it seems pretty odd
> > > that:
> > > 
> > > * `np.array(3, dtype=np.int32)` + np.arange(10, dtype=int8)
> > > * `np.array([3], dtype=np.int32)` + np.arange(10, dtype=int8)
> > > 
> > > give a different result.
> > > 
> > > This is a bit different for python scalars, which do not have a
> > > type attached already.
> > > 
> > > 
> > > 2. Promotion and type resolution in ufuncs:
> > > 
> > > What is currently bothering me is that the decision what the
> > > output dtypes should be currently depends on the values in
> > > complicated ways. It would be nice if we could decide which type
> > > signature to use without actually looking at values (or at least
> > > only very early on).
> > > 
> > > One reason here is caching and simplicity. I would like to be
> > > able to cache which loop should be used for what input. Having
> > > value based casting in there bloats up the problem.
> > > Of course it currently works OK, but especially when user dtypes
> > > come into play, caching would seem like a nice optimization
> > > option.
> > > 
> > > Because `uint8(127)` can also be an `int8`, but `uint8(128)`
> > > cannot, it is not as simple as finding the "minimal" dtype once
> > > and working with that.
> > > Of course Eric and I discussed this a bit before, and you could
> > > create an internal "uint7" dtype which has the only purpose of
> > > flagging that a cast to int8 is safe.
> > > 
> > > I suppose it is possible I am barking up the wrong tree here,
> > > and this caching/predictability is not vital (or can be solved
> > > with such an internal dtype easily, although I am not sure it
> > > seems elegant).
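The behaviors discussed above can be checked directly. A small
demonstration (note that the dtypes of the 0-D case depend on the
promotion rules of the NumPy version in use, so only the
version-independent results carry assertions here):

```python
import numpy as np

# Value-based promotion: the Python int 1 fits in uint16, so the
# result stays uint16 rather than being upcast to a larger type.
arr = np.array([1, 43, 23], dtype=np.uint16)
print((arr + 1).dtype)  # uint16

# The 0-D / 1-D asymmetry: under the value-based rules discussed
# here, the 0-D operand's int32 dtype is ignored in favor of its
# value, while the 1-D operand's dtype is honored.
a = np.array(3, dtype=np.int32) + np.arange(10, dtype=np.int8)
b = np.array([3], dtype=np.int32) + np.arange(10, dtype=np.int8)
print(a.dtype, b.dtype)  # dtypes differ under value-based rules
```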
> > > 
> > > 
> > > Possible options to move forward
> > > --------------------------------
> > > 
> > > I still have to see a bit how tricky things are. But there are a
> > > few possible options. I would like to move the scalar logic to
> > > the beginning of ufunc calls:
> > > * The uint7 idea would be one solution.
> > > * Simply implement something that works for numpy and all
> > >   except strange external ufuncs (I can only think of numba as
> > >   a plausible candidate for creating such).
> > > 
> > > My current plan is to see where the second thing leaves me.
> > > 
> > > We also should see if we cannot move the whole thing forward, in
> > > which case the main decision would have to be: forward to where?
> > > My opinion is currently that when a type has a dtype clearly
> > > associated with it, we should always use that dtype in the
> > > future. This mostly means that numpy dtypes such as `np.int64`
> > > will always be treated like an int64, and never like a `uint8`
> > > because they happen to be castable to that.
> > > 
> > > For values without a dtype attached (read: python integers,
> > > floats), I see three options, from more complex to simpler:
> > > 
> > > 1. Keep the current logic in place as much as possible.
> > > 2. Only support value based promotion for operators, e.g.:
> > >    `arr + scalar` may do it, but `np.add(arr, scalar)` will
> > >    not. The upside is that it limits the complexity to a much
> > >    simpler problem; the downside is that the ufunc call and
> > >    operator match less clearly.
> > > 3. Just associate python float with float64 and python integers
> > >    with long/int64 and force users to always type them
> > >    explicitly if they need to.
> > > 
> > > The downside of 1. is that it doesn't help with simplifying the
> > > current situation all that much, because we still have the
> > > special casting around...
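For illustration, `np.min_scalar_type` exposes exactly the "minimal
dtype" problem that the uint7 idea above is meant to solve: 127 fits
in both int8 and uint8, but NumPy must report a single dtype, so the
information "an int8 cast would also be safe" is lost --- which is
what a hypothetical internal uint7 would flag:

```python
import numpy as np

print(np.min_scalar_type(127))  # uint8 (but int8 would also be safe)
print(np.min_scalar_type(128))  # uint8 (here int8 is NOT safe)
print(np.min_scalar_type(-1))   # int8  (negative forces signed)
```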
> > > 
> > > 
> > > I have realized that this got much too long, so I hope it makes
> > > sense. I will continue to dabble along on these things a bit, so
> > > if nothing else maybe writing it helps me to get a bit clearer
> > > on things...
> > > 
> > > Best,
> > > 
> > > Sebastian

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion