I documented [1] the behaviors by experimentation or by reading the documentation. My experiments were mostly about checking INT64_MAX + 1. My preference would be to use the platform-defined behavior by default and to provide a safety option that errors.
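To make the experiment concrete: a minimal pure-Python sketch of what "platform-defined behavior" typically means for INT64_MAX + 1, namely two's-complement wraparound on most systems (the helper name is mine, not from the spreadsheet):

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63

def wrap_int64(x: int) -> int:
    """Reduce an unbounded Python int to signed 64-bit two's-complement."""
    return ((x + 2**63) % 2**64) - 2**63

print(wrap_int64(INT64_MAX + 1))  # -9223372036854775808, i.e. INT64_MIN
```

A system with the "safety option" enabled would raise an error instead of returning the wrapped value.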
Feel free to add more databases/systems.

François

[1] https://docs.google.com/spreadsheets/d/1DTFER2arYyNkjEHd1jCLdhQ90kRHbNrVI-UvHvQiP6Y/edit?usp=sharing

On Thu, Jun 4, 2020 at 9:22 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> On Thu, Jun 4, 2020 at 4:57 AM Krisztián Szűcs
> <szucs.kriszt...@gmail.com> wrote:
> >
> > On Thu, Jun 4, 2020 at 11:09 AM Rémi Dettai <rdet...@gmail.com> wrote:
> > >
> > > It makes sense to me that the default behaviour of such a low-level API
> > > as a kernel does not do any automagic promotion, but shouldn't this kind
> > > of promotion still be requestable by the so-called "system developer"
> > > user? Otherwise they would need to materialize a promoted version of
> > > each original array before the kernel operation, wouldn't they?
>
> This will all be handled by the expression execution system. Please, I
> beg your collective patience, as I will be developing this in the near
> future.
>
> > I assume yes, for 1 million elements:
> >
> > Promotion inside the kernel:
> > u32 + u32 = u64
> > 4 MB + 4 MB -> 8 MB
> > New allocation: 8 MB
> >
> > Promotion outside the kernel:
> > (u32 -> u64) + (u32 -> u64) = u64
> > (4 MB -> 8 MB) + (4 MB -> 8 MB) -> 8 MB
> > New allocation: 24 MB
>
> This isn't what the execution engine will do, though. The processed
> chunks will be much smaller, like 64K elements or fewer, so at no point
> will such large temporary allocations exist.
>
> > > On Wed, Jun 3, 2020 at 18:27, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs
> > > > <szucs.kriszt...@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs
> > > > > <szucs.kriszt...@gmail.com> wrote:
> > > > > >
> > > > > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > > > > > > <szucs.kriszt...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > From the user's perspective I find the following pretty confusing:
> > > > > > > >
> > > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > > > > > > Out[1]: array([ 0, -2], dtype=int8)
> > > > > > > >
> > > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > > > > > > Out[2]: array([-256, 254], dtype=int16)
> > > > > > > >
> > > > > > > > In my opinion, somewhere (on a higher level, maybe) we should
> > > > > > > > provide the correct results, promoted to a wider type implicitly.
> > > > > > >
> > > > > > > Yes, I agree with you, but I think the best place to address this
> > > > > > > is at a higher level rather than having this logic implemented at
> > > > > > > the lowest level (kernels) -- I think database systems handle this
> > > > > > > during logical->physical planning.
> > > > > >
> > > > > > It raises another question: where do we incorporate the implicit
> > > > > > promotions?
> > > > > > // correct me if I'm wrong, but these implicit promotions are
> > > > > > // operation dependent and distinct from the kernel dispatching
> > > > > > // issue [1]
> > > > > >
> > > > > > The numpy example above can be roughly translated to:
> > > > > > >>> a = pa.array([-128, 127])
> > > > > > >>> pa.compute.add(a, a)
> > > > > > array([0, -2])
> > > > > >
> > > > > > Which is rather surprising from the user's perspective.
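The int8 surprise quoted above can be reproduced without NumPy or Arrow; a small pure-Python sketch simulating 8-bit two's-complement wraparound (the helper name is illustrative, not any library's API):

```python
def wrap_int8(x: int) -> int:
    """Reduce an unbounded Python int to signed 8-bit two's-complement."""
    return ((x + 128) % 256) - 128

values = [-128, 127]
print([wrap_int8(v * 2) for v in values])  # [0, -2]: the wrapped int8 result
print([v * 2 for v in values])             # [-256, 254]: the exact (int16-wide) result
```

The first line is what an unchecked int8 kernel produces; the second is what implicit promotion to a wider type would produce.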
> > > > >
> > > > > Would it be enough to document the exact behavior and advise the
> > > > > user to place casts until we have the logical -> physical machinery?
> > > >
> > > > I think it's enough to clearly document the behavior and assume that
> > > > the "user" will act according to what semantics are desired for their
> > > > use cases. Per my comments in my last e-mail, I don't think the users
> > > > of these functions need to be handled with "kid gloves".
> > > > >
> > > > > I'm updating my PR as discussed.
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/ARROW-8919
> > > > > > > >
> > > > > > > > Clickhouse, for example, does the type promotion.
> > > > > > > >
> > > > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou <solip...@pitrou.net> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > > > > > > Ben Kietzman <ben.kietz...@rstudio.com> wrote:
> > > > > > > > > >
> > > > > > > > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > > > > > > >
> > > > > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > > > > >
> > > > > > > > > > The approach currently taken in the linked PR is to promote
> > > > > > > > > > such that overflow will not occur, for example
> > > > > > > > > > `(int8, int8)->int16` and `(uint16, uint16)->uint32`.
> > > > > > > > > >
> > > > > > > > > > I'm not sure that's desirable. For one thing, this leads to
> > > > > > > > > > inconsistent handling of 64-bit integer types, which are
> > > > > > > > > > currently allowed to overflow since we cannot promote further
> > > > > > > > > > (NB: that means this kernel includes undefined behavior for
> > > > > > > > > > int64).
> > > > > > > > >
> > > > > > > > > I agree with you. I would strongly advise against implicit
> > > > > > > > > promotion across arithmetic operations. We initially did that
> > > > > > > > > in Numba and it quickly became a can of worms.
> > > > > > > > >
> > > > > > > > > The most desirable behaviour IMHO is to keep the original type, so:
> > > > > > > > > - (int8, int8) -> int8
> > > > > > > > > - (uint16, uint16) -> uint16
> > > > > > > > >
> > > > > > > > > Then the question is what happens when the actual overflow
> > > > > > > > > occurs. I think this should be directed by a kernel option. By
> > > > > > > > > default an error should probably be raised (letting errors pass
> > > > > > > > > and silently producing erroneous data is wrong), but we might
> > > > > > > > > want to allow people to bypass overflow checks for speed.
> > > > > > > > >
> > > > > > > > > Note that even if overflow detection is enabled, it *should* be
> > > > > > > > > possible to enable vectorization, e.g. by making overflow
> > > > > > > > > detection a separate pass (itself vectorizable).
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Antoine.
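One way to read the two-pass suggestion above is: an unchecked, wrapping arithmetic pass followed by a separate overflow-detection pass, each independently vectorizable, with a kernel option to skip the check. A rough pure-Python sketch of that shape, not Arrow's actual API (the function names and the `check_overflow` option are assumptions):

```python
def wrap_int8(x: int) -> int:
    # Signed 8-bit two's-complement wraparound.
    return ((x + 128) % 256) - 128

def add_int8(a, b, check_overflow=True):
    """Element-wise int8 addition: a wrapping pass, then a separate overflow scan."""
    # Pass 1: unchecked, wrapping addition (vectorizable).
    wrapped = [wrap_int8(x + y) for x, y in zip(a, b)]
    # Pass 2: overflow detection, itself vectorizable: overflow occurred
    # wherever the wrapped result differs from the exact sum.
    if check_overflow and any(w != x + y for w, (x, y) in zip(wrapped, zip(a, b))):
        raise OverflowError("int8 overflow in add")
    return wrapped

print(add_int8([1, 2], [3, 4]))                                  # [4, 6]
print(add_int8([-128, 127], [-128, 127], check_overflow=False))  # [0, -2]
```

With `check_overflow=True` (the safe default argued for above), the second call would raise instead of silently producing [0, -2].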