On Wed, Jun 3, 2020 at 11:16 AM Krisztián Szűcs <szucs.kriszt...@gmail.com> wrote:
>
> On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > <szucs.kriszt...@gmail.com> wrote:
> > >
> > > From the user perspective I find the following pretty confusing:
> > >
> > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > Out[1]: array([ 0, -2], dtype=int8)
> > >
> > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > Out[2]: array([-256, 254], dtype=int16)
> > >
> > > In my opinion somewhere (on a higher level maybe) we should provide
> > > the correct results promoted to a wider type implicitly.
> >
> > Yes, I agree with you, but I agree that the best place to address this
> > is at a higher level rather than having this logic implemented at the
> > lowest level (kernels) -- I think database systems handle this during
> > logical->physical planning.
>
> It raises another question: where to incorporate the implicit promotions?
> // correct me if I'm wrong but these implicit promotions are operation
> // dependent and distinct from kernel dispatching issue [1]
Implicit promotions would be handled when translating a logical expression
(like "Add(x, y)") to a physical bound expression (something like
"add_kernel[int16, int16](cast(x, int16), cast(y, int16))", where x and y
are int8). A rough sketch of that binding step is at the end of this message.

> The numpy example above can be roughly translated to:
> >>> a = pa.array([-128, 127])
> >>> pa.compute.add(a, a)
> array([ 0, -2])
>
> Which is rather surprising from the user's perspective.

I guess we have to qualify what a "user" is. The direct "user" of the
kernels is not the same as a user who would write a SQL query or use a
data frame library. I believe the contents of pyarrow.compute (and most of
pyarrow, FWIW) are intended for system developers, not end users (e.g.
"people who use pandas").

> [1] https://issues.apache.org/jira/browse/ARROW-8919
>
> > > Clickhouse for example does the type promotion.
> > >
> > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou <solip...@pitrou.net> wrote:
> > > >
> > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > Ben Kietzman <ben.kietz...@rstudio.com> wrote:
> > > > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > >
> > > > > How should arithmetic kernels handle integer overflow?
> > > > >
> > > > > The approach currently taken in the linked PR is to promote such that
> > > > > overflow will not occur, for example `(int8, int8)->int16` and
> > > > > `(uint16, uint16)->uint32`.
> > > > >
> > > > > I'm not sure that's desirable. For one thing this leads to
> > > > > inconsistent handling of 64 bit integer types, which are currently
> > > > > allowed to overflow since we cannot promote further (NB: that means
> > > > > this kernel includes undefined behavior for int64).
> > > >
> > > > I agree with you. I would strongly advise against implicit promotion
> > > > across arithmetic operations. We initially did that in Numba and it
> > > > quickly became a can of worms.
> > > >
> > > > The most desirable behaviour IMHO is to keep the original type, so:
> > > > - (int8, int8) -> int8
> > > > - (uint16, uint16) -> uint16
> > > >
> > > > Then the question is what happens when the actual overflow occurs. I
> > > > think this should be directed by a kernel option. By default an error
> > > > should probably be raised (letting errors pass and silently produce
> > > > erroneous data is wrong), but we might want to allow people to bypass
> > > > overflow checks for speed.
> > > >
> > > > Note that even if overflow detection is enabled, it *should* be possible
> > > > to enable vectorization, e.g. by making overflow detection a separate
> > > > pass (itself vectorizable).
> > > >
> > > > Regards
> > > >
> > > > Antoine.
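
To make that binding step concrete, here is a rough Python sketch of what a
planner could emit before dispatching to a type-preserving kernel. bind_add
and the promotion table are purely illustrative, not an existing pyarrow
API; the sketch only assumes Array.cast and pyarrow.compute.add:

    import pyarrow as pa
    import pyarrow.compute as pc

    # Hypothetical promotion table consulted by the logical->physical
    # "binding" step; the kernel itself stays type-preserving.
    PROMOTE = {pa.int8(): pa.int16(), pa.int16(): pa.int32(),
               pa.int32(): pa.int64()}

    def bind_add(x, y):
        # Illustrative only: insert implicit casts, then dispatch to the
        # kernel. Assumes x.type == y.type for brevity.
        target = PROMOTE.get(x.type, x.type)
        return pc.add(x.cast(target), y.cast(target))

    a = pa.array([-128, 127], type=pa.int8())
    print(bind_add(a, a))  # [-256, 254] as int16 instead of a silent wraparound

The point is that the widening lives entirely in the planning layer; the
kernel dispatched to is still (int16, int16) -> int16.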
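
And to illustrate Antoine's point that overflow detection can stay
vectorizable if it is made a separate pass, a minimal numpy sketch
(illustrative only, not how the Arrow kernels are or would be implemented):

    import numpy as np

    def checked_add_int8(x, y):
        # Pass 1 (vectorizable): detect which lanes would overflow int8.
        wide = x.astype(np.int16) + y.astype(np.int16)
        if ((wide < -128) | (wide > 127)).any():
            raise OverflowError("int8 addition overflowed")
        # Pass 2 (vectorizable): the ordinary type-preserving add.
        return x + y

    a = np.array([-128, 127], dtype=np.int8)
    b = np.ones(2, dtype=np.int8)
    checked_add_int8(a, b)  # raises OverflowError because 127 + 1 > 127

A real kernel would presumably run the check per chunk in C++ (or skip it
when the option to bypass overflow checks is set), but the structure -- one
vectorized pass for the check, one for the arithmetic -- is the same.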