Re: [DISCUSS] Add kernel integer overflow handling

Wes McKinney Thu, 04 Jun 2020 06:23:23 -0700

On Thu, Jun 4, 2020 at 4:57 AM Krisztián Szűcs
<szucs.kriszt...@gmail.com> wrote:
>
> On Thu, Jun 4, 2020 at 11:09 AM Rémi Dettai <rdet...@gmail.com> wrote:
> >
> > It makes sense to me that the default behaviour of such a low level api as
> > kernel does not do any automagic promotion, but shouldn't this kind of
> > promotion still be requestable by the so called "system developer" user ?
> > Otherwise he would need to materialize a promoted version of each original
> > array before the kernel operation, wouldn't he ?


This will all be handled by the expression execution system. Please, I
beg your collective patience as I will be developing this in the near
future.

> I assume yes, for 1 million elements:
>
> Promotion inside kernel:
> u32 + u32 = u64
> 4MB + 4MB -> 8MB
> New allocation: 8MB
>
> Promotion outside kernel:
> (u32 -> u64) + (u32 -> u64) = u64
> (4MB -> 8MB) + (4MB -> 8MB) -> 8MB
> New allocation: 24MB

This isn't what the execution engine will do, though. The processed
chunks will be much smaller, like 64K elements or less, so at no point
will such large temporary allocations exist.

> >
> > Le mer. 3 juin 2020 à 18:27, Wes McKinney <wesmck...@gmail.com> a écrit :
> >
> > > On Wed, Jun 3, 2020 at 11:25 AM Krisztián Szűcs
> > > <szucs.kriszt...@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 3, 2020 at 6:16 PM Krisztián Szűcs
> > > > <szucs.kriszt...@gmail.com> wrote:
> > > > >
> > > > > On Wed, Jun 3, 2020 at 5:52 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > On Wed, Jun 3, 2020 at 10:49 AM Krisztián Szűcs
> > > > > > <szucs.kriszt...@gmail.com> wrote:
> > > > > > >
> > > > > > > From the user perspective I find the following pretty confusing:
> > > > > > >
> > > > > > > In [1]: np.array([-128, 127], dtype=np.int8()) * 2
> > > > > > > Out[1]: array([ 0, -2], dtype=int8)
> > > > > > >
> > > > > > > In [2]: np.array([-128, 127], dtype=np.int16()) * 2
> > > > > > > Out[2]: array([-256,  254], dtype=int16)
> > > > > > >
> > > > > > > In my opinion somewhere (on a higher level maybe) we should 
> > > > > > > provide
> > > > > > > the correct results promoted to a wider type implicitly.
> > > > > >
> > > > > > Yes, I agree with you, but I agree that the best place to address
> > > this
> > > > > > is at a higher level rather than having this logic implemented at 
> > > > > > the
> > > > > > lowest level (kernels) -- I think database systems handle this 
> > > > > > during
> > > > > > logical->physical planning.
> > > > >
> > > > > It raises another question: where to incorporate the implicit
> > > promotions?
> > > > > // correct me if I'm wrong but these implicit promotions are operation
> > > > > // dependent and distinct from kernel dispatching issue [1]
> > > > >
> > > > > The numpy example above can be roughly translated to:
> > > > > >>> a = pa.array([-128, 127])
> > > > > >>> pa.compute.add(a, a)
> > > > > array([ 0, -2]
> > > > >
> > > > > Which is rather surprising from the user's perspective.
> > > >
> > > > Would it be enough to document the exact behavior  and advice the user
> > > > to place casts until we have logical -> phisycal machinery?
> > >
> > > I think it's enough to clearly document the behavior and assume that
> > > the "user" will act according to what semantics are desired for their
> > > use cases. Per my comments in my last e-mail I don't think the users
> > > of these functions need to be handled with "kid's gloves".
> > >
> > > > I'm updating my PR as discussed.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/ARROW-8919
> > > > > >
> > > > > > > Clickhouse for example does the type promotion.
> > > > > > >
> > > > > > > On Wed, Jun 3, 2020 at 5:29 PM Antoine Pitrou 
> > > > > > > <solip...@pitrou.net>
> > > wrote:
> > > > > > > >
> > > > > > > > On Wed, 3 Jun 2020 10:47:38 -0400
> > > > > > > > Ben Kietzman <ben.kietz...@rstudio.com> wrote:
> > > > > > > > >
> > > https://github.com/apache/arrow/pull/7341#issuecomment-638241193
> > > > > > > > >
> > > > > > > > > How should arithmetic kernels handle integer overflow?
> > > > > > > > >
> > > > > > > > > The approach currently taken in the linked PR is to promote
> > > such that
> > > > > > > > > overflow will not occur, for example `(int8, int8)->int16` and
> > > `(uint16,
> > > > > > > > > uint16)->uint32`.
> > > > > > > > >
> > > > > > > > > I'm not sure that's desirable. For one thing this leads to
> > > inconsistent
> > > > > > > > > handling of 64 bit integer types, which are currently allowed
> > > to overflow
> > > > > > > > > since we cannot promote further (NB: that means this kernel
> > > includes
> > > > > > > > > undefined behavior for int64).
> > > > > > > >
> > > > > > > > I agree with you.  I would strongly advise against implicit
> > > promotion
> > > > > > > > accross arithmetic operations.  We initially did that in Numba
> > > and it
> > > > > > > > quickly became a can of worms.
> > > > > > > >
> > > > > > > > The most desirable behaviour IMHO is to keep the original type,
> > > so:
> > > > > > > > - (int8, int8) -> int8
> > > > > > > > - (uint16, uint16) -> uint16
> > > > > > > >
> > > > > > > > Then the question is what happens when the actual overflow
> > > occurs.  I
> > > > > > > > think this should be directed by a kernel option.  By default an
> > > error
> > > > > > > > should probably be raised (letting errors pass and silently
> > > produce
> > > > > > > > erroneous data is wrong), but we might want to allow people to
> > > bypass
> > > > > > > > overflow checks for speed.
> > > > > > > >
> > > > > > > > Note that even if overflow detection is enabled, it *should* be
> > > possible
> > > > > > > > to enable vectorization, e.g. by making overflow detection a
> > > separate
> > > > > > > > pass (itself vectorizable).
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > > >
> > > > > > > >
> > >

Re: [DISCUSS] Add kernel integer overflow handling

Reply via email to