Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Wes McKinney Thu, 09 Apr 2020 06:28:25 -0700

On Thu, Apr 9, 2020, 5:25 AM Antoine Pitrou <solip...@pitrou.net> wrote:


>
> It seems there are two different concerns here:
> - the kernels' public API
> - the ease with which kernels can be implemented.
>
> If we need a different public API, then IMO we should change it sooner
> rather than later.
>

Yes, based on the problems I've listed I think some significant time needs
to be invested in this after 0.17.0 is released. I can allocate some of my
own time to it.


As for the implementation, perhaps we should start by drawing
> the main categories of kernels.  For example:
> - item-wise kernels (and/or vector-wise)
> - aggregate kernels
>
> Some kernels will not fall in these categories (e.g. filter, join...),
> but many of them do.
>

Yes, it's certainly important to distinguish been "operators" (which are
data flow processing nodes like joins, aggregations, filters, projections)
and functions / kernels (which perform granular units of work on concrete
batches of data). Some operators can be used in "one-shot" mode (eg
obtaining a result from a static unit of data) though.


Regards
>
> Antoine.
>
>
> On Wed, 8 Apr 2020 16:04:30 -0500
> Wes McKinney <wesmck...@gmail.com> wrote:
> > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > Another idea would be to have a variant with const-references instead
> > > of shared_ptr. One potential issue with our Datum is that it plays the
> > > dual role of transporting both input and output arguments. With
> > > outputs it's necessary to be able to convey ownership while with
> > > inputs this is less important.
> > >
> > > In general, I would say that some additional thought is needed about
> > > our kernel APIs. Some scattered thoughts about this:
> > >
> > > * Kernel "binding" (selecting a kernel given input types) is a bit ad
> > > hoc. Many kernels do not have the ability to tell you up front what
> > > type of data will be returned (this is very important in the context
> > > of a query engine runtime)
> >
> > Another observation is that many of our kernels are unable to write
> > into preallocated memory. This is also a design issue that must be
> > addressed (kernels having the same output type invoked in succession
> > should be able to elide temporary allocations by writing into the same
> > output memory)
> >
> > > * It's unclear that having a large collection of
> > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable.
> > >
> > > Rather, it might be better to have something like:
> > >
> > > // We know the "schema" of each argument but don't have any data yet
> > > std::string kernel_name = "$name";
> > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape},
> options);
> > > auto out = bound_kernel->Call({arg0, arg1});
> > >
> > > So here it would be
> > >
> > > ARROW_ASSIGN_OR_RAISE(auto take_kernel,
> > >                       GetKernel("take", {shapes::chunked_array(int8()),
> > >
> shapes::chunked_array(int8())}));
> > >
> > > So now take_kernel->shape() should tell you that the expected output
> > > is shapes::chunked_array(int8())
> > >
> > > And Call should preferably accept reference arguments for its input
> parameters.
> > >
> > > As far as kernel-specific options, we could create a variant that
> > > includes the different kernel-specific options structures, so that
> > > it's not necessary to have different dispatch functions for different
> > > kernels.
> > >
> > > Anyway, just some out loud thoughts, but while we are in this
> > > intermediate experimental state I'm supportive of you making the
> > > decision that is most convenient for the immediate problem you want to
> > > solve with the caveat that things may be refactored significantly in
> > > the proximate future.
> > >
> > > - Wes
> > >
> > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com>
> wrote:
> > > >
> > > > I did a bit more research on JIRA and we seem to have this open
> topic there also in https://issues.apache.org/jira/browse/ARROW-6959
> which is the similar topic as my mail is about and in
> https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some
> of the interfaces with reference-types.
> > > >
> > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote:
> > > > > Hello all,
> > > > >
> > > > > I'm in the progress of changing the implementation of the Take
> kernel
> > > > > to work on ChunkedArrays without concatenating them into a single
> Array
> > > > > first. While working on the implementation, I realised that we
> switch
> > > > > often between Datum and the specific-typed parameters. This works
> quite
> > > > > well for the combination of Array& and
> Datum(shared_ptr<ArrayData>) as
> > > > > here the reference object with type Array& always carries a shared
> > > > > reference with it, so switching between Array& and its Datum is
> quite
> > > > > easy.
> > > > >
> > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum
> > > > > requires a shared_ptr<ChunkedArray> which cannot be constructed
> from
> > > > > the reference type. Thus to allow interfaces like `Status
> > > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array&
> > > > > indices,` to pass successfully their arguments to the Kernel
> > > > > implementation, we have to do:
> > > > >
> > > > > a) Remove the references from the interface of the Take() function
> and
> > > > > use `shared_ptr` instances everywhere.
> > > > > b) Add interfaces to kernels like the TakeKernel that allow calling
> > > > > with specific references instead of Datum instances
> > > > >
> > > > > Personally I would prefer b) as this allow us to make more use of
> the
> > > > > C++ type system and would also avoid the shared_ptr overhead where
> not
> > > > > necessary.
> > > > >
> > > > > Cheers,
> > > > > Uwe
> > > > >
> >
>
>
>
>

Re: [C++] Compute: Datum and "ChunkedArray&" inputs

Reply via email to