On Thu, Apr 9, 2020, 5:25 AM Antoine Pitrou <solip...@pitrou.net> wrote:
> > It seems there are two different concerns here: > - the kernels' public API > - the ease with which kernels can be implemented. > > If we need a different public API, then IMO we should change it sooner > rather than later. > Yes, based on the problems I've listed I think some significant time needs to be invested in this after 0.17.0 is released. I can allocate some of my own time to it. As for the implementation, perhaps we should start by drawing > the main categories of kernels. For example: > - item-wise kernels (and/or vector-wise) > - aggregate kernels > > Some kernels will not fall in these categories (e.g. filter, join...), > but many of them do. > Yes, it's certainly important to distinguish been "operators" (which are data flow processing nodes like joins, aggregations, filters, projections) and functions / kernels (which perform granular units of work on concrete batches of data). Some operators can be used in "one-shot" mode (eg obtaining a result from a static unit of data) though. Regards > > Antoine. > > > On Wed, 8 Apr 2020 16:04:30 -0500 > Wes McKinney <wesmck...@gmail.com> wrote: > > On Wed, Apr 8, 2020 at 4:01 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > Another idea would be to have a variant with const-references instead > > > of shared_ptr. One potential issue with our Datum is that it plays the > > > dual role of transporting both input and output arguments. With > > > outputs it's necessary to be able to convey ownership while with > > > inputs this is less important. > > > > > > In general, I would say that some additional thought is needed about > > > our kernel APIs. Some scattered thoughts about this: > > > > > > * Kernel "binding" (selecting a kernel given input types) is a bit ad > > > hoc. Many kernels do not have the ability to tell you up front what > > > type of data will be returned (this is very important in the context > > > of a query engine runtime) > > > > Another observation is that many of our kernels are unable to write > > into preallocated memory. This is also a design issue that must be > > addressed (kernels having the same output type invoked in succession > > should be able to elide temporary allocations by writing into the same > > output memory) > > > > > * It's unclear that having a large collection of > > > $FUNC(FunctionContext*, Datum $arg0, ..., Datum* out) is sustainable. > > > > > > Rather, it might be better to have something like: > > > > > > // We know the "schema" of each argument but don't have any data yet > > > std::string kernel_name = "$name"; > > > auto bound_kernel = GetKernel(kernel_name, {arg0_shape, arg1_shape}, > options); > > > auto out = bound_kernel->Call({arg0, arg1}); > > > > > > So here it would be > > > > > > ARROW_ASSIGN_OR_RAISE(auto take_kernel, > > > GetKernel("take", {shapes::chunked_array(int8()), > > > > shapes::chunked_array(int8())})); > > > > > > So now take_kernel->shape() should tell you that the expected output > > > is shapes::chunked_array(int8()) > > > > > > And Call should preferably accept reference arguments for its input > parameters. > > > > > > As far as kernel-specific options, we could create a variant that > > > includes the different kernel-specific options structures, so that > > > it's not necessary to have different dispatch functions for different > > > kernels. > > > > > > Anyway, just some out loud thoughts, but while we are in this > > > intermediate experimental state I'm supportive of you making the > > > decision that is most convenient for the immediate problem you want to > > > solve with the caveat that things may be refactored significantly in > > > the proximate future. > > > > > > - Wes > > > > > > On Tue, Apr 7, 2020 at 10:42 AM Uwe L. Korn <uw...@xhochy.com> > wrote: > > > > > > > > I did a bit more research on JIRA and we seem to have this open > topic there also in https://issues.apache.org/jira/browse/ARROW-6959 > which is the similar topic as my mail is about and in > https://issues.apache.org/jira/browse/ARROW-7009 we wanted to remove some > of the interfaces with reference-types. > > > > > > > > On Tue, Apr 7, 2020, at 1:00 PM, Uwe L. Korn wrote: > > > > > Hello all, > > > > > > > > > > I'm in the progress of changing the implementation of the Take > kernel > > > > > to work on ChunkedArrays without concatenating them into a single > Array > > > > > first. While working on the implementation, I realised that we > switch > > > > > often between Datum and the specific-typed parameters. This works > quite > > > > > well for the combination of Array& and > Datum(shared_ptr<ArrayData>) as > > > > > here the reference object with type Array& always carries a shared > > > > > reference with it, so switching between Array& and its Datum is > quite > > > > > easy. > > > > > > > > > > In contrast, we cannot do this with ChunkedArrays as here the Datum > > > > > requires a shared_ptr<ChunkedArray> which cannot be constructed > from > > > > > the reference type. Thus to allow interfaces like `Status > > > > > Take(FunctionContext* ctx, const ChunkedArray& values, const Array& > > > > > indices,` to pass successfully their arguments to the Kernel > > > > > implementation, we have to do: > > > > > > > > > > a) Remove the references from the interface of the Take() function > and > > > > > use `shared_ptr` instances everywhere. > > > > > b) Add interfaces to kernels like the TakeKernel that allow calling > > > > > with specific references instead of Datum instances > > > > > > > > > > Personally I would prefer b) as this allow us to make more use of > the > > > > > C++ type system and would also avoid the shared_ptr overhead where > not > > > > > necessary. > > > > > > > > > > Cheers, > > > > > Uwe > > > > > > > > > > >