*internal.h headers

Benjamin Kietzman Tue, 17 Nov 2020 07:55:35 -0800

Hi Niranda,

hastebin: That looks generally correct, though I should warn you that a
recent PR
( https://github.com/apache/arrow/pull/8574 ) changed the return type of
DispatchExact to Kernel so you'll need to insert an explicit cast to
ScalarAggregateKernel.


1: This seems like a feature which might be generally useful to consumers of
the compute module, so it's probably worth adding to the KernelState
interface
in some way rather than exposing individual implementations. I've opened
https://issues.apache.org/jira/browse/ARROW-10630 to track this feature

2: I would not expect your container to contain a large number of
KernelStates.
Specifically: why would you need more than one per local thread?
Additionally
for the specific case of aggregation I'd expect that KernelStates not
actively in
use would be `merge`d and no longer stored. With small numbers of instances
I would expect the memory overhead due to polymorphism to be negligible.

On Mon, Nov 16, 2020 at 7:03 PM Niranda Perera <[email protected]>
wrote:

> Hi Ben and Wes,
> Based on our discussion, I did the following.
> https://hastebin.com/ajadonados.cpp
>
> It seems to be working fine. Would love to get your feedback on this! :-)
>
> But I have a couple of concerns.
> 1. Say I want to communicate the intermediate state data across multiple
> processes. Unfortunately, KernelState struct does not expose the data
> pointer to the outside. If say SumState is exposed, we could have accessed
> that data, isn't it? WDYT?
> 2. Polymorphism and virtual functions - Intuitively, a mean aggregation
> intermediate state would be a {T, int64} tuple. But I believe the size of
> struct "SumImpl : public ScalarAggregator (:public KernelState)" would be
> sizeof(T) + sizeof(int64) + sizeof(ScalarAggregator) + sizeof(vptr), isn't
> it? So, if I am managing a compute state container, this means that my
> memory requirement would be higher than simply using a {T, int64} tuple.
> Please correct me if I am wrong. I am not sure if there is a better
> solution to this, but just want to discuss it with you.
>
>
> On Tue, Nov 10, 2020 at 9:44 AM Wes McKinney <[email protected]> wrote:
>
>> Yes, open a Jira and propose a PR implementing the changes you need
>>
>> On Mon, Nov 9, 2020 at 8:31 PM Niranda Perera <[email protected]>
>> wrote:
>> >
>> > @wes How should I proceed with this nevertheless? should I open a JIRA?
>> >
>> > On Mon, Nov 9, 2020 at 11:09 AM Wes McKinney <[email protected]>
>> wrote:
>> >
>> > > On Mon, Nov 9, 2020 at 9:32 AM Niranda Perera <
>> [email protected]>
>> > > wrote:
>> > > >
>> > > > @Ben
>> > > > Thank you very much for the feedback. But unfortunately, I was
>> unable to
>> > > > find a header that exposes a SumAggregateKernel in the v2.0.0.
>> Maybe I am
>> > > > checking it wrong. I remember accessing them in v0.16 IINM.
>> > > >
>> > > > @Wes
>> > > > Yes, that would be great. How about adding a CMake compilation flag
>> for
>> > > > such dev use cases?
>> > > >
>> > >
>> > > This seems like it could cause more problems -- I think it would be
>> > > sufficient to use an "internal::" C++ namespace and always install the
>> > > relevant header file
>> > >
>> > > >
>> > > >
>> > > > On Sun, Nov 8, 2020 at 9:14 PM Wes McKinney <[email protected]>
>> wrote:
>> > > >
>> > > > > I'm not opposed to installing headers that provide access to some
>> of
>> > > > > the kernel implementation internals (with the caveat that changes
>> > > > > won't go through a deprecation cycle, so caveat emptor). It might
>> be
>> > > > > more sustainable to think about what kind of stable-ish public API
>> > > > > could be exported to support applications like Cylon.
>> > > > >
>> > > > > On Sun, Nov 8, 2020 at 10:37 AM Ben Kietzman <
>> [email protected]>
>> > > > > wrote:
>> > > > > >
>> > > > > > Hi Niranda,
>> > > > > >
>> > > > > > SumImpl is a subclass of KernelState. Given a
>> SumAggregateKernel,
>> > > one can
>> > > > > > produce zeroed KernelState using the `init` member, then
>> operate on
>> > > data
>> > > > > > using the `consume`, `merge`, and `finalize` members. You can
>> look at
>> > > > > > ScalarAggExecutor for an example of how to get from a compute
>> > > function to
>> > > > > > kernels and kernel state. Will that work for you?
>> > > > > >
>> > > > > > Ben Kietzman
>> > > > > >
>> > > > > > On Sun, Nov 8, 2020, 11:21 Niranda Perera <
>> [email protected]>
>> > > > > wrote:
>> > > > > >
>> > > > > > > Hi Ben,
>> > > > > > >
>> > > > > > > We are building a distributed table abstraction on top of
>> Arrow
>> > > > > dataframes
>> > > > > > > called Cylon (https://github.com/cylondata/cylon). Currently
>> we
>> > > have a
>> > > > > > > simple aggregation and group-by operation implementation. But
>> we
>> > > felt
>> > > > > like
>> > > > > > > we can give more functionality if we can import arrow kernels
>> and
>> > > > > states to
>> > > > > > > corresponding cylon distributed kernels.
>> > > > > > > Ex: For distributed mean, we would have to communicate the
>> local
>> > > arrow
>> > > > > > > SumState and then do a SumImpl::MergeFrom() and the call
>> Finalize.
>> > > > > > > Is there any other way to access these intermediate states
>> from
>> > > compute
>> > > > > > > operations?
>> > > > > > >
>> > > > > > > On Sun, Nov 8, 2020 at 11:11 AM Ben Kietzman <
>> > > [email protected]>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Ni Niranda,
>> > > > > > > >
>> > > > > > > > What is the context of your work? if you're working inside
>> the
>> > > arrow
>> > > > > > > > repository you shouldn't need to install headers before
>> using
>> > > them,
>> > > > > and
>> > > > > > > we
>> > > > > > > > welcome PRs for new kernels. Otherwise, could you provide
>> some
>> > > > > details
>> > > > > > > > about how your work is using Arrow as a dependency?
>> > > > > > > >
>> > > > > > > > Ben Kietzman
>> > > > > > > >
>> > > > > > > > On Sun, Nov 8, 2020, 10:57 Niranda Perera <
>> > > [email protected]>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > I was wondering if I could use the
>> > > > > arrow/compute/kernels/*internal.h
>> > > > > > > > > headers in my work? I would like to reuse some of the
>> kernel
>> > > > > > > > > implementations and kernel states.
>> > > > > > > > >
>> > > > > > > > > With -DARROW_COMPUTE=ON, those headers are not added into
>> the
>> > > > > include
>> > > > > > > > dir.
>> > > > > > > > > I see that the *internal.h headers are skipped from
>> > > > > > > > > the ARROW_INSTALL_ALL_HEADERS cmake function
>> unfortunately.
>> > > > > > > > >
>> > > > > > > > > Best
>> > > > > > > > > --
>> > > > > > > > > Niranda Perera
>> > > > > > > > > @n1r44 <https://twitter.com/N1R44>
>> > > > > > > > > +1 812 558 8884 / +94 71 554 8430
>> > > > > > > > > https://www.linkedin.com/in/niranda
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Niranda Perera
>> > > > > > > @n1r44 <https://twitter.com/N1R44>
>> > > > > > > +1 812 558 8884 / +94 71 554 8430
>> > > > > > > https://www.linkedin.com/in/niranda
>> > > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > Niranda Perera
>> > > > @n1r44 <https://twitter.com/N1R44>
>> > > > +1 812 558 8884 / +94 71 554 8430
>> > > > https://www.linkedin.com/in/niranda
>> > >
>> >
>> >
>> > --
>> > Niranda Perera
>> > @n1r44 <https://twitter.com/N1R44>
>> > +1 812 558 8884 / +94 71 554 8430
>> > https://www.linkedin.com/in/niranda
>>
>
>
> --
> Niranda Perera
> @n1r44 <https://twitter.com/N1R44>
> +1 812 558 8884 / +94 71 554 8430
> https://www.linkedin.com/in/niranda
>

Re: Using arrow/compute/kernels/*internal.h headers

Reply via email to