Re: [DISCUSS][C++] Enabling finer-grained parallelism in compute operators, quantifying ExecBatch overhead

Wes McKinney Wed, 28 Jul 2021 14:36:20 -0700

On Wed, Jul 28, 2021 at 5:39 AM Antoine Pitrou <[email protected]> wrote:
>
>
> Le 28/07/2021 à 03:33, Wes McKinney a écrit :
> >
> > I don't have the solution worked out for this, but the basic gist is:
> >
> > * To be 10-100x more efficient ExecBatch slicing cannot call
> > ArrayData::Slice for every field like it does now
> > * Atomics associated with interacting with shared_ptr<ArrayData> /
> > shared_ptr<Buffer> do add meaningful overhead
> > * The way that array kernels are currently implemented would need to
> > shift to accommodate the changes needed to make ExecBatch lighter
> > weight.
> >
> > One initial option is to move the "batch offset" to the top level of
> > ExecBatch (to remove the need to copy ArrayData), but then quite a bit
> > of code would need to be adapted to combine that offset with the
> > ArrayData's offsets to compute memory addresses. If this memory
> > address arithmetic has leaked into kernel implementations, this might
> > be a good opportunity to unleak it. That wouldn't fix the
> > shared_ptr/atomics overhead, so I'm open to ideas about how that could
> > be addressed also.
>
> We could have a non-owning ArraySpan:
>
> struct ArraySpan {
>    ArrayData* data;
>    const int64_t offset, length;
>
>    int64_t absolute_offset() const {
>      return offset + data->offset;
>    }
> };
>
> And/or a more general (Datum-like) Span:
>
> class Span {
>    util::variant<ArraySpan*, Scalar*> datum_;
>
>   public:
>    // Datum-like accessors
> };
>
> or
>
> class Span {
>    util::variant<ArrayData*, Scalar*> datum_;
>    const int64_t offset_, length_;
>
>   public:
>    // Datum-like accessors
> };
>
>
> Then ExecBatch could be a glorified std::vector<Span>.


Yes, something like this might work. To make this complete, we would
also want to change the out-argument of ArrayKernelExec to be
something lighter-weight than Datum*

std::function<Status(KernelContext*, const ExecBatch&, Datum*)>

One thing to navigate would be kernels that do zero-copy [1] — the
output passed to a kernel would need to be a mutable Span that can
communicate to zero-copy implementations whether buffers can be
replaced outright or whether the span is a slice of some other
ArrayData a memcopy is required.

A prototype along with some benchmarks would help assess whether a
proposed design addresses the slicing cost to satisfaction. From a
glance through some of the kernel implementations, the porting would
be a labor-intensive but probably fairly mechanical project once the
details are worked out.

[1]: 
https://github.com/apache/arrow/blob/apache-arrow-5.0.0/cpp/src/arrow/compute/kernels/scalar_cast_internal.cc#L226

> (also, Arrow would probably still benefit from a small vector
> implementation... at least for internals, because we can't easily expose
> it in public-facing APIs in place of regular std::vector)
>
> Regards
>
> Antoine.

Re: [DISCUSS][C++] Enabling finer-grained parallelism in compute operators, quantifying ExecBatch overhead

Reply via email to