On Wed, Jul 28, 2021 at 5:39 AM Antoine Pitrou <anto...@python.org> wrote: > > > Le 28/07/2021 à 03:33, Wes McKinney a écrit : > > > > I don't have the solution worked out for this, but the basic gist is: > > > > * To be 10-100x more efficient ExecBatch slicing cannot call > > ArrayData::Slice for every field like it does now > > * Atomics associated with interacting with shared_ptr<ArrayData> / > > shared_ptr<Buffer> do add meaningful overhead > > * The way that array kernels are currently implemented would need to > > shift to accommodate the changes needed to make ExecBatch lighter > > weight. > > > > One initial option is to move the "batch offset" to the top level of > > ExecBatch (to remove the need to copy ArrayData), but then quite a bit > > of code would need to be adapted to combine that offset with the > > ArrayData's offsets to compute memory addresses. If this memory > > address arithmetic has leaked into kernel implementations, this might > > be a good opportunity to unleak it. That wouldn't fix the > > shared_ptr/atomics overhead, so I'm open to ideas about how that could > > be addressed also. > > We could have a non-owning ArraySpan: > > struct ArraySpan { > ArrayData* data; > const int64_t offset, length; > > int64_t absolute_offset() const { > return offset + data->offset; > } > }; > > And/or a more general (Datum-like) Span: > > class Span { > util::variant<ArraySpan*, Scalar*> datum_; > > public: > // Datum-like accessors > }; > > or > > class Span { > util::variant<ArrayData*, Scalar*> datum_; > const int64_t offset_, length_; > > public: > // Datum-like accessors > }; > > > Then ExecBatch could be a glorified std::vector<Span>.
Yes, something like this might work. To make this complete, we would also want to change the out-argument of ArrayKernelExec to be something lighter-weight than Datum* std::function<Status(KernelContext*, const ExecBatch&, Datum*)> One thing to navigate would be kernels that do zero-copy [1] — the output passed to a kernel would need to be a mutable Span that can communicate to zero-copy implementations whether buffers can be replaced outright or whether the span is a slice of some other ArrayData a memcopy is required. A prototype along with some benchmarks would help assess whether a proposed design addresses the slicing cost to satisfaction. From a glance through some of the kernel implementations, the porting would be a labor-intensive but probably fairly mechanical project once the details are worked out. [1]: https://github.com/apache/arrow/blob/apache-arrow-5.0.0/cpp/src/arrow/compute/kernels/scalar_cast_internal.cc#L226 > (also, Arrow would probably still benefit from a small vector > implementation... at least for internals, because we can't easily expose > it in public-facing APIs in place of regular std::vector) > > Regards > > Antoine.