Hi,

One aspect of the design of "arrow2" is that it deals with array slices
differently from the rest of the implementations. Essentially, the offset
is not stored in ArrayData, but on each individual Buffer. Some important
consequences are:

* people can work with buffers and bitmaps without having to drag the
corresponding array offset along with them (a common source of
unsoundness in the official Rust implementation)
* arrays can store buffers/bitmaps with independent offsets
* it does not round-trip over the C data interface at zero cost, because
the C data interface only allows a single offset per array, not one per
buffer/bitmap.
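To make the first two points concrete, here is a minimal sketch of a bitmap that carries its own bit-level offset (a hypothetical `Bitmap` type for illustration, not arrow2's actual API). Slicing only adjusts the offset and length, so no bits are ever copied:

```rust
/// A bitmap that carries its own offset in bits (the arrow2-style design).
struct Bitmap {
    bytes: Vec<u8>,
    offset: usize, // offset into `bytes`, in bits
    len: usize,    // length, in bits
}

impl Bitmap {
    /// Slicing is O(1): only the offset and length change; the underlying
    /// bytes are shared (cloned cheaply here for simplicity).
    fn slice(&self, offset: usize, len: usize) -> Bitmap {
        assert!(offset + len <= self.len);
        Bitmap {
            bytes: self.bytes.clone(),
            offset: self.offset + offset,
            len,
        }
    }

    /// Read bit `i` relative to this bitmap's own offset; callers never
    /// need to track an array-level offset separately.
    fn get(&self, i: usize) -> bool {
        assert!(i < self.len);
        let bit = self.offset + i;
        (self.bytes[bit / 8] >> (bit % 8)) & 1 == 1
    }
}

fn main() {
    let b = Bitmap { bytes: vec![0b1010_1010], offset: 0, len: 8 };
    let s = b.slice(1, 4); // no bits are copied
    assert!(s.get(0)); // bit 1 of the original buffer
}
```

Because the offset lives on the bitmap itself, two buffers of the same array are free to have different offsets, which is exactly what the single array-level offset of the C data interface cannot express.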

I have been benchmarking the consequences of this design choice and reached
the conclusion that storing the offset on a per-buffer basis offers at
least a 15% improvement in compute (results vary by kernel and likely by
implementation).

To understand why this is the case, consider comparing two boolean arrays
(a, b), where "a" has been sliced and has a validity bitmap and "b" does
not. In this case, we could compare the values of the arrays (taking "a"'s
offset into account) and simply clone "a"'s validity. However, this does
not work, because the validity is offset while the result of comparing the
values is not. Thus, we need to create a shifted copy of the validity. In
my benchmarks, creating this shifted copy accounts for 15% of the total
compute time.

The root cause is that the C data interface declares an offset on the
ArrayData, as opposed to an offset on each of the buffers contained in it.
With an offset shared between buffers, we can't slice individual bitmap
buffers, which prevents us from leveraging the optimization of simply
cloning buffers instead of copying them.

I wonder whether this was discussed previously, and whether the decision to
have a single offset per array in the C data interface considered this
performance implication.

At the moment, the solution we adopted is to incur the cost of
"de-offsetting" buffers when passing offset arrays via the C data
interface: this way, users benefit from faster compute kernels and only pay
this cost when it is strictly needed for the C data interface. However, my
understanding is that this design choice affects the compute kernels of
most implementations, since they all perform a copy to de-offset the sliced
buffers on every operation over sliced arrays. Is that correct?

Best,
Jorge
