Hi,

One aspect of the design of "arrow2" is that it handles array slices differently from the other implementations: the offset is not stored in ArrayData, but on each individual Buffer. Some important consequences are:
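To make the per-buffer-offset design concrete, here is a minimal sketch (not arrow2's actual API; the `Bitmap` type and its methods are hypothetical simplifications) of a bitmap that carries its own offset, so slicing is O(1) and never touches the underlying bytes:

```rust
/// Hypothetical, simplified bitmap illustrating the per-buffer-offset design:
/// the offset lives on the bitmap itself, not on the containing array,
/// so each buffer/bitmap of an array can be sliced independently.
#[derive(Clone)]
struct Bitmap {
    bytes: std::sync::Arc<Vec<u8>>, // shared, immutable bit-packed storage
    offset: usize,                  // offset in bits, local to this bitmap
    len: usize,                     // length in bits
}

impl Bitmap {
    fn new(bytes: Vec<u8>, len: usize) -> Self {
        Self { bytes: std::sync::Arc::new(bytes), offset: 0, len }
    }

    /// Slicing is O(1): only `offset` and `len` change; storage is shared.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len);
        Self { bytes: self.bytes.clone(), offset: self.offset + offset, len }
    }

    /// Read bit `i` relative to this bitmap's own offset (LSB bit order).
    fn get(&self, i: usize) -> bool {
        let bit = self.offset + i;
        (self.bytes[bit / 8] >> (bit % 8)) & 1 == 1
    }
}
```

Because the offset is part of the bitmap, code that consumes it never needs to thread a separate array-level offset alongside it, which is the unsoundness hazard mentioned above.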
* people can work with buffers and bitmaps without having to drag the corresponding array offset with them (a common source of unsoundness in the official Rust implementation)
* arrays can store buffers/bitmaps with independent offsets
* arrays do not roundtrip over the C data interface at zero cost, because the C data interface only allows a single offset per array, not one per buffer/bitmap

I have been benchmarking the consequences of this design choice and concluded that storing the offset on a per-buffer basis offers at least a 15% improvement in compute (results vary by kernel and likely by implementation).

To understand why, consider comparing two boolean arrays (a, b), where "a" has been sliced and has a validity and "b" does not. In this case we could compare the values of the arrays (taking "a"'s offset into account) and simply clone "a"'s validity. However, this does not work, because the validity is "offsetted" while the result of the comparison of the values is not. Thus, we need to create a shifted copy of the validity. On my benches, creating this shifted copy accounts for 15% of the total compute time.

The root cause is that the C data interface declares an offset on the ArrayData, as opposed to an offset on each of the buffers contained in it. With an offset shared between buffers, we cannot slice individual bitmap buffers, which prevents us from leveraging the optimization of simply cloning buffers instead of copying them.

I wonder whether this was discussed previously, and whether the "single offset per array in the C data interface" design considered this performance implication.
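The shifted copy can be sketched as follows (a naive, hypothetical illustration, not arrow2's actual kernel): because the sliced validity starts at a non-zero bit offset while the freshly computed comparison result starts at bit 0, the validity bits must be materialized into a realigned buffer before the two can share an array:

```rust
/// Naive bit-by-bit realignment of an offsetted bitmap to offset 0.
/// This is the O(n) copy that the per-buffer-offset design avoids:
/// with an offset stored on the bitmap itself, one would just clone it.
fn shift_copy(bytes: &[u8], offset_bits: usize, len_bits: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len_bits + 7) / 8];
    for i in 0..len_bits {
        let bit = offset_bits + i;
        // Read bit `offset_bits + i` from the source (LSB bit order)
        // and write it at position `i` in the realigned output.
        if (bytes[bit / 8] >> (bit % 8)) & 1 == 1 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}
```

Real implementations vectorize this with word-sized shifts rather than a per-bit loop, but the copy remains O(n) per kernel invocation over a sliced array, which is where the measured 15% goes.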
At the moment, the solution we adopted is to incur the penalty of "de-offsetting" buffers when passing offsetted arrays via the C data interface: users benefit from faster compute kernels and only pay this cost when it is strictly needed for the C data interface. But is my understanding correct that this design choice affects the compute kernels of most implementations, since they all perform a copy to de-offset the sliced buffers on every operation over sliced arrays?

Best,
Jorge