Re: [Discuss] Single offset per array has a non-trivial performance implication

Jorge Cardoso Leitão Wed, 27 Oct 2021 22:50:40 -0700

Hi,

> A big +1 to this, covering all the edge cases with slices is pretty
> complicated (there was at least one long standing bug related to this in
> the 6.0 release).  I imagine there are potentially more lurking in the code
> base.

Thanks for this observation. arrow-rs faces a similar issue: it is
relatively easy to hit bugs because of the offset.

> To be clear, this only comes into play for bit buffers (such as the
> validity bitmap), right?  Otherwise, the offset can just be incorporated
> into the buffer's base pointer.

Exactly. Except for BooleanArray, the "offset" in the C data interface is
effectively just an offset into the validity bitmap. I raised ARROW-14453 [1]
to try to mitigate this, but it does not solve it completely.
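
To make the distinction concrete, here is a minimal Python sketch (not code
from arrow2 or arrow-rs). It assumes one-byte values and an LSB-first validity
bitmap; the helper names slice_values and realign_bitmap are hypothetical. A
byte-addressable buffer absorbs the offset into the start of the view, while a
bitmap whose bit offset is not a multiple of 8 has to be repacked to be
exported with offset 0.

```python
def get_bit(bitmap: bytes, i: int) -> bool:
    """Read bit i of an LSB-first packed bitmap."""
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


def slice_values(values: memoryview, offset: int, length: int) -> memoryview:
    # Byte-addressable buffer: the offset folds into the start of the view,
    # so no data is copied.
    return values[offset:offset + length]


def realign_bitmap(bitmap: bytes, offset: int, length: int) -> bytes:
    # A bit offset that is not a multiple of 8 cannot be expressed by moving
    # a byte pointer, so producing an offset-0 bitmap repacks every bit.
    out = bytearray((length + 7) // 8)
    for i in range(length):
        if get_bit(bitmap, offset + i):
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


values = memoryview(bytes([10, 20, 30, 40, 50]))
validity = bytes([0b00011110])  # slot 0 is null, slots 1..4 are valid

sliced_values = slice_values(values, 1, 4)        # zero-copy view
sliced_validity = realign_bitmap(validity, 1, 4)  # copy
```

The values slice shares memory with the original buffer, whereas the bitmap
slice is a new allocation; that copy is the cost in question.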

> This seems to assume that many or most arrays will have non-zero
> offsets.  Is this something that commonly happens in the Rust Arrow
> world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.

The main use-case I observe comes from Polars [2]. Polars is pretty fast
[3] and employs many techniques to extract performance from Arrow. Some
use-cases where slices are used (ccing Ritchie, who is the expert):
* users slice dataframes and series ad-hoc
* in group-bys and aggregations, polars slices arrays to parallelize the
workload when the chunks are large.
* "take" and "filter" of utf8 arrays are sufficiently expensive that Polars
slices the array and parallelizes the workload.
* in group-bys with complex aggregations (rank, collect_list, reverse), Polars
builds a ListArray and then performs operations on the sub-items. Given
[[None], [1, 2, 3, 4]], it operates on the sub-array [1, 2, 3, 4] as "[None,
1, 2, 3, 4].slice(1, 4)", i.e. one slice per item.
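
As a rough illustration of the last point, the sketch below models a list
array as a flat child array plus list offsets and produces one slice per item.
It is plain Python rather than Polars or arrow2 code; the Slice class and the
sub_items function are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slice:
    """Hypothetical zero-copy view: shares `values`, stores offset and length."""
    values: List[Optional[int]]
    offset: int
    length: int

    def to_list(self) -> List[Optional[int]]:
        # Materialize the view (only needed to inspect the result).
        return self.values[self.offset:self.offset + self.length]


def sub_items(values: List[Optional[int]], offsets: List[int]) -> List[Slice]:
    # Item i of a list array spans values[offsets[i]:offsets[i + 1]],
    # so each item becomes a slice with its own non-zero offset.
    return [
        Slice(values, start, end - start)
        for start, end in zip(offsets, offsets[1:])
    ]


# [[None], [1, 2, 3, 4]] stored as a flat child array plus list offsets.
values = [None, 1, 2, 3, 4]
offsets = [0, 1, 5]

items = sub_items(values, offsets)
```

Here items[1] carries offset 1 and length 4, which matches the slice(1, 4)
above. With many groups, almost every sub-item has a non-zero offset.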

[1] https://issues.apache.org/jira/browse/ARROW-14453
[2] https://github.com/pola-rs/polars
[3] https://h2oai.github.io/db-benchmark/

On Wed, Oct 27, 2021 at 7:57 PM Antoine Pitrou <anto...@python.org> wrote:

>
> Le 26/10/2021 à 21:30, Jorge Cardoso Leitão a écrit :
> > Hi,
> >
> > One aspect of the design of "arrow2" is that it deals with array slices
> > differently from the rest of the implementations. Essentially, the offset
> > is not stored in ArrayData, but on each individual Buffer. Some important
> > consequences are:
> >
> > * people can work with buffers and bitmaps without having to drag the
> > corresponding array offset with them (which is a common source of
> > unsoundness in the official Rust implementation)
> > * arrays can store buffers/bitmaps with independent offsets
> > * it does not roundtrip over the c data interface at zero cost, because
> > the c data interface only allows a single offset per array, not per
> > buffer/bitmap.
>
> To be clear, this only comes into play for bit buffers (such as the
> validity bitmap), right?  Otherwise, the offset can just be incorporated
> into the buffer's base pointer.
>
> > I have been benchmarking the consequences of this design choice and
> > reached the conclusion that storing the offset on a per buffer basis
> > offers at least 15% improvement in compute (results vary on kernel and
> > likely implementation).
>
> This seems to assume that many or most arrays will have non-zero
> offsets.  Is this something that commonly happens in the Rust Arrow
> world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.
>
> Regards
>
> Antoine.
>
