Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-28 Thread Jörn Horstmann
Hi,

On Wed, Oct 27, 2021 at 7:57 PM Antoine Pitrou  wrote:
>
> This seems to assume that many or most arrays will have non-zero
> offsets.  Is this something that commonly happens in the Rust Arrow
> world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.
>
> Regards
>
> Antoine.

Regarding usecases, I can add two from our internal query engine:

- we store and cache complete columns in memory and then process them
in smaller, cache-optimized, batches, which are created by slicing the
full dataset.
- our query engine and data model uses ListArrays a lot and common
operations are aggregations of the values nested in these ListArrays.
So for example for a ListArray of Float64Array, calculating a
Float64Array containing the sums of each nested array.

The first design decision initially uncovered a lot of assumptions
about offsets being 0. It is however again a bit of a special case
since the offsets are usually multiples of 8 and so can be pushed down
into the buffers' base pointer even for validity and boolean.

Regards,
-- 
Jörn Horstmann | Senior Backend Engineer

www.signavio.com
Kurfürstenstraße 111, 10787 Berlin, Germany


Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-27 Thread Jorge Cardoso Leitão
Hi,

> A big +1 to this, covering all the edge cases with slices is pretty
complicated (there was at least one long standing bug related to this in
the 6.0 release).  I imagine there are potentially more lurking in the code
base.

Thanks for this observation, arrow-rs faces a similar issue: it is
relatively easy to hit bugs because of the offset.

> To be clear, this only comes into play for bit buffers (such as the
validity bitmap), right?  Otherwise, the offset can just be incorporated
into the buffer's base pointer.

Exactly. Except for the BooleanArray, the "offset" in c data interface is
just an offset of the validity bitmap. I raised ARROW-14453 [1] to try to
mitigate this, but it does not solve it completely.

> This seems to assume that many or most arrays will have non-zero
offsets.  Is this something that commonly happens in the Rust Arrow
world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.

The main use-case I observe comes from Polars [2]. Polars is pretty fast
[3] and employs many techniques to extract performance from Arrow. Some
use-cases where slices are used (ccing Ritchie, that is the expert):
* users slice dataframes and series ad-hoc
* in group-bys and aggregations, polars slices arrays to parallelize the
workload when the chunks are large.
* "take" and "filter" of utf8 arrays is sufficiently expensive that it
slices the array and parallelizes the workload.
* group-bys with complex aggregations (rank, collect_list, reverse) it
builds a ListArray and then performs operations on the subitems. Given
[[None], [1, 2, 3, 4]], it operates on the subarray [1, 2, 3, 4] as "[None,
1, 2, 3, 4].slice(1, 4)", so, a slice per item.

[1] https://issues.apache.org/jira/browse/ARROW-14453
[2] https://github.com/pola-rs/polars
[3] https://h2oai.github.io/db-benchmark/


On Wed, Oct 27, 2021 at 7:57 PM Antoine Pitrou  wrote:

>
> Le 26/10/2021 à 21:30, Jorge Cardoso Leitão a écrit :
> > Hi,
> >
> > One aspect of the design of "arrow2" is that it deals with array slices
> > differently from the rest of the implementations. Essentially, the offset
> > is not stored in ArrayData, but on each individual Buffer. Some important
> > consequence are:
> >
> > * people can work with buffers and bitmaps without having to drag the
> > corresponding array offset with them (which are common source of
> > unsoundness in the official Rust implementation)
> > * arrays can store buffers/bitmaps with independent offsets
> > * it does not roundtrip over the c data interface at zero cost, because
> the
> > c data interface only allows a single offset per array, not per
> > buffer/bitmap.
>
> To be clear, this only comes into play for bit buffers (such as the
> validity bitmap), right?  Otherwise, the offset can just be incorporated
> into the buffer's base pointer.
>
>  > I have been benchmarking the consequences of this design choice and
> reached
>  > the conclusion that storing the offset on a per buffer basis offers at
>  > least 15% improvement in compute (results vary on kernel and likely
>  > implementation).
>
> This seems to assume that many or most arrays will have non-zero
> offsets.  Is this something that commonly happens in the Rust Arrow
> world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.
>
> Regards
>
> Antoine.
>


Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-27 Thread Antoine Pitrou



Le 26/10/2021 à 21:30, Jorge Cardoso Leitão a écrit :

Hi,

One aspect of the design of "arrow2" is that it deals with array slices
differently from the rest of the implementations. Essentially, the offset
is not stored in ArrayData, but on each individual Buffer. Some important
consequence are:

* people can work with buffers and bitmaps without having to drag the
corresponding array offset with them (which are common source of
unsoundness in the official Rust implementation)
* arrays can store buffers/bitmaps with independent offsets
* it does not roundtrip over the c data interface at zero cost, because the
c data interface only allows a single offset per array, not per
buffer/bitmap.


To be clear, this only comes into play for bit buffers (such as the 
validity bitmap), right?  Otherwise, the offset can just be incorporated 
into the buffer's base pointer.


> I have been benchmarking the consequences of this design choice and 
reached

> the conclusion that storing the offset on a per buffer basis offers at
> least 15% improvement in compute (results vary on kernel and likely
> implementation).

This seems to assume that many or most arrays will have non-zero 
offsets.  Is this something that commonly happens in the Rust Arrow 
world?  In Arrow C++ I'm not sure non-zero offsets appear very frequently.


Regards

Antoine.


Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-26 Thread Micah Kornfield
>
> To understand why this is the case, consider comparing two boolean arrays
> (a, b), where "a" has been sliced and has a validity and "b" does not. In
> this case, we could compare the values of the arrays (taking into account
> "a"'s offset), and clone "a"'s validity. However, this does not work
> because the validity is "offsetted", while the result of the comparison of
> the values is not. Thus, we need to create a shifted copy of the validity.


C++ has optimized routines for this that attempt to do this without copies
[1].  It has been a while since I looked  but I think in the worst case for
offsets, it might fall back to bit at a time operations, which I suppose
might actually be worse then the copy + shift).


> Since the output array
> would not have an offset we would need to create a new buffer but this
> would be a zero copy operation that references the original buffer
> (buffers in the C++ impl can have a shared pointer to a parent for
> reference counting purposes).  Now, I'm not sure if we actually
> perform this optimization, but it should be possible.

This might be obvious but It is a little more difficult than that, since
the offset represents a bit offset for validity buffer, so if you want it
to be zero copy, then you need to offset the value buffer as well (assuming
the offset isn't a multiple of 8).


All of that being said, I can also understand your motivation, we have
> certainly had bugs in the C++ implementation in the past because
> kernels forgot to account for the array offset.

A big +1 to this, covering all the edge cases with slices is pretty
complicated (there was at least one long standing bug related to this in
the 6.0 release).  I imagine there are potentially more lurking in the code
base.

I don't actually know
> the reason we need to keep the offset around after a slice operation
> but maybe someone else has more history on that decision.


These aren't necessarily strong justifications (I believe Wes originally
designed the class hierarchy so he would be the best person to comment):
1.   You would likely need a different member type (Bitmap) to track offset
+ validity buffer, instead of the raw validity buffer that is currently
stored in ArrayData.  I think the Bitmap class came sometime after the
original layout was defined.  So it might have simply been natural
evolution.
2.   Keeping the offset at the array level allows for laziness for lazy
slicing of child arrays for structs.
3.   Slightly higher space overhead for every buffer has a slice (this
likely was not the reason).

I remember there was a lot of discussion on including the slice in the
C-API at all because of the complexity, I don't recall anything about
individual buffer offsets, but it might be worth finding and looking over
that thread if there aren't enough answers provided here.

[1]
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/arrow/util/bitmap_ops.cc






On Tue, Oct 26, 2021 at 6:27 PM Weston Pace  wrote:

> I don't think the presence of array-level offsets precludes the
> presence of buffer-level offsets.  For example, in the C++
> implementation we have both buffer offsets and array offsets.  Buffer
> offsets are used mainly in the IPC layer I think when we are
> constructing arrays from larger memory mapped regions.  Array offsets
> are used when we need to slice arrays.  At the C data interface a
> buffer is just void* and the responsibility for releasing the buffer
> remains with the producer so the consumer doesn't need to know that
> the void* is actually an offset into a larger range of memory or a
> fully self-contained allocation.
>
> > but my understanding
> > is that this design choice affects the compute kernels of most
> > implementations, since they all perform a copy to de-offset the sliced
> > buffers on every operation over sliced arrays?
>
> I'm not sure what you are describing here.  In the C++ impl I do not
> believe we perform a copy to de-offset sliced buffers.  For example,
> given your boolean scenario, then we would have an input Array with an
> offset and that Array's data would contain a buffer for the validity
> (which could have its own independent offset).  Since the output array
> would not have an offset we would need to create a new buffer but this
> would be a zero copy operation that references the original buffer
> (buffers in the C++ impl can have a shared pointer to a parent for
> reference counting purposes).  Now, I'm not sure if we actually
> perform this optimization, but it should be possible.
>
> All of that being said, I can also understand your motivation, we have
> certainly had bugs in the C++ implementation in the past because
> kernels forgot to account for the array offset.  I don't actually know
> the reason we need to keep the offset around after a slice operation
> but maybe someone else has more history on that decision.
>
> On Tue, Oct 26, 2021 at 9:31 AM Jorge Cardoso Leitão
>  wrote:
> >
> >

Re: [Discuss] Single offset per array has a non-trivial performance implication

2021-10-26 Thread Weston Pace
I don't think the presence of array-level offsets precludes the
presence of buffer-level offsets.  For example, in the C++
implementation we have both buffer offsets and array offsets.  Buffer
offsets are used mainly in the IPC layer I think when we are
constructing arrays from larger memory mapped regions.  Array offsets
are used when we need to slice arrays.  At the C data interface a
buffer is just void* and the responsibility for releasing the buffer
remains with the producer so the consumer doesn't need to know that
the void* is actually an offset into a larger range of memory or a
fully self-contained allocation.

> but my understanding
> is that this design choice affects the compute kernels of most
> implementations, since they all perform a copy to de-offset the sliced
> buffers on every operation over sliced arrays?

I'm not sure what you are describing here.  In the C++ impl I do not
believe we perform a copy to de-offset sliced buffers.  For example,
given your boolean scenario, then we would have an input Array with an
offset and that Array's data would contain a buffer for the validity
(which could have its own independent offset).  Since the output array
would not have an offset we would need to create a new buffer but this
would be a zero copy operation that references the original buffer
(buffers in the C++ impl can have a shared pointer to a parent for
reference counting purposes).  Now, I'm not sure if we actually
perform this optimization, but it should be possible.

All of that being said, I can also understand your motivation, we have
certainly had bugs in the C++ implementation in the past because
kernels forgot to account for the array offset.  I don't actually know
the reason we need to keep the offset around after a slice operation
but maybe someone else has more history on that decision.

On Tue, Oct 26, 2021 at 9:31 AM Jorge Cardoso Leitão
 wrote:
>
> Hi,
>
> One aspect of the design of "arrow2" is that it deals with array slices
> differently from the rest of the implementations. Essentially, the offset
> is not stored in ArrayData, but on each individual Buffer. Some important
> consequence are:
>
> * people can work with buffers and bitmaps without having to drag the
> corresponding array offset with them (which are common source of
> unsoundness in the official Rust implementation)
> * arrays can store buffers/bitmaps with independent offsets
> * it does not roundtrip over the c data interface at zero cost, because the
> c data interface only allows a single offset per array, not per
> buffer/bitmap.
>
> I have been benchmarking the consequences of this design choice and reached
> the conclusion that storing the offset on a per buffer basis offers at
> least 15% improvement in compute (results vary on kernel and likely
> implementation).
>
> To understand why this is the case, consider comparing two boolean arrays
> (a, b), where "a" has been sliced and has a validity and "b" does not. In
> this case, we could compare the values of the arrays (taking into account
> "a"'s offset), and clone "a"'s validity. However, this does not work
> because the validity is "offsetted", while the result of the comparison of
> the values is not. Thus, we need to create a shifted copy of the validity.
> I measure 15% of the total compute time on my benches being done on
> creating this shifted copy.
>
> The root cause is that the C data interface declares an offset on the
> ArrayData, as opposed to an offset on each of the buffers contained on it.
> With an offset shared between buffers, we can't slice individual bitmap
> buffers, which forbids us from leveraging the optimization of simply
> cloning buffers instead of copying them.
>
> I wonder whether this was discussed previously, or whether the "single
> offset per array in the c data interface" considered this performance
> implication.
>
> Atm the solution we adopted is to incur the penalty cost of ("de-offseting
> buffers") when passing offsetted arrays via the c data interface, since
> this way users benefit from faster compute kernels and only incur this cost
> when it is strictly needed for the C data interface, but my understanding
> is that this design choice affects the compute kernels of most
> implementations, since they all perform a copy to de-offset the sliced
> buffers on every operation over sliced arrays?
>
> Best,
> Jorge