ctsk commented on issue #6692:
URL: https://github.com/apache/arrow-rs/issues/6692#issuecomment-2785965705
I've been pondering the problem of an "appending" take kernel and would like
to share my thoughts.
### Coming from DataFusion...
Based on the experimental implementation of `take_in`, I believe the
interface needed for DataFusion's use case looks like this:
```rust
trait ColumnSink {
fn take(&mut self, source: &dyn Array, indices: &dyn Array);
fn emit_finished(&mut self) -> Option<Arc<dyn Array>>;
fn flush(&mut self);
}
```
ColumnSink maintains internal state, with `ColumnSink::take` logically
appending the result of the `take` operation to its internal state. A key
property is that ColumnSink can output its state incrementally rather than
requiring a single vector output. This makes ColumnSink at least as capable as
DataFusion's current approach, as it can fall back to using the current `take`
kernel and leave concatenation to a `CoalesceBatchesExec`.
I believe this abstraction belongs to DataFusion rather than arrow-rs.
### Allocation-minimal kernels in arrow-rs
Ideally, ColumnSink would produce perfectly-sized batches of a given size,
requiring it to control the destination area's sizing. This calls for a version
of take that writes into a provided buffer. I propose:
```rust
fn take_and_append(
source: &dyn Array,
indices: &dyn Array,
destination: ???
) -> Result<(), ArrowError>;
```
The question is what type should replace `???`. I've considered these
options:
#### Option 1: `Box<dyn ArrayBuilder>` | `&mut dyn ArrayBuilder`
This was the starting point for this issue. ArrayBuilder is convenient
because it already contains the appropriate data structures. While
`PrimitiveArrays` hold all state in the result array, for `DictionaryArrays`,
we'd want the destination to hold a hash table that won't be in the finished
array.
The challenge with using ArrayBuilder is that builders don't sufficiently
expose their internals for kernel implementations.
Potential approaches:
1. **Expand the builder interface**: Add methods giving callers more control
over builder mutation. The challenge is finding reusable abstractions between
builders, as well as balancing between `unsafe` methods that might leave the
builder in an inconsistent state and enforcing consistency after each
manipulation.
2. **Deconstruction approach**: Each builder would implement `into_parts`
and `new`/`new_unchecked`, similar to what exists for Arrays. Kernels could
arbitrarily mutate these buffers and then reconstruct the builder. This seems
cleaner than expanding the builder interface further.
The deconstruction approach has the drawback of requiring ownership of the
builder, whereas a mutable reference would make more sense semantically.
#### Option 2: `&mut MutableArray`
Taking the deconstruction approach further, we could introduce mutable
counterparts (`MutablePrimitiveArray<T>`, etc.) for each array type as the
output of `into_parts`.
We would add a type-erased MutableArray abstraction on top. This type would
essentially hold the same state as a builder but could be more freely mutated
without many consistency guarantees. Consistency would be checked when
converting from MutableArray to Array.
Each concrete mutable array would expose methods to manipulate each of its
fields, making its internal data and workings transparent.
The trait could be:
```rust
trait MutableArray {
fn data_type(&self) -> DataType;
fn len(&self) -> usize;
fn capacity(&self) -> usize;
fn into_array(self) -> Arc<dyn Array>;
fn into_builder(self) -> Arc<dyn ArrayBuilder>;
}
```
It wouldn't need common methods to manipulate the MutableArray since kernels
would downcast it before manipulation. It *could* feature a method to modify
the optional null buffer.
*All names are subject to change*
Thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]