Hi,

Over the past 4 months, I have been growing more and more frustrated by the
amount of undefined behaviour that I am finding and fixing on the Rust
implementation. I would like to open the discussion of a broader overview
about the problem in light of our current knowledge and what Rust enables
as well as offer a solution to the bigger problem.

Just to give you a gist of the seriousness of the issue, the following
currently compiles, runs, and is undefined behavior in Rust:

let buffer = Buffer::from(&[0i32, 2i32]);let data =
ArrayData::new(DataType::Int64, 10, 0, None, 0, vec![buffer],
vec![]);let array = Float64Array::from(Arc::new(data));
println!("{:?}", array.value(1));

I would like to propose a major refactor of the crate around physical
traits, Buffer, MutableBuffer and ArrayData to make our code type-safe at
compile time, thereby avoiding things like the example above from happening
again.

So far, I was able to reproduce all core features of the arrow crate
(nested types, dynamic typing, FFI, memory alignment, performance) by using
`Buffer<T: NativeType>` instead of `Buffer` and removing `ArrayData` and
RawPointer altogether.

Safety-wise, it significantly limits the usage of `unsafe` on higher end
APIs, it has a single transmute (the bit chunk iterator one), and a
guaranteed safe public API (which is not the case in our master, as shown
above).

Performance wise, it yields a 1.3x improvement over the current master
(after this fix <https://github.com/apache/arrow/pull/9301> of UB on the
take kernel, 1.7x prior to it) for the `take` kernel for primitives. I
should have other major performance improvements.

API wise, it simplifies the traits that we have for memory layout as well
as the handling of bitmaps, offsets, etc.

The proposal is drafted as a README
<https://github.com/jorgecarleitao/arrow2/blob/proposal/README.md> on a
repo that I created specifically for this from the ground up, and the full
set of changes are in a PR <https://github.com/jorgecarleitao/arrow2/pull/1>
so that anyone can view and comment on it. I haven't made any PR to master
because this is too large to track as a diff against master, and is beyond
the point, anyways.

I haven't ported most of the crate as I only tried the non-trivial features
(memory layout, bitmaps, FFI, dynamic typing, nested types).

I would highly appreciate your thoughts about it.

Best,
Jorge

Reply via email to