jorgecarleitao opened a new pull request #9413: URL: https://github.com/apache/arrow/pull/9413
## Rational When `ArrayData::new` is used, we make no attempt to verify that its contents lead to `sound` code. In particular, for primitive arrays, the following must hold: ``` /// * `values.len()` is a multiple of `size_of::<T::Native>` /// * `values.as_ptr()` is aligned with `T::Native` /// * when `nulls` is `Some`, `nulls.len` is equal to `(ArrayData::len + 7) / 8`. /// * `offset <= ArrayData::len` ``` * The first two conditions allow us to soundly transmute the `u8` buffer into `T::Native`. * The third condition allow us to access bits on the null bitmap up to `ArrayData::len`. * The last condition avoids out of bound accesses to the data when offsets are used. ## This PR This PR introduces two methods, one `safe` and one `unsafe`, that create an `ArrayData` for primitives by verifying that the above assumptions hold. The `safe` version uses `assert_*`, the `unsafe` uses `debug_assert_*`. This PR also migrates most of the `ArrayData::new` for primitive arrays to its `unsafe` counterpart introduced in this PR, to at least have a `debug` verification. While migrating code to this version, I found an out-of-bound access in the `nullif` kernel derived from passing a value buffer whose `len` is smaller than the necessary. Note that are two aspects of unsafety here: 1. it is `unsafe` to pass a `len` to `ArrayData::new` that goes beyond `Buffer::len`. This is somewhat equivalent to the `TrustedLen` trait that Rust offers to "prove" that the `len` of an iterator is trustworthy. 2. it is `unsafe` to pass a buffer to `ArrayData::new` that does not match the corresponding `DataType`. Again, doing so is `unsafe`, and therefore it is not possible to prove, by calling `ArrayData::new`, that reading from that buffer as `T::Native` is sound. So, in general, `ArrayData::new` is _very_ `unsafe`. This PR is an attempt to replace some of these calls by an `unsafe` counterpart, so that at least it is obvious that the call needs to fulfill certain invariants that cannot be proven by the compiler. I am in favor of marking `ArrayData::new` as `unsafe` (major backward incompatible change :/). This is because it is expensive to verify that `ArrayData` is safe for strings, as we need to verify that the value buffer contains valid utf8s. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
