jorgecarleitao opened a new pull request #9413:
URL: https://github.com/apache/arrow/pull/9413


   ## Rational
   
   When `ArrayData::new` is used, we make no attempt to verify that its 
contents lead to `sound` code. In particular, for primitive arrays, the 
following must hold:
   
   ```
   /// * `values.len()` is a multiple of `size_of::<T::Native>`
   /// * `values.as_ptr()` is aligned with `T::Native`
   /// * when `nulls` is `Some`, `nulls.len` is equal to `(ArrayData::len + 7) 
/ 8`.
   /// * `offset <= ArrayData::len`
   ```
   
   * The first two conditions allow us to soundly transmute the `u8` buffer 
into `T::Native`.
   * The third condition allow us to access bits on the null bitmap up to 
`ArrayData::len`.
   * The last condition avoids out of bound accesses to the data when offsets 
are used.
   
   ## This PR
   
   This PR introduces two methods, one `safe` and one `unsafe`, that create an 
`ArrayData` for primitives by verifying that the above assumptions hold. The 
`safe` version uses `assert_*`, the `unsafe` uses `debug_assert_*`.
   
   This PR also migrates most of the `ArrayData::new` for primitive arrays to 
its `unsafe` counterpart introduced in this PR, to at least have a `debug` 
verification.
   
   While migrating code to this version, I found an out-of-bound access in the 
`nullif` kernel derived from passing a value buffer whose `len` is smaller than 
the necessary.
   
   Note that are two aspects of unsafety here:
   
   1. it is `unsafe` to pass a `len` to `ArrayData::new` that goes beyond 
`Buffer::len`. This is somewhat equivalent to the `TrustedLen` trait that Rust 
offers to "prove" that the `len` of an iterator is trustworthy.
   
   2. it is `unsafe` to pass a buffer to `ArrayData::new` that does not match 
the corresponding `DataType`. Again, doing so is `unsafe`, and therefore it is 
not possible to prove, by calling `ArrayData::new`, that reading from that 
buffer as `T::Native` is sound.
   
   So, in general, `ArrayData::new` is _very_ `unsafe`. This PR is an attempt 
to replace some of these calls by an `unsafe` counterpart, so that at least it 
is obvious that the call needs to fulfill certain invariants that cannot be 
proven by the compiler.
   
   I am in favor of marking `ArrayData::new` as `unsafe` (major backward 
incompatible change :/). This is because it is expensive to verify that 
`ArrayData` is safe for strings, as we need to verify that the value buffer 
contains valid utf8s.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to