nevi-me commented on a change in pull request #9413:
URL: https://github.com/apache/arrow/pull/9413#discussion_r575766126



##########
File path: rust/arrow/src/array/data.rs
##########
@@ -467,6 +467,37 @@ impl ArrayData {
     }
 }
 
+impl ArrayData {
+    /// Creates a new [`ArrayData`] for a primitive array, verifying all 
assumptions for consumers to
+    /// soundly use it.
+    /// # Panic
+    /// This function panics iff any of the following:
+    /// * `values.len()` is not a multiple of `size_of::<T::Native>`
+    /// * `values.as_ptr()` is not aligned with `T::Native`
+    /// * when `nulls` is `Some`, `nulls.len` is smaller than `(ArrayData::len 
+ 7) / 8`.
+    #[inline]
+    pub fn new_primitive<T: ArrowPrimitiveType>(
+        values: Buffer,
+        nulls: Option<Buffer>,

Review comment:
       Should this cater for offsets, or is your preference to omit it? If we 
took an offset, we'd add an extra assertion that checks that `offset < 
calculated_len`.
   
   Thinking more about it, if we don't take an offset, we might force users to 
still use `ArrayData::new` for that facility, which might be undesirable.

##########
File path: rust/arrow/src/compute/kernels/boolean.rs
##########
@@ -288,27 +288,25 @@ where
 
     // Align/shift left data on offset as needed, since new bitmaps are 
shifted and aligned to 0 already
     // NOTE: this probably only works for primitive arrays.
-    let data_buffers = if left.offset() == 0 {
-        left_data.buffers().to_vec()
+    let buffer = if left.offset() == 0 {
+        left_data.buffers()[0].clone()
     } else {
         // Shift each data buffer by type's bit_width * offset.
-        left_data
-            .buffers()
-            .iter()
-            .map(|buf| buf.slice(left.offset() * T::get_byte_width()))
-            .collect::<Vec<_>>()
+        left_data.buffers()[0].slice(left.offset() * T::get_byte_width())
     };
 
+    // UNSOUND: when `offset != 0`, the sliced buffer's `len` will be smaller 
than `left.len()`

Review comment:
       There's 2 issues:
   
   1. When slicing `ArrayData`, we do not pass the slice down to `Buffer`, even 
though it has a `slice()` method.
   
   2. The way that we calculate `Buffer::len` is by taking its data 
(`Arc<Bytes>`) and subtracting `Buffer::offset`. The conclusion that I reach is 
that `Buffer::len()` is incorrect.
   
   I came across this earlier today, and to address this, I'm trying out the 
following change:
   
   ```rust
   pub struct Buffer {
       /// the internal byte buffer.
       data: Arc<Bytes>,
   
       /// The offset into the buffer.
       offset: usize,
   
       /// The logical length of the buffer. [ADDED]
       length: usize,
   }
   ```
   
   The only time we set `Buffer::offset` != 0 is on `Buffer::slice(offset: 
usize)`.
   Now, if I have an array of 24 values, and slice it with offset 12 and length 
8, we have a new array with len = 8, but its buffer's len is still 24. 
   If we pass down both the offset and length to the buffer, we would make the 
whole thing consistent; and avoid the above check.

##########
File path: rust/arrow/src/compute/kernels/arithmetic.rs
##########
@@ -225,15 +205,7 @@ where
         unsafe { Buffer::try_from_trusted_len_iter(values) }
     }?;
 
-    let data = ArrayData::new(

Review comment:
       I agree!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to