Re: [PR] fix: Adjust FFI_ArrowArray offset based on the offset of offset buffer [arrow-rs]

via GitHub Mon, 17 Jun 2024 09:52:12 -0700


alamb commented on code in PR #5895:
URL: https://github.com/apache/arrow-rs/pull/5895#discussion_r1643096252



##########
arrow-array/src/ffi.rs:
##########
@@ -1458,4 +1460,58 @@ mod tests_from_ffi {
 
         test_round_trip(&imported_array.consume()?)
     }
+
+    fn export_string(array: StringArray) -> StringArray {
+        let data = array.into_data();
+
+        let array = FFI_ArrowArray::new(&data);
+        let schema = FFI_ArrowSchema::try_from(data.data_type()).unwrap();
+
+        let array = unsafe { from_ffi(array, &schema) }.unwrap();
+        StringArray::from(array)
+    }
+
+    fn extend_array(array: &dyn Array) -> ArrayRef {
+        let len = array.len();
+        let data = array.to_data();
+
+        let mut mutable = MutableArrayData::new(vec![&data], false, len);
+        mutable.extend(0, 0, len);
+        make_array(mutable.freeze())
+    }
+
+    #[test]
+    fn test_extend_imported_string_slice() {

Review Comment:
   I ran this test without the code changes in this PR and it fails like
   
   ```shell
   range end index 10890 out of range for slice of length 5500
   thread 'ffi::tests_from_ffi::test_extend_imported_string_slice' panicked at 
arrow-data/src/transform/variable_size.rs:38:29:
   range end index 10890 out of range for slice of length 5500
   stack backtrace:
   ```
   
   Thus I believe the test correctly covers the new code



##########
arrow-buffer/src/buffer/immutable.rs:
##########
@@ -71,6 +71,11 @@ impl Buffer {
         }
     }
 
+    /// Returns the internal byte buffer.
+    pub fn get_bytes(&self) -> Arc<Bytes> {

Review Comment:
   Rather than exposing an internal detail, perhaps we can move the offset 
calculation into `Buffer` (see suggestion below). In addition to keeping the 
code more encapsulated, I think it would also likely be faster as it would 
avoid the Arc clone.



##########
arrow-data/src/ffi.rs:
##########
@@ -131,6 +131,45 @@ impl FFI_ArrowArray {
             data.buffers().iter().map(|b| Some(b.clone())).collect()
         };
 
+        // Handle buffer offset for offset buffer.
+        let offset_offset = match data.data_type() {
+            DataType::Utf8 | DataType::Binary => {
+                // Offset buffer is possible a slice of the buffer.
+                // If we use slice pointer as exported buffer pointer, it will 
cause
+                // the consumer to calculate incorrect length of data buffer 
(buffer 1).
+                // We need to get the offset of the offset buffer and fill it 
in
+                // the `FFI_ArrowArray` offset field.
+                Some(unsafe {
+                    let b = &data.buffers()[0];
+                    b.as_ptr().offset_from(b.get_bytes().ptr().as_ptr()) as 
usize

Review Comment:
   I think this calculation assumes that the string data begins at the start of 
the buffer (aka `b.as_ptr()` is the base). 
   
   I double checked and the comments seem to imply that that is always the case
   
https://github.com/apache/arrow-rs/blob/c096172b769da8003ea7816086df60f38229a891/arrow-buffer/src/bytes.rs#L40-L39
   
   Since this is calculating on internal state of `Buffer`, I think this code 
would be easier to understand / document if we added a method to `Buffer` that 
did the calculation and could be documented better:
   
   Something like
   ```rust
   impl Buffer {
     .. 
     /// Returns the offset, in bytes, of `Self::ptr` to `Self::data`
     ///
     /// self.ptr and self.data can be different in the following cases:
     /// TODO
     fn ptr_offset(&self) -> usize {
   ...
     }
   ```
   
   I was non obvious to me when reading the `Buffer` code that it was possible 
for `ptr` and `data` to be different



##########
arrow-array/src/ffi.rs:
##########
@@ -1458,4 +1460,58 @@ mod tests_from_ffi {
 
         test_round_trip(&imported_array.consume()?)
     }
+
+    fn export_string(array: StringArray) -> StringArray {

Review Comment:
   Perhaps calling this function  "roundtrip_string_array" would better 
describe what it does



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix: Adjust FFI_ArrowArray offset based on the offset of offset buffer [arrow-rs]

Reply via email to