alamb commented on code in PR #5895:
URL: https://github.com/apache/arrow-rs/pull/5895#discussion_r1643096252
##########
arrow-array/src/ffi.rs:
##########
@@ -1458,4 +1460,58 @@ mod tests_from_ffi {
test_round_trip(&imported_array.consume()?)
}
+
+ fn export_string(array: StringArray) -> StringArray {
+ let data = array.into_data();
+
+ let array = FFI_ArrowArray::new(&data);
+ let schema = FFI_ArrowSchema::try_from(data.data_type()).unwrap();
+
+ let array = unsafe { from_ffi(array, &schema) }.unwrap();
+ StringArray::from(array)
+ }
+
+ fn extend_array(array: &dyn Array) -> ArrayRef {
+ let len = array.len();
+ let data = array.to_data();
+
+ let mut mutable = MutableArrayData::new(vec![&data], false, len);
+ mutable.extend(0, 0, len);
+ make_array(mutable.freeze())
+ }
+
+ #[test]
+ fn test_extend_imported_string_slice() {
Review Comment:
I ran this test without the code changes in this PR and it fails like
```shell
range end index 10890 out of range for slice of length 5500
thread 'ffi::tests_from_ffi::test_extend_imported_string_slice' panicked at
arrow-data/src/transform/variable_size.rs:38:29:
range end index 10890 out of range for slice of length 5500
stack backtrace:
```
Thus I believe the test correctly covers the new code
##########
arrow-buffer/src/buffer/immutable.rs:
##########
@@ -71,6 +71,11 @@ impl Buffer {
}
}
+ /// Returns the internal byte buffer.
+ pub fn get_bytes(&self) -> Arc<Bytes> {
Review Comment:
Rather than exposing an internal detail, perhaps we can move the offset
calculation into `Buffer` (see suggestion below). In addition to keeping the
code more encapsulated, I think it would also likely be faster as it would
avoid the Arc clone.
##########
arrow-data/src/ffi.rs:
##########
@@ -131,6 +131,45 @@ impl FFI_ArrowArray {
data.buffers().iter().map(|b| Some(b.clone())).collect()
};
+ // Handle buffer offset for offset buffer.
+ let offset_offset = match data.data_type() {
+ DataType::Utf8 | DataType::Binary => {
+ // Offset buffer is possible a slice of the buffer.
+ // If we use slice pointer as exported buffer pointer, it will
cause
+ // the consumer to calculate incorrect length of data buffer
(buffer 1).
+ // We need to get the offset of the offset buffer and fill it
in
+ // the `FFI_ArrowArray` offset field.
+ Some(unsafe {
+ let b = &data.buffers()[0];
+ b.as_ptr().offset_from(b.get_bytes().ptr().as_ptr()) as
usize
Review Comment:
I think this calculation assumes that the string data begins at the start of
the buffer (aka `b.as_ptr()` is the base).
I double checked and the comments seem to imply that that is always the case
https://github.com/apache/arrow-rs/blob/c096172b769da8003ea7816086df60f38229a891/arrow-buffer/src/bytes.rs#L40-L39
Since this is calculating on internal state of `Buffer`, I think this code
would be easier to understand / document if we added a method to `Buffer` that
did the calculation and could be documented better:
Something like
```rust
impl Buffer {
..
/// Returns the offset, in bytes, of `Self::ptr` to `Self::data`
///
/// self.ptr and self.data can be different in the following cases:
/// TODO
fn ptr_offset(&self) -> usize {
...
}
```
I was non obvious to me when reading the `Buffer` code that it was possible
for `ptr` and `data` to be different
##########
arrow-array/src/ffi.rs:
##########
@@ -1458,4 +1460,58 @@ mod tests_from_ffi {
test_round_trip(&imported_array.consume()?)
}
+
+ fn export_string(array: StringArray) -> StringArray {
Review Comment:
Perhaps calling this function "roundtrip_string_array" would better
describe what it does
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]