alamb commented on code in PR #5885:
URL: https://github.com/apache/arrow-rs/pull/5885#discussion_r1639003160
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -645,4 +695,32 @@ mod tests {
StringViewArray::new(views, buffers, None);
}
+
+ #[test]
+ fn test_gc() {
+ let mut array = StringViewArray::from_iter_values(["while my guitar
gently weeps"]);
Review Comment:
Can we please add test coverage for:
1. An input array with more than 1 view
2. At least 2 views that point into the same underlying buffer
3. A short view (aka value is inlined, than 12 bytes)
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
phantom: Default::default(),
}
}
+
+ /// Returns a buffer compact version of this array
Review Comment:
```suggestion
/// Returns a "compacted" version of this array
```
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
phantom: Default::default(),
}
}
+
+ /// Returns a buffer compact version of this array
+ ///
+ /// The original array will *not* be modified
+ ///
+ /// # Garbage Collection
+ ///
+ /// Before GC:
+ /// ```text
+ /// ┌──────┐
+ /// │......│
+ /// │......│
+ /// ┌────────────────────┐ ┌ ─ ─ ─ ▶ │Data1 │ Large buffer
+ /// │ View 1 │─ ─ ─ ─ │......│ with data that
+ /// ├────────────────────┤ │......│ is not referred
+ /// │ View 2 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
+ /// └────────────────────┘ │......│ View 2
+ /// │......│
+ /// 2 views, refer to │......│
+ /// small portions of a └──────┘
+ /// large buffer
+ /// ```
+ ///
+ /// After GC:
+ ///
+ /// ```text
+ /// ┌────────────────────┐ ┌─────┐ After gc, only
+ /// │ View 1 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│ data that is
+ /// ├────────────────────┤ ┌ ─ ─ ─ ▶ │Data2│ pointed to by
+ /// │ View 2 │─ ─ ─ ─ └─────┘ the views is
+ /// └────────────────────┘ left
+ ///
+ ///
+ /// 2 views
+ /// ```
+ /// This method will compact the data buffers by recreating the view array
and only include the data
+ /// that is pointed to by the views.
+ ///
+ /// Note that it will copy the array regardless of whether the original
array is compact.
Review Comment:
👍
##########
arrow-array/src/array/byte_view_array.rs:
##########
@@ -265,6 +265,56 @@ impl<T: ByteViewType + ?Sized> GenericByteViewArray<T> {
phantom: Default::default(),
}
}
+
+ /// Returns a buffer compact version of this array
+ ///
+ /// The original array will *not* be modified
+ ///
+ /// # Garbage Collection
+ ///
+ /// Before GC:
+ /// ```text
+ /// ┌──────┐
+ /// │......│
+ /// │......│
+ /// ┌────────────────────┐ ┌ ─ ─ ─ ▶ │Data1 │ Large buffer
+ /// │ View 1 │─ ─ ─ ─ │......│ with data that
+ /// ├────────────────────┤ │......│ is not referred
+ /// │ View 2 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or
+ /// └────────────────────┘ │......│ View 2
+ /// │......│
+ /// 2 views, refer to │......│
+ /// small portions of a └──────┘
+ /// large buffer
+ /// ```
+ ///
+ /// After GC:
+ ///
+ /// ```text
+ /// ┌────────────────────┐ ┌─────┐ After gc, only
+ /// │ View 1 │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│ data that is
+ /// ├────────────────────┤ ┌ ─ ─ ─ ▶ │Data2│ pointed to by
+ /// │ View 2 │─ ─ ─ ─ └─────┘ the views is
+ /// └────────────────────┘ left
+ ///
+ ///
+ /// 2 views
+ /// ```
+ /// This method will compact the data buffers by recreating the view array
and only include the data
+ /// that is pointed to by the views.
+ ///
+ /// Note that it will copy the array regardless of whether the original
array is compact.
+ /// Use with caution as this can be an expensive operation, only use it
when you are sure that the view
+ /// array is significantly smaller than when it is originally created,
e.g., after filtering or slicing.
+ pub fn gc(&self) -> Self {
+ let mut builder =
GenericByteViewBuilder::<T>::with_capacity(self.len());
+
+ for v in self.iter() {
+ builder.append_option(v);
Review Comment:
I think this is reasonable (and correct) first version
However, it could likely be made faster by special casing inline views --
the code here is going to be doing several checks on length only to copy the
bytes back into a u128
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]