alamb commented on code in PR #8990:
URL: https://github.com/apache/arrow-rs/pull/8990#discussion_r2617191664
##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -107,10 +108,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
current_size: STARTING_BLOCK_SIZE,
},
string_tracker: None,
+ max_deduplication_len: None,
phantom: Default::default(),
}
}
+ /// Configure max deduplication length when deduplicating strings while
building the array.
+ /// Default is [`MAX_INLINE_VIEW_LEN`] bytes.
+ /// See <https://github.com/apache/arrow-rs/issues/7187> for more details
on the implications.
+ pub fn with_max_deduplication_len(self, max_deduplication_len: u32) ->
Self {
+ debug_assert!(
Review Comment:
Why not leave use `assert` rather than `debug_assert`?
We should also document in the comments that the parameter must be greater
than zero if we are going to assert
##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -636,8 +658,58 @@ pub fn make_view(data: &[u8], block_id: u32, offset: u32)
-> u128 {
mod tests {
use core::str;
+ use arrow_buffer::ArrowNativeType;
+
use super::*;
+ #[test]
+ fn test_string_max_deduplication_len() {
+ let value_1 = "short";
+ let value_2 = "not so similar string but long";
+ let value_3 = "1234567890123";
+
+ let mut builder = StringViewBuilder::new()
+ .with_deduplicate_strings()
+ .with_max_deduplication_len(MAX_INLINE_VIEW_LEN * 2);
+ // safe to unwrap
+ let max_deduplication_len = builder.max_deduplication_len.unwrap();
+ assert!(builder.string_tracker.is_some());
+ assert!(max_deduplication_len > MAX_INLINE_VIEW_LEN);
Review Comment:
I am no sure that we need to test the getters here -- also when unwrapping
if you want to comment that it is safe, please explain *why* it is safe to
unwrap (the fact that it is safe is evident from the code and the fact that
this is run during tests)
##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -636,8 +658,58 @@ pub fn make_view(data: &[u8], block_id: u32, offset: u32)
-> u128 {
mod tests {
use core::str;
+ use arrow_buffer::ArrowNativeType;
+
use super::*;
+ #[test]
+ fn test_string_max_deduplication_len() {
+ let value_1 = "short";
+ let value_2 = "not so similar string but long";
+ let value_3 = "1234567890123";
+
+ let mut builder = StringViewBuilder::new()
+ .with_deduplicate_strings()
+ .with_max_deduplication_len(MAX_INLINE_VIEW_LEN * 2);
+ // safe to unwrap
+ let max_deduplication_len = builder.max_deduplication_len.unwrap();
+ assert!(builder.string_tracker.is_some());
+ assert!(max_deduplication_len > MAX_INLINE_VIEW_LEN);
+ assert!(value_1.len() < MAX_INLINE_VIEW_LEN.as_usize());
+ assert!(value_2.len() > max_deduplication_len.as_usize());
+ assert!(
+ value_3.len() > MAX_INLINE_VIEW_LEN.as_usize()
+ && value_3.len() < max_deduplication_len.as_usize()
+ );
+
+ let value_checker = |v: &[u8], builder: &StringViewBuilder| {
Review Comment:
Rather than asserting internal stae of the builder, how about setitng the
max deduplicate len to something small (like `1`) and then pushing deuplicated
strings in
You can then assert that all the values point at distinct offsets in the
resulting views
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]