alamb commented on code in PR #8990:
URL: https://github.com/apache/arrow-rs/pull/8990#discussion_r2617191664


##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -107,10 +108,25 @@ impl<T: ByteViewType + ?Sized> GenericByteViewBuilder<T> {
                 current_size: STARTING_BLOCK_SIZE,
             },
             string_tracker: None,
+            max_deduplication_len: None,
             phantom: Default::default(),
         }
     }
 
+    /// Configure max deduplication length when deduplicating strings while 
building the array.
+    /// Default is [`MAX_INLINE_VIEW_LEN`] bytes.
+    /// See <https://github.com/apache/arrow-rs/issues/7187> for more details 
on the implications.
+    pub fn with_max_deduplication_len(self, max_deduplication_len: u32) -> 
Self {
+        debug_assert!(

Review Comment:
   Why not leave use `assert` rather than `debug_assert`? 
   
   We should also document in the comments that the parameter must be greater 
than zero if we are going to assert



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -636,8 +658,58 @@ pub fn make_view(data: &[u8], block_id: u32, offset: u32) 
-> u128 {
 mod tests {
     use core::str;
 
+    use arrow_buffer::ArrowNativeType;
+
     use super::*;
 
+    #[test]
+    fn test_string_max_deduplication_len() {
+        let value_1 = "short";
+        let value_2 = "not so similar string but long";
+        let value_3 = "1234567890123";
+
+        let mut builder = StringViewBuilder::new()
+            .with_deduplicate_strings()
+            .with_max_deduplication_len(MAX_INLINE_VIEW_LEN * 2);
+        // safe to unwrap
+        let max_deduplication_len = builder.max_deduplication_len.unwrap();
+        assert!(builder.string_tracker.is_some());
+        assert!(max_deduplication_len > MAX_INLINE_VIEW_LEN);

Review Comment:
   I am no sure that we need to test the getters here  -- also when unwrapping 
if you want to comment that it is safe, please explain *why* it is safe to 
unwrap (the fact that it is safe is evident from the code and the fact that 
this is run during tests)
   
   



##########
arrow-array/src/builder/generic_bytes_view_builder.rs:
##########
@@ -636,8 +658,58 @@ pub fn make_view(data: &[u8], block_id: u32, offset: u32) 
-> u128 {
 mod tests {
     use core::str;
 
+    use arrow_buffer::ArrowNativeType;
+
     use super::*;
 
+    #[test]
+    fn test_string_max_deduplication_len() {
+        let value_1 = "short";
+        let value_2 = "not so similar string but long";
+        let value_3 = "1234567890123";
+
+        let mut builder = StringViewBuilder::new()
+            .with_deduplicate_strings()
+            .with_max_deduplication_len(MAX_INLINE_VIEW_LEN * 2);
+        // safe to unwrap
+        let max_deduplication_len = builder.max_deduplication_len.unwrap();
+        assert!(builder.string_tracker.is_some());
+        assert!(max_deduplication_len > MAX_INLINE_VIEW_LEN);
+        assert!(value_1.len() < MAX_INLINE_VIEW_LEN.as_usize());
+        assert!(value_2.len() > max_deduplication_len.as_usize());
+        assert!(
+            value_3.len() > MAX_INLINE_VIEW_LEN.as_usize()
+                && value_3.len() < max_deduplication_len.as_usize()
+        );
+
+        let value_checker = |v: &[u8], builder: &StringViewBuilder| {

Review Comment:
   Rather than asserting internal stae of the builder, how about setitng the 
max deduplicate len to something small (like `1`) and then pushing deuplicated 
strings in 
   
   You can then assert that all the values point at distinct offsets in the 
resulting views



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to