This is an automated email from the ASF dual-hosted git repository.

Dandandan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git


The following commit(s) were added to refs/heads/main by this push:
     new 3790d88b5e Pre-size dedup HashTable in 
GenericByteDictionaryBuilder::with_capacity (#9908)
3790d88b5e is described below

commit 3790d88b5e03a0b68c68a288770c980b63772ccb
Author: Sebastian Rabenhorst <[email protected]>
AuthorDate: Tue May 5 20:22:32 2026 +0200

    Pre-size dedup HashTable in GenericByteDictionaryBuilder::with_capacity 
(#9908)
    
    # Which issue does this PR close?
    
    - Closes #9907.
    
    # Rationale for this change
    
    `GenericByteDictionaryBuilder::with_capacity` left the internal `dedup`
    `HashTable` at default capacity (0), forcing 0 → 4 → 8 → 16 → 32 → 64 →
    … resize+rehash cycles on the first inserts even when callers passed a
    non-zero `value_capacity`.
    
    `new()` already sizes `dedup` from `keys_builder.capacity()`;
    `with_capacity` (the variant intended to amortize allocations) should do
    the same from `value_capacity`. As-is, callers reaching for
    `with_capacity` to avoid alloc churn get the opposite of what they asked
    for on the dedup path.
    
    See #9907 for the full context including a CPU profile showing
    `GenericByteDictionaryBuilder::append` at 7.24% cum where ~half is the
    resize+rehash cost on the first batch.
    
    # What changes are included in this PR?
    
    - `arrow-array/src/builder/generic_bytes_dictionary_builder.rs`: change
    `dedup: Default::default()` to `dedup:
    HashTable::with_capacity(value_capacity)` in `with_capacity`. One-line
    behavioral change.
    - Regression test `test_with_capacity_presizes_dedup` asserting
    `dedup.capacity() >= value_capacity` after construction.
    
    This fixes `StringDictionaryBuilder`, `BinaryDictionaryBuilder`, and the
    `Large*` variants (all type aliases over
    `GenericByteDictionaryBuilder`).
    
    # Are these changes tested?
    
    Yes. New test in
    `arrow-array/src/builder/generic_bytes_dictionary_builder.rs::tests`:
    
    ```rust
    #[test]
    fn test_with_capacity_presizes_dedup() {
        // `with_capacity` must size the dedup `HashTable` from 
`value_capacity`,
        // otherwise the first inserts force a chain of resize+rehash cycles.
        let value_capacity = 128;
        let builder =
            GenericByteDictionaryBuilder::<Int32Type, 
GenericStringType<i32>>::with_capacity(
                256,
                value_capacity,
                value_capacity * 32,
            );
        assert!(
            builder.dedup.capacity() >= value_capacity,
            "dedup HashTable not pre-sized: got capacity {}, expected >= {}",
            builder.dedup.capacity(),
            value_capacity,
        );
    }
    ```
    
    Verified the test fails on unpatched `main` (`got capacity 0, expected
    >= 128`) and passes with the fix. Full
    `generic_bytes_dictionary_builder` test suite (17 tests) still passes.
    `cargo clippy --lib -- -D warnings` is clean.
    
    I did not add a benchmark — the asymptotic behavior is what matters here
    (eliminating O(value_capacity) resize+rehash work on first use), and the
    existing arrow-rs benchmark suite already exercises dictionary builders.
    Happy to add one if reviewers prefer.
    
    # Are there any user-facing changes?
    
    No public API changes. Behavior change is strictly an allocation/perf
    improvement: callers of `with_capacity` who previously got a
    zero-capacity dedup table now get one sized to `value_capacity`. This
    means a slightly larger upfront allocation in exchange for avoiding the
    resize+rehash chain. Callers passing very large `value_capacity` values
    intentionally to hint the values buffer (but not expecting to insert
    that many distinct values) will now also pre-allocate that much dedup
    capacity — this matches the documented contract of the parameter (`"the
    number of distinct dictionary values"`).
    
    ---
    
    ## AI-assisted submission disclosure
    
    Per `CONTRIBUTING.md` — this PR was prepared with AI assistance
    (Claude). The bug was found while profiling a downstream Arrow producer;
    the fix and regression test were reviewed and verified locally. The
    change is one line plus a focused test, and I am happy to debug and own
    follow-ups.
    
    Co-authored-by: AI <[email protected]>
---
 .../src/builder/generic_bytes_dictionary_builder.rs | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/arrow-array/src/builder/generic_bytes_dictionary_builder.rs 
b/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
index 956014a72b..4ce451aec8 100644
--- a/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
+++ b/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
@@ -84,7 +84,7 @@ where
     ) -> Self {
         Self {
             state: Default::default(),
-            dedup: Default::default(),
+            dedup: HashTable::with_capacity(value_capacity),
             keys_builder: PrimitiveBuilder::with_capacity(keys_capacity),
             values_builder: 
GenericByteBuilder::<T>::with_capacity(value_capacity, data_capacity),
         }
@@ -647,6 +647,25 @@ mod tests {
         test_bytes_dictionary_builder::<GenericBinaryType<i32>>(vec![b"abc", 
b"def"]);
     }
 
+    #[test]
+    fn test_with_capacity_presizes_dedup() {
+        // `with_capacity` must size the dedup `HashTable` from 
`value_capacity`,
+        // otherwise the first inserts force a chain of resize+rehash cycles.
+        let value_capacity = 128;
+        let builder =
+            GenericByteDictionaryBuilder::<Int32Type, 
GenericStringType<i32>>::with_capacity(
+                256,
+                value_capacity,
+                value_capacity * 32,
+            );
+        assert!(
+            builder.dedup.capacity() >= value_capacity,
+            "dedup HashTable not pre-sized: got capacity {}, expected >= {}",
+            builder.dedup.capacity(),
+            value_capacity,
+        );
+    }
+
     fn test_bytes_dictionary_builder_finish_cloned<T>(values: Vec<&T::Native>)
     where
         T: ByteArrayType,

Reply via email to