This is an automated email from the ASF dual-hosted git repository.
Dandandan pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new 3790d88b5e Pre-size dedup HashTable in
GenericByteDictionaryBuilder::with_capacity (#9908)
3790d88b5e is described below
commit 3790d88b5e03a0b68c68a288770c980b63772ccb
Author: Sebastian Rabenhorst <[email protected]>
AuthorDate: Tue May 5 20:22:32 2026 +0200
Pre-size dedup HashTable in GenericByteDictionaryBuilder::with_capacity
(#9908)
# Which issue does this PR close?
- Closes #9907.
# Rationale for this change
`GenericByteDictionaryBuilder::with_capacity` left the internal `dedup`
`HashTable` at default capacity (0), forcing 0 → 4 → 8 → 16 → 32 → 64 →
… resize+rehash cycles on the first inserts even when callers passed a
non-zero `value_capacity`.
`new()` already sizes `dedup` from `keys_builder.capacity()`;
`with_capacity` (the variant intended to amortize allocations) should do
the same from `value_capacity`. As-is, callers reaching for
`with_capacity` to avoid alloc churn get the opposite of what they asked
for on the dedup path.
See #9907 for the full context including a CPU profile showing
`GenericByteDictionaryBuilder::append` at 7.24% cum where ~half is the
resize+rehash cost on the first batch.
# What changes are included in this PR?
- `arrow-array/src/builder/generic_bytes_dictionary_builder.rs`: change
`dedup: Default::default()` to `dedup:
HashTable::with_capacity(value_capacity)` in `with_capacity`. One-line
behavioral change.
- Regression test `test_with_capacity_presizes_dedup` asserting
`dedup.capacity() >= value_capacity` after construction.
This fixes `StringDictionaryBuilder`, `BinaryDictionaryBuilder`, and the
`Large*` variants (all type aliases over
`GenericByteDictionaryBuilder`).
# Are these changes tested?
Yes. New test in
`arrow-array/src/builder/generic_bytes_dictionary_builder.rs::tests`:
```rust
#[test]
fn test_with_capacity_presizes_dedup() {
// `with_capacity` must size the dedup `HashTable` from
`value_capacity`,
// otherwise the first inserts force a chain of resize+rehash cycles.
let value_capacity = 128;
let builder =
GenericByteDictionaryBuilder::<Int32Type,
GenericStringType<i32>>::with_capacity(
256,
value_capacity,
value_capacity * 32,
);
assert!(
builder.dedup.capacity() >= value_capacity,
"dedup HashTable not pre-sized: got capacity {}, expected >= {}",
builder.dedup.capacity(),
value_capacity,
);
}
```
Verified the test fails on unpatched `main` (`got capacity 0, expected
>= 128`) and passes with the fix. Full
`generic_bytes_dictionary_builder` test suite (17 tests) still passes.
`cargo clippy --lib -- -D warnings` is clean.
I did not add a benchmark — the asymptotic behavior is what matters here
(eliminating O(value_capacity) resize+rehash work on first use), and the
existing arrow-rs benchmark suite already exercises dictionary builders.
Happy to add one if reviewers prefer.
# Are there any user-facing changes?
No public API changes. Behavior change is strictly an allocation/perf
improvement: callers of `with_capacity` who previously got a
zero-capacity dedup table now get one sized to `value_capacity`. This
means a slightly larger upfront allocation in exchange for avoiding the
resize+rehash chain. Callers passing very large `value_capacity` values
intentionally to hint the values buffer (but not expecting to insert
that many distinct values) will now also pre-allocate that much dedup
capacity — this matches the documented contract of the parameter (`"the
number of distinct dictionary values"`).
---
## AI-assisted submission disclosure
Per `CONTRIBUTING.md` — this PR was prepared with AI assistance
(Claude). The bug was found while profiling a downstream Arrow producer;
the fix and regression test were reviewed and verified locally. The
change is one line plus a focused test, and I am happy to debug and own
follow-ups.
Co-authored-by: AI <[email protected]>
---
.../src/builder/generic_bytes_dictionary_builder.rs | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
b/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
index 956014a72b..4ce451aec8 100644
--- a/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
+++ b/arrow-array/src/builder/generic_bytes_dictionary_builder.rs
@@ -84,7 +84,7 @@ where
) -> Self {
Self {
state: Default::default(),
- dedup: Default::default(),
+ dedup: HashTable::with_capacity(value_capacity),
keys_builder: PrimitiveBuilder::with_capacity(keys_capacity),
values_builder:
GenericByteBuilder::<T>::with_capacity(value_capacity, data_capacity),
}
@@ -647,6 +647,25 @@ mod tests {
test_bytes_dictionary_builder::<GenericBinaryType<i32>>(vec![b"abc",
b"def"]);
}
+ #[test]
+ fn test_with_capacity_presizes_dedup() {
+ // `with_capacity` must size the dedup `HashTable` from
`value_capacity`,
+ // otherwise the first inserts force a chain of resize+rehash cycles.
+ let value_capacity = 128;
+ let builder =
+ GenericByteDictionaryBuilder::<Int32Type,
GenericStringType<i32>>::with_capacity(
+ 256,
+ value_capacity,
+ value_capacity * 32,
+ );
+ assert!(
+ builder.dedup.capacity() >= value_capacity,
+ "dedup HashTable not pre-sized: got capacity {}, expected >= {}",
+ builder.dedup.capacity(),
+ value_capacity,
+ );
+ }
+
fn test_bytes_dictionary_builder_finish_cloned<T>(values: Vec<&T::Native>)
where
T: ByteArrayType,