gyanranjanpanda opened a new pull request, #9373:
URL: https://github.com/apache/arrow-rs/pull/9373
## Summary
This PR fixes a bug where dictionary concatenation incorrectly rejects valid
boundary cases when using the full range of key types.
Fixes #9366
## Problem
Dictionary concatenation currently fails when using the maximum valid number
of distinct values for u8 and u16 key types:
- u8 keys: 256 distinct values (valid range 0-255) are rejected
- u16 keys: 65,536 distinct values (valid range 0-65,535) are rejected
This happens because of two separate off-by-one errors in the boundary
checks.
## Root Cause
### Issue 1: `arrow-select/src/dictionary.rs` (line 292)
The boundary check happens **after** pushing to the indices vector instead
of before:
```rust
// Before (buggy):
K::Native::from_usize(indices.len()) // Checked after push
```
When concatenating 256 values with u8 keys:
- Values 0-254: `indices.len()` = 0-254 (valid)
- Value 255: `indices.len()` = 255 (valid), then push
- Next value: `indices.len()` = 256 (invalid for u8, but should be valid)
### Issue 2: `arrow-data/src/transform/mod.rs` (line 197)
The check validates the length instead of the maximum index:
```rust
// Before (buggy):
let _: $dt = max.try_into().ok()?; // Checks length, not max index
```
For 256 values: `max=256` doesn't fit in u8, but `max_index=255` does.
## Solution
### Fix 1: Check before pushing to indices
```rust
let next_idx = indices.len();
let key = K::Native::from_usize(next_idx)
.ok_or_else(|| ArrowError::DictionaryKeyOverflowError)?;
indices.push((dictionary_idx, value_idx));
Ok(key)
```
### Fix 2: Validate maximum index instead of length
```rust
if max > 0 {
let max_index = max - 1;
let _: $dt = max_index.try_into().ok()?;
}
```
## Testing
Added comprehensive tests covering:
**Unit tests** (13 total):
- u8 boundary: 256 values succeed, 257 fail
- u16 boundary: 65,536 values succeed, 65,537 fail
- Overlap handling with deduplication
**Integration tests** (4 total):
- End-to-end concat with u8 boundary (256 values)
- End-to-end concat with u16 boundary (65,536 values)
- Verification that errors are returned instead of panics
- Overflow cases correctly fail
All existing tests continue to pass (46 concat tests, 13 dictionary tests).
## Impact
This fix enables:
- Full use of dictionary key ranges in arrow-rs
- Removal of workarounds in downstream projects
- Proper error handling instead of panics at boundary values
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]