aryansri05 opened a new pull request, #49513:
URL: https://github.com/apache/arrow/pull/49513
Refers to #49502.
### Rationale for this change
When writing large dictionary-encoded Parquet data with
`ARROW_LARGE_MEMORY_TESTS=ON`, two tests were failing:

- `TestColumnWriter.WriteLargeDictEncodedPage` — expected 2 pages, got 7501
- `TestColumnWriter.ThrowsOnDictIndicesTooLarge` — expected a
  `ParquetException`, but nothing was thrown

The root cause is that `PutIndicesTyped()` in `DictEncoderImpl` had no check
for the total number of buffered dictionary indices exceeding `INT32_MAX`.
The existing overflow check in `FlushValues()` only validates the buffer size
in bytes, not the index count, so it never triggered for this case.
### What changes are included in this PR?
Added an overflow check in `DictEncoderImpl::PutIndicesTyped()` immediately
after `buffered_indices_.resize()`:

```cpp
if (buffered_indices_.size() >
    static_cast<size_t>(std::numeric_limits<int32_t>::max())) {
  throw ParquetException("Total dictionary indices count (",
                         buffered_indices_.size(),
                         ") exceeds maximum int value");
}
```
This makes the encoder throw a `ParquetException` with a message containing
"exceeds maximum int value" when the index count overflows, which is exactly
what `ThrowsOnDictIndicesTooLarge` expects.
### Are these changes tested?
Yes — the existing tests in `column_writer_test.cc` cover this fix:
- `TestColumnWriter.ThrowsOnDictIndicesTooLarge`
- `TestColumnWriter.WriteLargeDictEncodedPage`
Both tests were failing before this fix and should pass after.
Tests require building with `ARROW_LARGE_MEMORY_TESTS=ON`.
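A build-and-run sketch for reproducing this locally, assuming a standard Arrow C++ source checkout (the exact `ctest` filter name may differ in your build):

```shell
# Configure with tests and large-memory tests enabled
# (ARROW_LARGE_MEMORY_TESTS gates the two tests above).
cmake -S cpp -B cpp/build \
  -DARROW_BUILD_TESTS=ON \
  -DARROW_PARQUET=ON \
  -DARROW_LARGE_MEMORY_TESTS=ON

cmake --build cpp/build

# Run the Parquet column writer tests; adjust the -R filter as needed.
ctest --test-dir cpp/build -R parquet
```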
This PR contains a "Critical Fix": previously, writing dictionary-encoded
data with more than `INT32_MAX` indices would silently produce incorrect
output (a wrong page count) instead of raising an error. With this fix, the
encoder correctly throws a `ParquetException` in that scenario.