dmitry-chirkov-dremio opened a new pull request, #49439:
URL: https://github.com/apache/arrow/pull/49439

   ### Rationale for this change
   The `lpad_utf8_int32_utf8` and `rpad_utf8_int32_utf8` functions have a 
performance inefficiency and a potential memory safety issue:
   1. **Performance**: Single-byte fills iterate character by character when 
a single `memset` would suffice. Multi-byte fills take O(n) copy iterations 
where a doubling strategy needs only O(log n).
   2. **Memory safety**: When the fill string is longer than the padding space 
needed, the code could write more bytes than were allocated. This PR fixes 
that preventatively.
   
   ### What changes are included in this PR?
   1. **Memory safety fix**: Use `std::min(fill_text_len, total_fill_bytes)` 
for the initial copy to prevent overflow
   2. **Fast path**: Add single-byte fill optimization using `memset`
   3. **General path**: Replace character-by-character loop with doubling 
strategy for multi-byte fills
   4. **Tests**: Add comprehensive tests for the new code paths
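
The three code changes above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `FillPadding` and `MakePadding` are hypothetical names, and the sketch works purely on bytes, whereas the real functions must also respect UTF-8 character boundaries when computing how many bytes to fill.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not the PR's actual function): fills
// dst[0, total_fill_bytes) with repetitions of fill_text, cutting the
// last repetition short if it does not fit.
inline void FillPadding(char* dst, std::size_t total_fill_bytes,
                        const char* fill_text, std::size_t fill_text_len) {
  if (total_fill_bytes == 0 || fill_text_len == 0) return;

  // Fast path: a single-byte fill is one memset call.
  if (fill_text_len == 1) {
    std::memset(dst, static_cast<unsigned char>(fill_text[0]),
                total_fill_bytes);
    return;
  }

  // Seed the buffer with one copy of the fill text. Clamping with std::min
  // keeps the write inside the buffer when the fill text is longer than
  // the space to be padded.
  std::size_t filled = std::min(fill_text_len, total_fill_bytes);
  std::memcpy(dst, fill_text, filled);

  // Doubling: copy the already-filled prefix onto the end of itself.
  // The filled length doubles on each pass, so the loop runs O(log n) times.
  // Source [0, chunk) and destination [filled, filled + chunk) never overlap
  // because chunk <= filled.
  while (filled < total_fill_bytes) {
    std::size_t chunk = std::min(filled, total_fill_bytes - filled);
    std::memcpy(dst + filled, dst, chunk);
    filled += chunk;
  }
}

// Convenience wrapper for experimenting: returns the padding as a string.
// Returns an empty string when there is nothing to fill or nothing to
// fill with.
inline std::string MakePadding(std::size_t total_fill_bytes,
                               const std::string& fill_text) {
  if (total_fill_bytes == 0 || fill_text.empty()) return std::string();
  std::string out(total_fill_bytes, '\0');
  FillPadding(&out[0], total_fill_bytes, fill_text.data(), fill_text.size());
  return out;
}
```

The doubling step preserves the pattern because the filled prefix is always a whole number of fill repetitions until the final, possibly truncated, copy.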
   
   ### Are these changes tested?
   Yes. Added tests covering:
   - Large UTF-8 fill characters (4-byte emoji, 3-byte Chinese)
   - Single-byte fill boundaries (1 char and 65536 char padding)
   - Content verification for fill patterns
   - Doubling strategy boundaries (1, 2, 3, 4, 63, 64 fills)
   - Partial fill scenarios (fill text longer than padding needed)
   
   ### Are there any user-facing changes?
   No.


