lriggs opened a new pull request, #50187: URL: https://github.com/apache/arrow/pull/50187
### Rationale for this change

Gandiva's `REPLACE` hardcodes a 65535-byte output buffer, throwing `Buffer overflow for output string` whenever the result exceeds 64 KB. The cap is arbitrary: Gandiva's variable-length output column already grows dynamically and is only bounded by the int32 offset width (~2 GB). Real queries that replace into large concatenated/aggregated strings fail unnecessarily.

### What changes are included in this PR

`replace_utf8_utf8_utf8` now sizes the output buffer to the exact result instead of using a fixed cap. The output length of a replace is deterministic:

`out_len = text_len + num_matches * (to_str_len - from_str_len)`

The wrapper does a single counting pass over the input to find the number of non-overlapping matches of `from_str` (mirroring the match loop already used in the implementation), computes the exact size in `gdv_int64` to avoid intermediate overflow, and passes that as `max_length`.

The internal `replace_with_max_len_utf8_utf8_utf8` is unchanged. Its bounds checks now act purely as a correctness backstop (they should never fire with an exact bound), and its explicit-max-length signature remains for the existing unit tests.

When `to` is shorter than `from`, the result shrinks and `max_length <= text_len`, so the shrinking path is sized correctly too.

### Are these changes tested?

Yes. Added regression cases to `TestStringOps.TestReplace` in `string_ops_test.cc`:

- A 35000-char `'X'` input with `X` → `XY`, producing a 70000-byte result (previously overflowed at 65535); asserts no error and exact length/content.
- A 70000-char shrinking case (`XX` → `X`) to cover the shrink path on a >64 KB input.

Full precompiled suite passes locally (132/132), including the existing explicit-`max_len` overflow tests, which call the internal function directly and are unaffected.

### Are there any user-facing changes?

`REPLACE` now succeeds on results larger than 64 KB instead of erroring. No API or signature changes.
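For readers who want the sizing step from the PR description in concrete form, here is a minimal standalone sketch of the counting pass and the 64-bit length computation. It is not the Gandiva source: the names `CountNonOverlappingMatches` and `ExactReplaceLength` are hypothetical. Treating an empty `from` as zero matches and returning `-1` when the result would not fit an int32 offset are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

// Hypothetical helper (not Gandiva's actual code): counts non-overlapping
// occurrences of `from` in `text`, scanning left to right.
int64_t CountNonOverlappingMatches(const char* text, int32_t text_len,
                                   const char* from, int32_t from_len) {
  assert(text_len >= 0 && from_len >= 0);
  // Assumption: an empty pattern never matches, so the text is left as-is.
  if (from_len == 0 || from_len > text_len) return 0;
  int64_t matches = 0;
  int32_t i = 0;
  while (i <= text_len - from_len) {
    if (std::memcmp(text + i, from, static_cast<size_t>(from_len)) == 0) {
      ++matches;
      i += from_len;  // skip the whole match so occurrences cannot overlap
    } else {
      ++i;
    }
  }
  return matches;
}

// Hypothetical helper: exact byte length of the replaced string, computed in
// 64-bit arithmetic so the intermediate product cannot overflow. Returns -1
// (an illustrative convention) if the result exceeds the int32 offset range.
int64_t ExactReplaceLength(const char* text, int32_t text_len, const char* from,
                           int32_t from_len, int32_t to_len) {
  const int64_t matches =
      CountNonOverlappingMatches(text, text_len, from, from_len);
  const int64_t delta =
      static_cast<int64_t>(to_len) - static_cast<int64_t>(from_len);
  const int64_t out_len = static_cast<int64_t>(text_len) + matches * delta;
  if (out_len > std::numeric_limits<int32_t>::max()) return -1;
  return out_len;
}
```

With the two regression inputs from the PR, this sketch gives 70000 for the 35000-char `X` → `XY` case and 35000 for the 70000-char `XX` → `X` case. Because `delta` is negative when `to` is shorter than `from`, the same formula covers the shrinking path without a separate branch.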
