dmitry-chirkov-dremio commented on issue #49614: URL: https://github.com/apache/arrow/issues/49614#issuecomment-4148638537
**Risk analysis of changing base64_decode behavior**

All but one of the callers of `arrow::util::base64_decode` in the C++ codebase consume machine-generated base64 that was previously produced by `base64_encode`. These are pure roundtrip scenarios that would be unaffected by stricter validation:

- **parquet/arrow/schema.cc** - decodes an Arrow schema serialized into Parquet metadata via `base64_encode`
- **parquet/encryption/key_toolkit_internal.cc** - decodes encrypted key material produced by `base64_encode`
- **parquet/encryption/file_key_unwrapper.cc** - decodes a KEK ID produced by `base64_encode`
- **arrow/flight/flight_test.cc** - decodes base64 credentials in a test

The only caller that can receive arbitrary (potentially malformed) input is Gandiva's `gdv_fn_base64_decode_utf8`, which is where this issue was discovered. It exposes `base64_decode` as a user-facing SQL function (`unbase64`). Today, if a user passes malformed base64 to this function, they get silent partial results, which is arguably a correctness bug, not a feature.

Changing `base64_decode` to return an error on invalid characters (or changing the signature to `Result<std::string>`) would require updating all 5 call sites, but the actual behavioral change is limited to Gandiva's SQL function, since the other callers never pass invalid input.
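
To make the proposed stricter behavior concrete, here is a minimal standalone sketch of a decoder that rejects malformed input instead of returning partial results. It is not Arrow's implementation: `StrictBase64Decode` is a hypothetical name, `std::optional<std::string>` stands in for `Result<std::string>` so the snippet compiles without Arrow, and it assumes the padded standard alphabet. Whether unpadded input or non-canonical trailing bits should also be rejected is a separate decision that this sketch does not address.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical strict decoder, NOT arrow::util::base64_decode.
// std::optional is a stand-in for Result<std::string>: an empty optional
// means "invalid input" rather than a silently truncated result.
std::optional<std::string> StrictBase64Decode(std::string_view input) {
  static constexpr std::string_view kAlphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

  // Assumption: padded base64, so the length must be a multiple of 4.
  if (input.size() % 4 != 0) return std::nullopt;

  std::string out;
  out.reserve(input.size() / 4 * 3);

  for (std::size_t i = 0; i < input.size(); i += 4) {
    std::uint32_t chunk = 0;  // accumulates 4 x 6 bits = 24 bits
    int padding = 0;
    for (std::size_t j = 0; j < 4; ++j) {
      const char c = input[i + j];
      chunk <<= 6;
      if (c == '=') {
        // Padding may only appear in the last two slots of the final group.
        if (i + 4 != input.size() || j < 2) return std::nullopt;
        ++padding;
        continue;
      }
      // A data character after padding (e.g. "Zg=A") is malformed.
      if (padding > 0) return std::nullopt;
      const auto pos = kAlphabet.find(c);
      // Any character outside the alphabet is an error, not a stop marker.
      if (pos == std::string_view::npos) return std::nullopt;
      chunk |= static_cast<std::uint32_t>(pos);
    }
    out.push_back(static_cast<char>((chunk >> 16) & 0xFF));
    if (padding < 2) out.push_back(static_cast<char>((chunk >> 8) & 0xFF));
    if (padding < 1) out.push_back(static_cast<char>(chunk & 0xFF));
  }
  return out;
}
```

With this sketch, `StrictBase64Decode("aGVsbG8=")` yields `"hello"`, while `StrictBase64Decode("aGVs!G8=")` yields an empty optional that the caller has to handle explicitly. Roundtrip callers would only ever see the first case, which matches the observation that the behavioral change is confined to the user-facing SQL function.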
