Re:Re: [DISCUSS] FLIP-568: Strict BYTES-to-STRING CAST with UTF-8 Validation Utilities

Xuyang Fri, 20 Mar 2026 00:44:44 -0700

Hi, Gustavo.
Great catch! Thanks for driving this FLIP. Overall LGTM. I just have two minor 
points I'd like to confirm with you.
1. Should we also add an overload `fromUtf8Bytes(byte[], int, int)` to
StringData?
2. Callers like `ColumnarRowData#getString` and `ColumnarArrayData#getString`
call `StringData.fromBytes` directly. Should these call sites be migrated in a
follow-up, or intentionally left as-is?
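
To make point 1 concrete, here is a minimal sketch of what an offset/length
overload with strict validation could look like. It uses only the JDK
`CharsetDecoder`; the class name `StrictUtf8`, the choice of
`IllegalArgumentException`, and the method bodies are assumptions for
illustration, not the FLIP's actual API or Flink's `StringData`
implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the proposed utilities; not Flink code.
final class StrictUtf8 {

    // Decodes bytes[offset, offset + length) as UTF-8 and throws on invalid
    // input instead of substituting U+FFFD.
    public static String fromUtf8Bytes(byte[] bytes, int offset, int length) {
        try {
            return StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes, offset, length))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumption: surface the failure as an unchecked exception.
            throw new IllegalArgumentException("Invalid UTF-8 input", e);
        }
    }

    // Convenience overload covering the whole array.
    public static String fromUtf8Bytes(byte[] bytes) {
        return fromUtf8Bytes(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        // 'x', 'o', 'k', followed by 0xFF, which is never valid in UTF-8.
        byte[] buf = {0x78, 0x6F, 0x6B, (byte) 0xFF};

        // The slice [1, 3) is valid even though the full array is not.
        System.out.println(fromUtf8Bytes(buf, 1, 2)); // prints "ok"

        try {
            fromUtf8Bytes(buf);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The offset/length variant would let columnar readers validate a slice of a
shared buffer without copying it first, which is the main reason to ask for it.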






--

    Best!
    Xuyang



At 2026-03-19 22:37:28, "Timo Walther" <[email protected]> wrote:
>Hi Gustavo,
>
>thank you for this excellent design document. And thanks for discovering 
>this data loss and driving the investigation. We should definitely fix 
>this shortcoming. Also looking at other vendors, it is definitely a cause 
>for false assumptions that lead to hard-to-debug inconsistencies.
>
>+1 for this proposal.
>
>Cheers,
>Timo
>
>
>On 19.03.26 15:23, Gustavo de Morais wrote:
>> Hi everyone,
>> 
>> Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte
>> with U+FFFD (the Unicode replacement character). The substitution is
>> irreversible and produces no warning - the pipeline keeps running while
>> data is permanently corrupted downstream. It also means that a BYTES →
>> STRING → BYTES round trip does not always return the original bytes, which
>> prevents the engine from applying certain optimizations, such as
>> preserving upsert keys across such CASTs.
>> 
>> I'd like to start a discussion around defining and improving the default
>> behavior. I've written a short FLIP [1] proposing new utility functions to
>> handle this explicitly - similar to what other engines like Spark already
>> do - and changing the default behavior to throw an error instead of
>> silently corrupting data, while giving users clear options to deal with
>> invalid bytes.
>> 
>> Looking forward to your feedback and thoughts.
>> 
>> Kind regards,
>> Gustavo
>> 
>> [1]
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
>>
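
The lossy round trip described in the quoted proposal can be reproduced with
plain JDK string decoding, which also substitutes U+FFFD for malformed input.
This is only an analogy under that assumption: the class `LossyRoundTrip` and
its helper are hypothetical and do not call Flink's CAST implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; mirrors the BYTES -> STRING -> BYTES round trip.
final class LossyRoundTrip {

    // new String(byte[], Charset) replaces malformed input with U+FFFD,
    // so re-encoding may not return the original bytes.
    public static byte[] roundTrip(byte[] in) {
        String decoded = new String(in, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {0x61, (byte) 0xFF}; // 'a' followed by an invalid byte
        byte[] b = {0x61, (byte) 0xFE}; // 'a' followed by a different invalid byte

        // The invalid byte comes back as the 3-byte encoding of U+FFFD.
        System.out.println(Arrays.equals(a, roundTrip(a))); // prints "false"

        // Two distinct inputs collapse to the same output, which is why a
        // key that was unique as BYTES may stop being unique after the CAST.
        System.out.println(Arrays.equals(roundTrip(a), roundTrip(b))); // prints "true"
    }
}
```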
