[
https://issues.apache.org/jira/browse/FLINK-39600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Timo Walther closed FLINK-39600.
--------------------------------
Fix Version/s: 2.4.0
Resolution: Fixed
> FLIP-568: Strict BYTES-to-STRING CAST with UTF-8 Validation Utilities
> ---------------------------------------------------------------------
>
> Key: FLINK-39600
> URL: https://issues.apache.org/jira/browse/FLINK-39600
> Project: Flink
> Issue Type: New Feature
> Components: Table SQL / API
> Reporter: Gustavo de Morais
> Assignee: Gustavo de Morais
> Priority: Critical
> Fix For: 2.4.0
>
>
> {{CAST(bytes AS STRING)}} today silently replaces invalid UTF-8 with the
> Unicode replacement character {{U+FFFD}}. The substitution is
> irreversible and produces no warning - pipelines keep running while data is
> permanently corrupted downstream. This also blocks engine optimizations that
> need injective guarantees (e.g. upsert key propagation through
> {{BINARY -> STRING}}).
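
To illustrate why the silent substitution is not injective, here is a minimal sketch in plain Java. It is not Flink code: it assumes only that the JDK's lenient decoder ({{new String(bytes, UTF_8)}}) behaves like the legacy cast described above, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Flink.
final class LossyDecodeDemo {

    // Lenient JDK decoding: malformed input is replaced with U+FFFD
    // instead of raising an error.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF and 0xFE can never appear in well-formed UTF-8.
        byte[] a = {0x41, (byte) 0xFF};
        byte[] b = {0x41, (byte) 0xFE};

        String sa = lenientDecode(a);
        String sb = lenientDecode(b);

        // Two different inputs collapse to the same string "A\uFFFD".
        System.out.println(sa.equals(sb)); // true

        // Re-encoding does not give the original bytes back.
        System.out.println(Arrays.equals(a, sa.getBytes(StandardCharsets.UTF_8))); // false
    }
}
```

Because distinct byte arrays map to one string, a key that is unique as bytes may no longer be unique after the conversion.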
> [FLIP-568|https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities]
> addresses this by:
> # Making {{CAST(bytes AS STRING)}} throw on invalid UTF-8. {{TRY_CAST}}
> returns {{NULL}}. A migration flag restores the legacy behavior.
> # Adding two SQL functions:
> ** {{IS_VALID_UTF8(bytes) -> BOOLEAN}} for routing invalid records to a
> dead-letter sink
> ** {{MAKE_VALID_UTF8(bytes) -> STRING}} as the explicit, opt-in substitution
> recipe
> # Adding {{StringData.fromUtf8Bytes(byte[])}} connector API that validates
> at ingestion and throws on invalid input.
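
The semantics listed above can be sketched with the JDK's {{CharsetDecoder}}. This is an illustration under stated assumptions, not the FLIP-568 implementation: the class {{StrictUtf8Sketch}} and its methods are hypothetical stand-ins for the strict cast, {{TRY_CAST}}, {{IS_VALID_UTF8}} and {{MAKE_VALID_UTF8}}, and the exception type Flink actually throws may differ.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Flink.
final class StrictUtf8Sketch {

    // Strict decode (stand-in for the new CAST): fails on malformed input.
    static String strictDecode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Stand-in for TRY_CAST: null instead of an exception.
    static String tryDecode(byte[] bytes) {
        try {
            return strictDecode(bytes);
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Stand-in for IS_VALID_UTF8: true only for well-formed input.
    static boolean isValidUtf8(byte[] bytes) {
        return tryDecode(bytes) != null;
    }

    // Stand-in for MAKE_VALID_UTF8: explicit, opt-in U+FFFD substitution.
    static String makeValidUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {0x41, (byte) 0xFF};

        System.out.println(isValidUtf8(good));        // true
        System.out.println(isValidUtf8(bad));         // false
        System.out.println(tryDecode(bad) == null);   // true
        System.out.println(makeValidUtf8(bad).equals("A\uFFFD")); // true
    }
}
```

In this sketch a pipeline would check {{isValidUtf8}} first to route bad records to a dead-letter sink, and call {{makeValidUtf8}} only where lossy substitution is an accepted trade-off.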
--
This message was sent by Atlassian Jira
(v8.20.10#820010)