[
https://issues.apache.org/jira/browse/FLINK-39125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gustavo de Morais updated FLINK-39125:
--------------------------------------
Fix Version/s: (was: 2.3.0)
> Support injective casts from BINARY/VARBINARY to CHAR/VARCHAR for upsert key
> preservation
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-39125
> URL: https://issues.apache.org/jira/browse/FLINK-39125
> Project: Flink
> Issue Type: Improvement
> Components: Table SQL / Planner
> Affects Versions: 2.2.0
> Reporter: Gustavo de Morais
> Assignee: Gustavo de Morais
> Priority: Major
>
> When users cast a VARBINARY key column to VARCHAR, the upsert key uniqueness
> is lost because the cast is not recognized as injective.
> UTF-8 decoding is itself injective when the input is valid UTF-8 - distinct
> byte sequences always produce distinct strings - so we can safely mark these
> casts as injective when the string target has sufficient capacity. The cast
> is injective under the following conditions:
> * {{VARBINARY(MAX) → VARCHAR(MAX)}}: both sides are unbounded
> * {{VARBINARY(n) → VARCHAR(m)}} where {{m >= n}}: every UTF-8 character
> occupies at least 1 byte, so {{n}} bytes always decode to at most {{n}}
> characters
> * Bounded source to unbounded ({{MAX}}) target: always fits
> This applies to all four cross-family combinations:
> {{BINARY}}/{{VARBINARY}} to {{CHAR}}/{{VARCHAR}}.
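
The length conditions listed above can be sketched as a single predicate. This is only an illustration: the class name `BinaryToStringCastRule`, the method `isInjective`, and the use of `Integer.MAX_VALUE` to stand for an unbounded ({{MAX}}) length are assumptions made for this example, not the planner's actual API. It also presumes that decoding is strict, i.e. that invalid UTF-8 is rejected rather than replaced.

```java
// Hypothetical sketch of the length conditions; not Flink planner code.
final class BinaryToStringCastRule {

    // Assumption for this sketch: an unbounded (MAX) length is modelled
    // as Integer.MAX_VALUE.
    static final int MAX = Integer.MAX_VALUE;

    /**
     * Returns true if a cast from a binary type holding up to sourceBytes
     * bytes to a string type holding up to targetChars characters can never
     * truncate, which is the capacity requirement for injectivity.
     */
    static boolean isInjective(int sourceBytes, int targetChars) {
        if (targetChars == MAX) {
            return true; // unbounded target: any source always fits
        }
        if (sourceBytes == MAX) {
            return false; // unbounded source, bounded target: may truncate
        }
        // n bytes decode to at most n characters, so m >= n is sufficient
        return targetChars >= sourceBytes;
    }

    public static void main(String[] args) {
        System.out.println(isInjective(16, 32));  // VARBINARY(16) -> VARCHAR(32)
        System.out.println(isInjective(32, 16));  // VARBINARY(32) -> VARCHAR(16)
        System.out.println(isInjective(16, MAX)); // VARBINARY(16) -> VARCHAR(MAX)
    }
}
```

The same predicate would apply to all four {{BINARY}}/{{VARBINARY}} to {{CHAR}}/{{VARCHAR}} combinations, since only the declared lengths matter here.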
> *Blocker:* This is currently not safe to implement. Flink's {{CAST(bytes AS
> STRING)}} silently replaces invalid UTF-8 byte sequences with the Unicode
> replacement character {{U+FFFD}} ({{\uFFFD}}), making the cast
> non-injective - two distinct byte arrays can produce the same string. This
> must be fixed first.
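
The collision described in the blocker can be reproduced with plain JDK decoding. The sketch below uses `new String(bytes, UTF_8)` as a stand-in for lenient decoding; that is an assumption for illustration and not necessarily the code path Flink's cast takes. The class name `Utf8InjectivityDemo` and its helper methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo; uses JDK decoding as a stand-in for the cast.
final class Utf8InjectivityDemo {

    // Lenient decoding: malformed input is replaced with U+FFFD.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict check: returns false instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] invalidA = {(byte) 0xFF};
        byte[] invalidB = {(byte) 0xFE};
        // The valid UTF-8 encoding of U+FFFD itself.
        byte[] validReplacement = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

        // Three distinct byte arrays, one resulting string.
        System.out.println(lenientDecode(invalidA).equals(lenientDecode(invalidB)));
        System.out.println(lenientDecode(invalidA).equals(lenientDecode(validReplacement)));

        // A strict decoder tells them apart by rejecting the invalid ones.
        System.out.println(isValidUtf8(invalidA));
        System.out.println(isValidUtf8(validReplacement));
    }
}
```

Under lenient decoding, distinct keys such as `0xFF` and `0xFE` would map to the same string, so uniqueness of the upsert key could not be guaranteed after the cast. A strict decoder that rejects malformed input is one possible way to restore injectivity, though the issue does not state which fix is intended.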
--
This message was sent by Atlassian Jira
(v8.20.10#820010)