Hi Gustavo
thanks for the proposal

I noticed that you are proposing usage of UTF8 in names (default cast
to string is also using utf8)
however I wonder if it makes sense to introduce similar utf16 similar
functions as Flink supports this as well?

On Fri, Mar 20, 2026 at 8:09 PM Gustavo de Morais
<[email protected]> wrote:
>
> Hi Xuyang and Timo,
>
> Thanks for the positive feedback! Regarding your suggestions, Xuyang:
>
> 1. Yes, good point - we should add the fromUtf8Bytes(byte[], int, int)
> overload as well.
> 2. This is also relevant. If we want to do validation during ingestion
> time, this might have performance implications. Since these are @Internal
> APIs, they can be changed independently from the FLIP afterwards if it
> makes sense. What's your opinion?
>
> Kind regards,
> Gustavo
>
> On Fri, 20 Mar 2026 at 08:44, Xuyang <[email protected]> wrote:
>
> > Hi, Gustavo.
> > Great catch! Thanks for driving this FLIP. Overall LGTM. I just have two
> > minor points I'd like to confirm with you.
> > 1. Should we also add the overload function `fromUtf8Bytes(byte[], int,
> > int)` in StringData?
> > 2. Callers like `ColumnarRowData#getString` and
> > `ColumnarArrayData#getString`  call `StringData.fromBytes` directly. Should
> > these call sites be migrated in a follow-up, or intentionally left as-is?
> >
> >
> >
> >
> >
> > --
> >
> >     Best!
> >     Xuyang
> >
> >
> >
> > At 2026-03-19 22:37:28, "Timo Walther" <[email protected]> wrote:
> > >Hi Gustavo,
> > >
> > >thank you for this excellent design document. And thanks for discovering
> > >this data loss and driving the investigation. We should definitely fix
> > >this shortcoming. Also looking at other vendors, it is definitly a cause
> > >for false assumptions that lead to hard-to-debug inconsistencies.
> > >
> > >+1 for this proposal.
> > >
> > >Cheers,
> > >Timo
> > >
> > >
> > >On 19.03.26 15:23, Gustavo de Morais wrote:
> > >> Hi everyone,
> > >>
> > >> Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8
> > byte
> > >> with U+FFFD (?). The substitution is irreversible and produces no
> > warning -
> > >> the pipeline keeps running while data is permanently corrupted
> > >> downstream. This also means that a CAST from BYTES → STRING → BYTES is
> > not
> > >> idempotent, which prevents the engine from applying certain
> > optimizations.
> > >> For example, for preserving upsert keys after such CASTs.
> > >>
> > >> I'd like to start a discussion around defining and improving the default
> > >> behavior. I've written a short FLIP [1] proposing new utility functions
> > to
> > >> handle this explicitly - similar to what other engines like Spark
> > already
> > >> do - and changing the default behavior to throw an error instead of
> > >> silently corrupting data, while giving users clear options to deal with
> > >> invalid bytes.
> > >>
> > >> Looking forward to your feedback and thoughts.
> > >>
> > >> Kind regards,
> > >> Gustavo
> > >>
> > >> [1]
> > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
> > >>
> >



-- 
Best regards,
Sergey

Reply via email to