[PR] [hotfix] Use explicit UTF-8 charset in getBytes() calls across connectors [flink-cdc]

via GitHub Wed, 03 Jun 2026 18:51:19 -0700


dangzitou opened a new pull request, #4428:
URL: https://github.com/apache/flink-cdc/pull/4428


   ## What this PR does
   
   Replaces platform-default `String.getBytes()` with 
`String.getBytes(StandardCharsets.UTF_8)` across multiple modules to ensure 
consistent encoding behavior regardless of JVM locale or OS configuration.
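
As a rough illustration of the replacement pattern (not code taken from the PR; the class and method names are hypothetical), the change at each call site has this shape:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, used only to contrast the two call shapes.
final class GetBytesReplacement {

    // Before: the result depends on Charset.defaultCharset() of the running JVM.
    static byte[] before(String value) {
        return value.getBytes();
    }

    // After: the result is always the UTF-8 encoding of the string.
    static byte[] after(String value) {
        return value.getBytes(StandardCharsets.UTF_8);
    }
}
```

`StandardCharsets.UTF_8` is a constant that every Java platform must support, so this overload does not declare a checked `UnsupportedEncodingException`, unlike `getBytes("UTF-8")`.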
   
   ### Affected files
   
   - `SchemaMergingUtils.java` + test — core schema coercion
   - `DebeziumJsonSerializationSchema.java` — Kafka Debezium JSON default value 
handling
   - `RowDataTiKVEventDeserializationSchemaBase.java` — TiDB source connector
   - `BinaryTypeReturningClass.java` / `VarBinaryTypeReturningClass.java` — UDF 
examples
   
   ### Why
   
   `String.getBytes()` without an explicit charset uses the JVM's default 
charset. On JDK 17 and earlier that default is derived from the OS locale, so 
it varies across environments (e.g., `US-ASCII` on some minimal Docker images, 
`GBK` on Chinese Windows); from JDK 18 onward it is UTF-8 unless 
`file.encoding` is overridden. A non-UTF-8 default can silently corrupt 
non-ASCII characters. Using `UTF-8` explicitly makes the behavior 
deterministic.
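
The failure mode can be sketched without changing the JVM's default charset by passing the "writer" charset explicitly. This is a minimal illustration, not code from the PR; `DefaultCharsetPitfall` and `writeThenReadAsUtf8` are hypothetical names, and `US_ASCII` and `ISO_8859_1` stand in for whatever a given environment's default might be:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; simulates a producer whose default charset is not UTF-8.
final class DefaultCharsetPitfall {

    // Encodes with the writer's charset, then decodes as UTF-8,
    // the way a downstream UTF-8 consumer would read the bytes.
    static String writeThenReadAsUtf8(String value, Charset writerCharset) {
        byte[] bytes = value.getBytes(writerCharset);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as an escape so the
        // source file itself does not depend on the compiler's encoding.
        String value = "caf\u00e9";
        Charset[] writers = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset writer : writers) {
            String seen = writeThenReadAsUtf8(value, writer);
            System.out.println(writer + " -> intact: " + seen.equals(value));
        }
    }
}
```

Neither non-UTF-8 writer throws an exception: `getBytes` substitutes a replacement byte for unmappable characters, and the UTF-8 decoder substitutes a replacement character for malformed input, which is why the corruption goes unnoticed.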
   
   ## Testing
   
   These are straightforward defensive changes: each affected call site now 
encodes as UTF-8 regardless of JVM locale.
   
   ---
   
   Split from #4427 as requested by @yuxiqian.


