Re: [PR] [FLINK-39602][table] Add IS_VALID_UTF8 and MAKE_VALID_UTF8 built-in functions [flink]

via GitHub Tue, 05 May 2026 02:58:41 -0700


twalthr commented on code in PR #28111:
URL: https://github.com/apache/flink/pull/28111#discussion_r3187524509



##########
docs/data/sql_functions.yml:
##########
@@ -805,6 +805,22 @@ conversion:
       call("TYPEOF", input)
       call("TYPEOF", input, force_serializable)
     description: Returns the string representation of the input expression's 
data type. By default, the returned string is a summary string that might omit 
certain details for readability. If force_serializable is set to TRUE, the 
string represents a full data type that could be persisted in a catalog. Note 
that especially anonymous, inline data types have no serializable string 
representation. In this case, NULL is returned.
+  - sql: IS_VALID_UTF8(bytes)
+    table: BYTES.isValidUtf8()
+    description: |
+      Returns `TRUE` if the input is well-formed UTF-8, `FALSE` otherwise. 
Specifically rejects: truncated multi-byte sequences (missing continuation 
bytes), "overlong" encodings (using more bytes than necessary for the code 
point), code points above the Unicode maximum U+10FFFF, and UTF-16 surrogate 
values U+D800-U+DFFF (which have no UTF-8 representation). Returns `NULL` if 
the input is `NULL`.
+
+      Useful for routing records with invalid UTF-8 to a dead-letter sink: 
`WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT 
IS_VALID_UTF8(payload)` selects the rejects.
+
+      E.g., `IS_VALID_UTF8(x'48656C6C6F')` returns `TRUE`; 
`IS_VALID_UTF8(x'80')` returns `FALSE`.
+  - sql: MAKE_VALID_UTF8(bytes)
+    table: BYTES.makeValidUtf8()
+    description: |
+      Decodes the input as UTF-8, replacing each invalid sequence with the 
Unicode replacement character `U+FFFD` (rendered as `�`). The substitution is 
lossy and irreversible. Returns `NULL` if the input is `NULL`.
+
+      If you want to explicitly have the behavior of silently substituting 
invalid bytes with `U+FFFD` when doing a `CAST(bytes AS STRING)`, replace the 
cast with `MAKE_VALID_UTF8(bytes)`.

Review Comment:
   I find this paragraph confusing. How about: "MAKE_VALID_UTF8() can fully 
replace a CAST(bytes AS STRING), which would fail on invalid UTF-8."
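
For illustration, the distinction behind this suggestion can be sketched with the JDK alone: strict decoding reports malformed input (the behavior the comment attributes to the cast), while replacing decoding substitutes U+FFFD. The class and method names below are hypothetical, and this is not Flink's implementation; how many bytes the JDK folds into each U+FFFD may differ from what MAKE_VALID_UTF8 does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Flink.
final class Utf8DecodeSketch {

    // Strict decoding: throws CharacterCodingException on malformed input.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Replacing decoding: malformed input is substituted with U+FFFD.
    static String decodeReplacing(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Wrapper that exposes the strict failure as a boolean.
    static boolean strictDecodeFails(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 'H', a stray continuation byte, 'i'
        byte[] bad = {0x48, (byte) 0x80, 0x69};
        System.out.println("strict decode fails: " + strictDecodeFails(bad));
        System.out.println("replaced: " + decodeReplacing(bad));
    }
}
```

The strict path is lossless but can fail; the replacing path never fails but cannot be reversed, which matches the "lossy and irreversible" wording in the description.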



##########
docs/data/sql_functions.yml:
##########
@@ -805,6 +805,22 @@ conversion:
       call("TYPEOF", input)
       call("TYPEOF", input, force_serializable)
     description: Returns the string representation of the input expression's 
data type. By default, the returned string is a summary string that might omit 
certain details for readability. If force_serializable is set to TRUE, the 
string represents a full data type that could be persisted in a catalog. Note 
that especially anonymous, inline data types have no serializable string 
representation. In this case, NULL is returned.
+  - sql: IS_VALID_UTF8(bytes)
+    table: BYTES.isValidUtf8()
+    description: |
+      Returns `TRUE` if the input is well-formed UTF-8, `FALSE` otherwise. 
Specifically rejects: truncated multi-byte sequences (missing continuation 
bytes), "overlong" encodings (using more bytes than necessary for the code 
point), code points above the Unicode maximum U+10FFFF, and UTF-16 surrogate 
values U+D800-U+DFFF (which have no UTF-8 representation). Returns `NULL` if 
the input is `NULL`.
+
+      Useful for routing records with invalid UTF-8 to a dead-letter sink: 
`WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT 
IS_VALID_UTF8(payload)` selects the rejects.

Review Comment:
   ```suggestion
         Useful for filtering records with invalid UTF-8: `WHERE 
IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT IS_VALID_UTF8(payload)` 
selects the rejects.
   ```
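
As a rough sketch of the filtering semantics described here, the following JDK-only example classifies byte arrays the way the proposed description says IS_VALID_UTF8 should. The class and method names are hypothetical, and it assumes the JDK's strict UTF-8 decoder agrees with the function on the listed rejection cases; it is not the PR's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Flink.
final class Utf8FilterSketch {

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Keeps only the payloads that pass (wantValid = true) or fail (false) the check.
    static List<byte[]> filter(List<byte[]> payloads, boolean wantValid) {
        List<byte[]> result = new ArrayList<>();
        for (byte[] payload : payloads) {
            if (isValidUtf8(payload) == wantValid) {
                result.add(payload);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<byte[]> payloads = new ArrayList<>();
        payloads.add(new byte[] {0x48, 0x65, 0x6C, 0x6C, 0x6F}); // "Hello"
        payloads.add(new byte[] {(byte) 0x80});                  // stray continuation byte
        System.out.println("clean rows: " + filter(payloads, true).size());
        System.out.println("rejects: " + filter(payloads, false).size());
    }
}
```

The two `filter` calls mirror `WHERE IS_VALID_UTF8(payload)` and `WHERE NOT IS_VALID_UTF8(payload)`; NULL handling is left out of the sketch.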



##########
flink-table/flink-table-api-java/src/main/java/org/apache/flink/table/api/internal/BaseExpressions.java:
##########
@@ -1493,6 +1495,28 @@ public OutType inetNtoa() {
         return toApiSpecificExpression(unresolvedCall(INET_NTOA, toExpr()));
     }
 
+    /**
+     * Returns {@code true} if the input bytes form a well-formed UTF-8 
sequence, {@code false}

Review Comment:
   ```suggestion
        * Returns {@code true} if the input bytes are a well-formed UTF-8 
sequence, {@code false}
   ```



##########
docs/data/sql_functions.yml:
##########
@@ -805,6 +805,22 @@ conversion:
       call("TYPEOF", input)
       call("TYPEOF", input, force_serializable)
     description: Returns the string representation of the input expression's 
data type. By default, the returned string is a summary string that might omit 
certain details for readability. If force_serializable is set to TRUE, the 
string represents a full data type that could be persisted in a catalog. Note 
that especially anonymous, inline data types have no serializable string 
representation. In this case, NULL is returned.
+  - sql: IS_VALID_UTF8(bytes)
+    table: BYTES.isValidUtf8()
+    description: |
+      Returns `TRUE` if the input is well-formed UTF-8, `FALSE` otherwise. 
Specifically rejects: truncated multi-byte sequences (missing continuation 
bytes), "overlong" encodings (using more bytes than necessary for the code 
point), code points above the Unicode maximum U+10FFFF, and UTF-16 surrogate 
values U+D800-U+DFFF (which have no UTF-8 representation). Returns `NULL` if 
the input is `NULL`.
+
+      Useful for routing records with invalid UTF-8 to a dead-letter sink: 
`WHERE IS_VALID_UTF8(payload)` keeps clean rows; `WHERE NOT 
IS_VALID_UTF8(payload)` selects the rejects.

Review Comment:
   Let's not advertise features that don't exist.


