Re: [PR] [FLINK-39601][table] Add UTF-8 validation utilities and StringData.fromUtf8Bytes connector API [flink]

via GitHub Tue, 05 May 2026 04:31:16 -0700


twalthr commented on code in PR #28110:
URL: https://github.com/apache/flink/pull/28110#discussion_r3188051058



##########
flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/binary/StringUtf8Utils.java:
##########
@@ -131,6 +132,139 @@ public static String decodeUTF8(byte[] input, int offset, int byteLen) {
         return new String(chars, 0, len);
     }
 
+    // Bit-pattern predicates for UTF-8 byte categorization. The JIT inlines these so they cost
+    // nothing at runtime, but they make {@link #firstInvalidUtf8ByteIndex} read like prose.
+    private static boolean isAsciiByte(int b) {
+        return b >= 0;
+    }
+
+    private static boolean is2ByteLead(int b) {
+        // 110xxxxx; (b & 0x1e) != 0 rejects the overlong leads 0xC0 and 0xC1
+        return (b >> 5) == -2 && (b & 0x1e) != 0;
+    }
+
+    private static boolean is3ByteLead(int b) {
+        return (b >> 4) == -2; // 1110xxxx
+    }
+
+    private static boolean is4ByteLead(int b) {
+        return (b >> 3) == -2; // 11110xxx
+    }
+
+    private static boolean isContinuation(int b) {
+        return (b & 0xc0) == 0x80; // 10xxxxxx
+    }
+
+    private static boolean isOverlong3(int b1, int b2) {
+        // 0xE0 followed by 0x80-0x9F encodes a code point already representable in 2 bytes
+        return b1 == (byte) 0xe0 && (b2 & 0xe0) == 0x80;
+    }
+
+    private static char decode3ByteSequence(int b1, int b2, int b3) {
+        return (char)
+                ((b1 << 12)
+                        ^ (b2 << 6)
+                        ^ (b3 ^ (((byte) 0xE0 << 12) ^ ((byte) 0x80 << 6) ^ ((byte) 0x80))));
+    }
+
+    private static int decode4ByteSequence(int b1, int b2, int b3, int b4) {
+        return (b1 << 18)
+                ^ (b2 << 12)
+                ^ (b3 << 6)
+                ^ (b4
+                        ^ (((byte) 0xF0 << 18)
+                                ^ ((byte) 0x80 << 12)
+                                ^ ((byte) 0x80 << 6)
+                                ^ ((byte) 0x80)));
+    }
+
+    /**
+     * Returns the absolute index (into {@code bytes}) of the first byte that breaks UTF-8
+     * well-formedness, or {@code -1} if the range is valid. For a truncated trailing sequence the
+     * returned index is {@code offset + numBytes} (one past the last byte) since the failure is the
+     * absence of an expected continuation byte. Same byte-level checks as {@link
+     * #decodeUTF8Strict(byte[], int, int, char[])} but without the char-buffer write side effect.
+     *
+     * @throws NullPointerException if {@code bytes} is null
+     * @throws IllegalArgumentException if the offset/length range is out-of-bounds
+     */
+    public static int firstInvalidUtf8ByteIndex(
+            final byte[] bytes, final int offset, final int numBytes) {
+        Preconditions.checkNotNull(bytes, "bytes must not be null");
+        Preconditions.checkArgument(offset >= 0, "offset must be >= 0, was %s", offset);
+        Preconditions.checkArgument(numBytes >= 0, "numBytes must be >= 0, was %s", numBytes);
+        Preconditions.checkArgument(
+                offset <= bytes.length - numBytes,
+                "offset (%s) + numBytes (%s) exceeds array length (%s)",
+                offset,
+                numBytes,
+                bytes.length);
+

Review Comment:
   From my point of view, we can skip those checks for performance.
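
To make the trade-off concrete, here is a minimal, self-contained sketch of what the range checks in the diff guard against. The class `RangeCheckSketch` and its methods are hypothetical illustrations, not part of the PR or of Flink. It assumes only what the diff shows: the condition `offset <= bytes.length - numBytes`, evaluated after both values are known to be non-negative. The naive variant and the unchecked scan are included only for comparison.

```java
public final class RangeCheckSketch {

    /** Overflow-safe range test, same shape as the condition in the diff. */
    public static boolean inRange(byte[] bytes, int offset, int numBytes) {
        return offset >= 0 && numBytes >= 0 && offset <= bytes.length - numBytes;
    }

    /** Naive variant for comparison: offset + numBytes can overflow int and wrap negative. */
    public static boolean inRangeNaive(byte[] bytes, int offset, int numBytes) {
        return offset >= 0 && numBytes >= 0 && offset + numBytes <= bytes.length;
    }

    /**
     * Hypothetical scan without any precondition checks: counts bytes with the high bit clear.
     * A bad range is only caught by the JVM's own array bounds check.
     */
    public static int countAsciiUnchecked(byte[] bytes, int offset, int numBytes) {
        int count = 0;
        for (int i = offset; i < offset + numBytes; i++) {
            if (bytes[i] >= 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = {'a', 'b', (byte) 0xC3, (byte) 0xA9}; // "abé" in UTF-8

        System.out.println(inRange(data, 1, 3)); // true: bytes 1..3 exist
        System.out.println(inRange(data, 2, 3)); // false: would read index 4
        System.out.println(inRange(data, 1, Integer.MAX_VALUE)); // false
        System.out.println(inRangeNaive(data, 1, Integer.MAX_VALUE)); // true: the sum wrapped

        try {
            countAsciiUnchecked(data, 2, 3);
        } catch (ArrayIndexOutOfBoundsException e) {
            // Without the explicit checks the caller sees this exception type
            // instead of the IllegalArgumentException documented in the Javadoc.
            System.out.println("unchecked scan failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Skipping the checks does not make an out-of-range read possible in Java, since array accesses are still bounds-checked. It changes the exception type and message the caller sees, so the `@throws` clauses in the Javadoc would need to change with it.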

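For readers following the diff, the bit-pattern predicates can be exercised in isolation. The sketch below copies the predicate bodies from the diff into a hypothetical class `Utf8LeadByteSketch`; the `classify` helper is an illustration only and does not appear in the PR. Note that the shift-based tests rely on the argument being a sign-extended `byte`, so passing an unsigned value such as `0xC3` (195) would not match.

```java
public final class Utf8LeadByteSketch {

    // Predicate bodies copied from the diff under review.
    public static boolean isAsciiByte(int b) {
        return b >= 0;
    }

    public static boolean is2ByteLead(int b) {
        return (b >> 5) == -2 && (b & 0x1e) != 0; // 110xxxxx, minus 0xC0 and 0xC1
    }

    public static boolean is3ByteLead(int b) {
        return (b >> 4) == -2; // 1110xxxx
    }

    public static boolean is4ByteLead(int b) {
        return (b >> 3) == -2; // 11110xxx
    }

    public static boolean isContinuation(int b) {
        return (b & 0xc0) == 0x80; // 10xxxxxx
    }

    /** Hypothetical helper (not in the PR): names the category of a single byte. */
    public static String classify(byte b) {
        if (isAsciiByte(b)) return "ascii";
        if (is2ByteLead(b)) return "lead2";
        if (is3ByteLead(b)) return "lead3";
        if (is4ByteLead(b)) return "lead4";
        if (isContinuation(b)) return "continuation";
        return "invalid";
    }

    public static void main(String[] args) {
        byte[] samples = {
            'a', (byte) 0xC3, (byte) 0xA9, (byte) 0xE2, (byte) 0xF0, (byte) 0xC0, (byte) 0xC1
        };
        for (byte b : samples) {
            // Mask with 0xff only for display; the predicates see the signed value.
            System.out.printf("0x%02X -> %s%n", b & 0xff, classify(b));
        }
    }
}
```

The overlong leads `0xC0` and `0xC1` fall through every predicate, which is the behaviour the comment on `is2ByteLead` describes.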


