twalthr commented on code in PR #28110:
URL: https://github.com/apache/flink/pull/28110#discussion_r3187016081
##########
flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/binary/StringUtf8Utils.java:
##########
@@ -131,6 +132,139 @@ public static String decodeUTF8(byte[] input, int offset,
int byteLen) {
return new String(chars, 0, len);
}
+ // Bit-pattern predicates for UTF-8 byte categorization. The JIT inlines
these so they cost
+ // nothing at runtime, but they make {@link #firstInvalidUtf8ByteIndex}
read like prose.
+ private static boolean isAsciiByte(int b) {
+ return b >= 0;
+ }
+
+ private static boolean is2ByteLead(int b) {
+ // 110xxxxx; (b & 0x1e) != 0 rejects the overlong leads 0xC0 and 0xC1
+ return (b >> 5) == -2 && (b & 0x1e) != 0;
+ }
+
+ private static boolean is3ByteLead(int b) {
+ return (b >> 4) == -2; // 1110xxxx
+ }
+
+ private static boolean is4ByteLead(int b) {
+ return (b >> 3) == -2; // 11110xxx
+ }
+
+ private static boolean isContinuation(int b) {
+ return (b & 0xc0) == 0x80; // 10xxxxxx
+ }
+
+ private static boolean isOverlong3(int b1, int b2) {
+ // 0xE0 followed by 0x80-0x9F encodes a code point already
representable in 2 bytes
+ return b1 == (byte) 0xe0 && (b2 & 0xe0) == 0x80;
+ }
+
+ private static char decode3ByteSequence(int b1, int b2, int b3) {
+ return (char)
+ ((b1 << 12)
+ ^ (b2 << 6)
+ ^ (b3 ^ (((byte) 0xE0 << 12) ^ ((byte) 0x80 << 6) ^
((byte) 0x80))));
+ }
+
+ private static int decode4ByteSequence(int b1, int b2, int b3, int b4) {
+ return (b1 << 18)
+ ^ (b2 << 12)
+ ^ (b3 << 6)
+ ^ (b4
+ ^ (((byte) 0xF0 << 18)
+ ^ ((byte) 0x80 << 12)
+ ^ ((byte) 0x80 << 6)
+ ^ ((byte) 0x80)));
+ }
+
+ /**
+ * Returns the absolute index (into {@code bytes}) of the first byte that
breaks UTF-8
+ * well-formedness, or {@code -1} if the range is valid. For a truncated
trailing sequence the
+ * returned index is {@code offset + numBytes} (one past the last byte)
since the failure is the
+ * absence of an expected continuation byte. Same byte-level checks as
{@link
+ * #decodeUTF8Strict(byte[], int, int, char[])} but without the
char-buffer write side effect.
+ *
+ * @throws NullPointerException if {@code bytes} is null
+ * @throws IllegalArgumentException if the offset/length range is
out-of-bounds
+ */
+ public static int firstInvalidUtf8ByteIndex(
Review Comment:
Is this copied from a reliable source? How can we ensure this is correct?
##########
flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/binary/BinaryStringData.java:
##########
@@ -81,20 +81,59 @@ public static BinaryStringData fromString(String str) {
}
}
- /** Creates a {@link BinaryStringData} instance from the given UTF-8
bytes. */
+ /**
+ * Creates a {@link BinaryStringData} instance by wrapping the given UTF-8
bytes in O(1) without
+ * copying or validating them. The caller is responsible for ensuring the
bytes are well-formed
+ * UTF-8; use {@link #fromUtf8Bytes(byte[])} if validation is required.
+ */
public static BinaryStringData fromBytes(byte[] bytes) {
return fromBytes(bytes, 0, bytes.length);
}
/**
- * Creates a {@link BinaryStringData} instance from the given UTF-8 bytes
with offset and number
- * of bytes.
+ * Creates a {@link BinaryStringData} instance by wrapping the given UTF-8
byte range in O(1)
+ * without copying or validating it. The caller is responsible for
ensuring the bytes are
+ * well-formed UTF-8; use {@link #fromUtf8Bytes(byte[], int, int)} if
validation is required.
*/
public static BinaryStringData fromBytes(byte[] bytes, int offset, int
numBytes) {
return new BinaryStringData(
new MemorySegment[] {MemorySegmentFactory.wrap(bytes)},
offset, numBytes);
}
+ /**
+ * Creates a {@link BinaryStringData} instance from the given UTF-8 bytes,
walking the input
+ * once in O(n) to verify it is well-formed UTF-8. Returns {@code null} if
the input is {@code
+ * null}. Throws {@link IllegalArgumentException} on invalid UTF-8. Use
{@link
Review Comment:
```suggestion
* null}. Throws {@link TableRuntimeException} on invalid UTF-8. Use
{@link
```
##########
flink-table/flink-table-common/src/test/java/org/apache/flink/table/data/binary/BinaryStringDataFromUtf8BytesTest.java:
##########
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.table.data.binary;
+
+import org.apache.flink.table.data.StringData;
+
+import org.junit.jupiter.api.Test;
+
+import java.nio.charset.StandardCharsets;
+
+import static org.assertj.core.api.Assertions.assertThat;
+import static org.assertj.core.api.Assertions.assertThatThrownBy;
+
+/**
+ * Tests for {@link BinaryStringData#fromUtf8Bytes(byte[])} and {@link
StringData#fromUtf8Bytes}.
+ */
+class BinaryStringDataFromUtf8BytesTest {
+
+ @Test
+ void testFromUtf8Bytes() {
+ // Valid input: ASCII, multi-byte, empty, and the offset/length
variant all yield the
+ // same instance you would get from the unchecked fromBytes /
fromString factory.
+ final byte[] hello = "Hello, world!".getBytes(StandardCharsets.UTF_8);
+ assertThat(BinaryStringData.fromUtf8Bytes(hello))
+ .isEqualTo(BinaryStringData.fromBytes(hello));
+
+ final byte[] multibyte = "é€😀".getBytes(StandardCharsets.UTF_8);
Review Comment:
we don't need to duplicate tests here. Having good tests for
`StringUtf8UtilsTest` is enough for me. And maybe one little test that tests
the basic error behavior for `BinaryStringData.fromUtf8Bytes`
##########
flink-table/flink-table-common/src/main/java/org/apache/flink/table/data/binary/BinaryStringData.java:
##########
@@ -81,20 +81,59 @@ public static BinaryStringData fromString(String str) {
}
}
- /** Creates a {@link BinaryStringData} instance from the given UTF-8
bytes. */
+ /**
+ * Creates a {@link BinaryStringData} instance by wrapping the given UTF-8
bytes in O(1) without
+ * copying or validating them. The caller is responsible for ensuring the
bytes are well-formed
+ * UTF-8; use {@link #fromUtf8Bytes(byte[])} if validation is required.
Review Comment:
Maybe inform the reader what happens if the contract is violated:
```
Otherwise invalid UTF-8 propagates and is later silently substituted with
U+FFFD.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]