Andrew Hogg created CASSANDRA-19537:
---------------------------------------
             Summary: Unicode Code Points outside of BMP incorrectly sized in protocol response
                 Key: CASSANDRA-19537
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19537
             Project: Cassandra
          Issue Type: Bug
          Components: CQL/Interpreter
            Reporter: Andrew Hogg

Within a query, we sent the character \U0010FFFF, the highest permissible Unicode code point. It is encoded in UTF-8 as 4 bytes and sent. When the query causes a warning in the response, the warning string is written in the protocol as a short holding the length, followed by the string itself. CBUtil.writeString gets the length using the following code:
{code:java}
int length = TypeSizes.encodedUTF8Length(str);
{code}
This in turn computes the length of the string as follows:
{noformat}
public static int encodedUTF8Length(String st)
{
    int strlen = st.length();
    int utflen = 0;
    for (int i = 0; i < strlen; i++)
    {
        int c = st.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F))
            utflen++;
        else if (c > 0x07FF)
            utflen += 3;
        else
            utflen += 2;
    }
    return utflen;
}
{noformat}
The use of st.length() and st.charAt(i) within this function causes the problem: both operate on UTF-16 code units, so the single 4-byte UTF-8 character is seen as two UTF-16 code units (a surrogate pair). Both units are above 0x07FF, so each is counted as 3 bytes, making a total length of 6 bytes instead of 4.
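As an illustration of the calculation described above, here is a minimal sketch of a surrogate-aware variant. It is not the fix that was applied in Cassandra; the class name Utf8LengthSketch and the method name encodedUtf8LengthSurrogateAware are hypothetical. It keeps the original branches and only adds a case that counts a valid high/low surrogate pair as one 4-byte sequence.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the actual Cassandra patch.
final class Utf8LengthSketch
{
    // Same per-unit rules as the reported encodedUTF8Length, plus one extra branch:
    // a high surrogate followed by a low surrogate is one code point, encoded as 4 bytes.
    static int encodedUtf8LengthSurrogateAware(String st)
    {
        int strlen = st.length();
        int utflen = 0;
        for (int i = 0; i < strlen; i++)
        {
            char c = st.charAt(i);
            if (c >= 0x0001 && c <= 0x007F)
            {
                utflen++;
            }
            else if (c <= 0x07FF)
            {
                // As in the original code, U+0000 also lands here and counts as 2 bytes.
                utflen += 2;
            }
            else if (Character.isHighSurrogate(c)
                     && i + 1 < strlen
                     && Character.isLowSurrogate(st.charAt(i + 1)))
            {
                utflen += 4;
                i++; // skip the low surrogate, it is already accounted for
            }
            else
            {
                // Remaining BMP characters. Unpaired surrogates also land here;
                // how an encoder treats those is outside the scope of this sketch.
                utflen += 3;
            }
        }
        return utflen;
    }

    public static void main(String[] args)
    {
        String s = new String(Character.toChars(0x10FFFF));
        System.out.println(s.length());                                // 2 UTF-16 code units
        System.out.println(encodedUtf8LengthSurrogateAware(s));        // 4
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4
    }
}
```

Counting the pair in place avoids allocating a byte array just to measure the string, which is presumably why the original function computes the length by hand rather than calling getBytes.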
This can be reproduced with some test code:
{noformat}
import java.nio.charset.StandardCharsets;

// U+10FFFF encoded as UTF-8
byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
var st = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(st);

// Number of UTF-16 code units, not code points
int strlen = st.length();
System.out.println(strlen);

// Same calculation as encodedUTF8Length
int utflen = 0;
for (int i = 0; i < strlen; i++)
{
    int c = st.charAt(i);
    if ((c >= 0x0001) && (c <= 0x007F))
        utflen++;
    else if (c > 0x07FF)
        utflen += 3;
    else
        utflen += 2;
}
System.out.println(utflen);

// Re-encode to show the bytes that are actually written
byte[] encoded = st.getBytes(StandardCharsets.UTF_8);
for (byte b : encoded)
{
    System.out.print(b & 0xFF);
    System.out.print(" ");
}
{noformat}
st.length() reports the 4-byte UTF-8 character as 2. The two UTF-16 code units have the values 56319 and 57343 respectively, and since each is above 2047 (0x07FF), 3 is added to the length each time.

The response message at the byte level does correctly contain the UTF-8 character as 244 143 191 191, but the incorrect length results in a buffer overread, which offsets all following reads. This produces a few different possible errors, all relating to misalignment between the buffer read position and the value expected at that point in the buffer.
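To make the misalignment concrete, here is a minimal sketch of a length-prefixed string read going wrong. It assumes a simplified framing of [short length][UTF-8 bytes] with a second string following the first. The class FrameMisalignmentSketch and its methods are hypothetical and do not use Cassandra or driver code.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the overread; simplified framing, not real protocol code.
final class FrameMisalignmentSketch
{
    // The per-UTF-16-unit calculation from the report.
    static int utf16BasedLength(String st)
    {
        int utflen = 0;
        for (int i = 0; i < st.length(); i++)
        {
            int c = st.charAt(i);
            if ((c >= 0x0001) && (c <= 0x007F))
                utflen++;
            else if (c > 0x07FF)
                utflen += 3;
            else
                utflen += 2;
        }
        return utflen;
    }

    // Writes the first string with the given declared length, then a second,
    // correctly framed string directly after it.
    static ByteBuffer frame(String first, int declaredLength, String second)
    {
        byte[] a = first.getBytes(StandardCharsets.UTF_8);
        byte[] b = second.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + a.length + 2 + b.length);
        buf.putShort((short) declaredLength).put(a);
        buf.putShort((short) b.length).put(b);
        buf.flip();
        return buf;
    }

    // Reads [short length][bytes], trusting the declared length.
    static String readString(ByteBuffer buf)
    {
        int len = buf.getShort() & 0xFFFF;
        byte[] bytes = new byte[len];
        buf.get(bytes); // throws BufferUnderflowException if len exceeds what is left
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        String warning = new String(Character.toChars(0x10FFFF));
        int declared = utf16BasedLength(warning); // 6, while only 4 bytes are written
        ByteBuffer buf = frame(warning, declared, "ok");

        // The first read swallows the 2-byte length prefix of the next string.
        String first = readString(buf);
        System.out.println(first.length()); // 4 chars: the surrogate pair plus 2 stray bytes

        try
        {
            // The next "length" is now read from the bytes of "ok" itself.
            readString(buf);
        }
        catch (BufferUnderflowException e)
        {
            System.out.println("misaligned read failed");
        }
    }
}
```

In this sketch the failure shows up as a BufferUnderflowException, but depending on what follows the string, a misaligned read could just as well yield a plausible-looking wrong value, which is consistent with the variety of errors described above.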