[ https://issues.apache.org/jira/browse/CASSANDRA-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834900#comment-17834900 ]
Brandon Williams commented on CASSANDRA-19537:
----------------------------------------------

A quick check of the [protocol spec|https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec] only mentions UTF-8, fwiw.

> Unicode Code Points incorrectly sized in protocol response
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-19537
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19537
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL/Interpreter
>            Reporter: Andrew Hogg
>            Priority: Normal
>
> Within a query, we sent a character which is \U0010FFFF, the highest
> permissible Unicode code point. This is encoded in UTF-8 using 4 bytes
> and sent. When the query issues a warning in the response (such as a
> tombstone warning which includes the query sent), the warning string in the
> protocol is specified as a short (the length), followed by the string.
>
> CBUtil.WriteString gets the length using the following code:
> {code:java}
> int length = TypeSizes.encodedUTF8Length(str);{code}
> This in turn gets the length of the string based on a calculation:
> {noformat}
> public static int encodedUTF8Length(String st)
> {
>     int strlen = st.length();
>     int utflen = 0;
>     for (int i = 0; i < strlen; i++)
>     {
>         int c = st.charAt(i);
>         if ((c >= 0x0001) && (c <= 0x007F))
>             utflen++;
>         else if (c > 0x07FF)
>             utflen += 3;
>         else
>             utflen += 2;
>     }
>     return utflen;
> }{noformat}
> The use of st.length() within this function causes the problem: it
> treats the string as UTF-16, so the 4-byte UTF-8 value is seen as two
> UTF-16 code units (a surrogate pair). Both code units are above 0x07FF, so
> each is counted as 3 bytes, making a total length of 6 bytes instead of 4.
>
> Using some test code:
> {noformat}
> import java.nio.charset.StandardCharsets;
>
> // U+10FFFF encoded as UTF-8
> byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
> var st = new String(utf8Bytes, StandardCharsets.UTF_8);
> System.out.println(st);
> int strlen = st.length();
> System.out.println(strlen);
> // Same calculation as encodedUTF8Length
> int utflen = 0;
> for (int i = 0; i < strlen; i++)
> {
>     int c = st.charAt(i);
>     if ((c >= 0x0001) && (c <= 0x007F))
>         utflen++;
>     else if (c > 0x07FF) {
>         utflen += 3;
>     }
>     else
>         utflen += 2;
> }
> System.out.println(utflen);
> // Bytes actually produced when the string is encoded as UTF-8
> byte[] encoded = st.getBytes(StandardCharsets.UTF_8);
> for (byte b : encoded) {
>     System.out.print(b & 0xFF);
>     System.out.print(" ");
> }
> {noformat}
> st.length() reports the 4-byte UTF-8 sequence as 2 UTF-16 code units, with
> values 56319 (0xDBFF) and 57343 (0xDFFF) respectively. Since both are above
> 2047 (0x07FF), the loop adds 3 to the length for each of them.
> The response message at a byte level does correctly return the UTF-8
> character as 244 143 191 191, but the incorrect length results in a buffer
> overread, which offsets the following reads. This produces a few different
> possible errors, all relating to misalignment of the buffer read vs the
> expected value at that point in the buffer.
>
> The issue was found in 4.1 but appears to have existed for a while. It is
> specific to characters outside the BMP, in the higher planes, which UTF-16
> represents as surrogate pairs.
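
For illustration, here is a minimal sketch of the mismatch described in the report next to one possible code-point-aware calculation. The class and method names (Utf8LengthSketch, naiveEncodedUTF8Length, codePointAwareUTF8Length) are hypothetical, and this is not the fix that was applied in Cassandra. It assumes well-formed input (no unpaired surrogates) and standard UTF-8, so U+0000 counts as 1 byte here, whereas the quoted method counts it as 2.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthSketch
{
    /** The calculation quoted in the report: every UTF-16 code unit is sized on its own. */
    static int naiveEncodedUTF8Length(String st)
    {
        int utflen = 0;
        for (int i = 0; i < st.length(); i++)
        {
            int c = st.charAt(i);
            if ((c >= 0x0001) && (c <= 0x007F))
                utflen++;
            else if (c > 0x07FF)
                utflen += 3; // each half of a surrogate pair lands here
            else
                utflen += 2;
        }
        return utflen;
    }

    /**
     * Hypothetical alternative: walk the string by code point, so a surrogate
     * pair is sized once as a 4-byte sequence. Unpaired surrogates are not
     * handled specially (assumption: input is well-formed).
     */
    static int codePointAwareUTF8Length(String st)
    {
        int utflen = 0;
        for (int i = 0; i < st.length(); )
        {
            int cp = st.codePointAt(i);
            if (cp <= 0x007F)
                utflen += 1;
            else if (cp <= 0x07FF)
                utflen += 2;
            else if (cp <= 0xFFFF)
                utflen += 3;
            else
                utflen += 4; // supplementary planes, U+10000..U+10FFFF
            i += Character.charCount(cp); // advance 1 or 2 code units
        }
        return utflen;
    }

    public static void main(String[] args)
    {
        String st = new String(Character.toChars(0x10FFFF));
        System.out.println(naiveEncodedUTF8Length(st));                  // 6
        System.out.println(codePointAwareUTF8Length(st));                // 4
        System.out.println(st.getBytes(StandardCharsets.UTF_8).length);  // 4
    }
}
```

Only the length prefix differs between the two methods; the encoded bytes are the same 4 bytes in both cases, which is why a reader that trusts a prefix of 6 runs 2 bytes into whatever follows the string.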