Andrew Hogg created CASSANDRA-19537:
---------------------------------------

             Summary: Unicode Code Points outside of BMP incorrectly sized in 
protocol response
                 Key: CASSANDRA-19537
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19537
             Project: Cassandra
          Issue Type: Bug
          Components: CQL/Interpreter
            Reporter: Andrew Hogg


Within a query, we have sent the character U+10FFFF - the highest 
permissible Unicode code point. This is encoded in UTF-8 as 4 bytes and 
sent. When the query issues a warning in the response, the warning string in 
the protocol is specified as a short length, followed by the string bytes.
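For context, that framing is a 2-byte length followed by that many bytes of UTF-8. A minimal JShell-style sketch of the layout (illustrative only; variable names are not Cassandra's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

String warning = "example warning";
byte[] utf8 = warning.getBytes(StandardCharsets.UTF_8);

// [string] = unsigned short byte length, then the UTF-8 bytes themselves
ByteBuffer buf = ByteBuffer.allocate(2 + utf8.length);
buf.putShort((short) utf8.length);
buf.put(utf8);
buf.flip();
```

The reader on the other side trusts the declared length, which is why a wrong length computation corrupts everything after it.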

 

CBUtil.writeString gets the length using the following code:
{code:java}
int length = TypeSizes.encodedUTF8Length(str);{code}
This in turn gets the length of the string based on a calculation:
{noformat}
public static int encodedUTF8Length(String st)
{
    int strlen = st.length();
    int utflen = 0;
    for (int i = 0; i < strlen; i++)
    {
        int c = st.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F))
            utflen++;
        else if (c > 0x07FF)
            utflen += 3;
        else
            utflen += 2;
    }
    return utflen;
}{noformat}
The use of st.length() within this function causes problems: it iterates over 
UTF-16 char units, so the 4-byte UTF-8 code point, which is stored as a 
surrogate pair of two chars, is treated as a 2-character UTF-16 value. Both 
surrogates are above 0x07FF and so are counted as 3 bytes each, making a 
total length of 6 bytes instead of 4.
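For comparison, iterating by code point rather than by char counts the supplementary character correctly. This is a sketch of one possible corrected loop (not a patch against the actual method), using Character.charCount to step over surrogate pairs:

```java
String st = new String(Character.toChars(0x10FFFF)); // U+10FFFF as a surrogate pair
int utflen = 0;
for (int i = 0; i < st.length(); )
{
    int c = st.codePointAt(i);   // reads the full code point, not a lone char
    if (c >= 0x0001 && c <= 0x007F)
        utflen += 1;
    else if (c <= 0x07FF)
        utflen += 2;             // also covers c == 0 (modified UTF-8 encodes NUL as 2 bytes)
    else if (c <= 0xFFFF)
        utflen += 3;
    else
        utflen += 4;             // supplementary plane: 4 UTF-8 bytes
    i += Character.charCount(c); // advances by 2 for surrogate pairs
}
// utflen is now 4, matching st.getBytes(StandardCharsets.UTF_8).length
```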
 
Using some test code:
{noformat}
import java.nio.charset.StandardCharsets;

byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
var st = new String(utf8Bytes, StandardCharsets.UTF_8);
System.out.println(st);

int strlen = st.length();
System.out.println(strlen); // prints 2

int utflen = 0;
for (int i = 0; i < strlen; i++)
{
  int c = st.charAt(i);
  if ((c >= 0x0001) && (c <= 0x007F))
    utflen++;
  else if (c > 0x07FF)
    utflen += 3;
  else
    utflen += 2;
}
System.out.println(utflen); // prints 6

// re-encode and dump the bytes (renamed to avoid redeclaring utf8Bytes)
byte[] roundTripped = st.getBytes(StandardCharsets.UTF_8);
for (byte b : roundTripped)
  System.out.print((b & 0xFF) + " "); // prints 244 143 191 191
{noformat}
The 4-byte UTF-8 sequence is seen by st.length() as 2 chars; st.charAt() then 
returns the surrogate values 56319 (0xDBFF) and 57343 (0xDFFF), and since each 
is above 2047 (0x07FF), 3 is added to the length each time.

The response message at the byte level does correctly contain the UTF-8 
character as 244 143 191 191, but the incorrect length results in a buffer 
overread, which offsets all following reads. This produces a few different 
possible errors, all relating to misalignment between the buffer read 
position and the expected value at that point in the buffer.
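The overread can be reproduced in isolation: declare a length of 6 for a 4-byte payload and the reader swallows 2 bytes of the next field. A self-contained sketch with a hypothetical buffer layout (not the actual response frame):

```java
import java.nio.ByteBuffer;

byte[] payload = {(byte)244, (byte)143, (byte)191, (byte)191}; // U+10FFFF in UTF-8

ByteBuffer buf = ByteBuffer.allocate(2 + payload.length + 2);
buf.putShort((short) 6);       // the buggy length: 3 + 3 for the surrogate pair
buf.put(payload);              // only 4 bytes actually written
buf.putShort((short) 0x1234);  // the next field in the stream
buf.flip();

// a reader trusting the declared length overreads into the next field
int declared = Short.toUnsignedInt(buf.getShort());
byte[] read = new byte[declared];
buf.get(read);                 // consumes 6 bytes: the string plus 2 of the short
int nextField = buf.remaining() >= 2 ? Short.toUnsignedInt(buf.getShort()) : -1;
// nextField is now -1: the 2 bytes of 0x1234 were swallowed by the string read
```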



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
