[ 
https://issues.apache.org/jira/browse/CASSANDRA-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834900#comment-17834900
 ] 

Brandon Williams commented on CASSANDRA-19537:
----------------------------------------------

A quick check of the [protocol 
spec|https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v5.spec]
 only mentions UTF-8, fwiw.

> Unicode Code Points incorrectly sized in protocol response
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-19537
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19537
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL/Interpreter
>            Reporter: Andrew Hogg
>            Priority: Normal
>
> Within a query, we sent the character \U0010FFFF, the highest 
> permissible Unicode code point. It is encoded in UTF-8 as 4 bytes 
> and sent. When the query triggers a warning in the response (such as a 
> tombstone warning, which includes the query text), the warning string in the 
> protocol is encoded as a short length followed by the string bytes.
>  
> CBUtil.writeString gets the length using the following code:
> {code:java}
> int length = TypeSizes.encodedUTF8Length(str);{code}
> This in turn gets the length of the string based on a calculation:
> {noformat}
> public static int encodedUTF8Length(String st)
> {
>     int strlen = st.length();
>     int utflen = 0;
>     for (int i = 0; i < strlen; i++)
>     {
>         int c = st.charAt(i);
>         if ((c >= 0x0001) && (c <= 0x007F))
>             utflen++;
>         else if (c > 0x07FF)
>             utflen += 3;
>         else
>             utflen += 2;
>     }
>     return utflen;
> }{noformat}
> The use of st.length() within this function causes the problem: it 
> counts UTF-16 code units, so the 4-byte UTF-8 character is seen as 2 
> UTF-16 code units (a surrogate pair). Both are above 0x07FF, so each is 
> counted as 3 bytes, making a total length of 6 bytes instead of 4.
>  
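To illustrate the miscount described above, here is a minimal sketch of a length calculation that treats a surrogate pair as one 4-byte sequence. It is not the Cassandra patch: the class name `Utf8LengthSketch` is hypothetical, and the sketch assumes well-formed strings, making no attempt to match how any particular encoder handles unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Cassandra.
final class Utf8LengthSketch {

    // Returns the number of bytes the string occupies in standard UTF-8,
    // assuming every surrogate in the input is part of a valid pair.
    static int encodedUTF8Length(String st) {
        int utflen = 0;
        for (int i = 0; i < st.length(); i++) {
            char c = st.charAt(i);
            if (c <= 0x007F) {
                // Note: counts U+0000 as 1 byte (standard UTF-8);
                // the quoted method counts it as 2.
                utflen += 1;
            } else if (c <= 0x07FF) {
                utflen += 2;
            } else if (Character.isHighSurrogate(c)
                       && i + 1 < st.length()
                       && Character.isLowSurrogate(st.charAt(i + 1))) {
                // A surrogate pair is one supplementary code point: 4 bytes.
                utflen += 4;
                i++; // skip the low surrogate
            } else {
                // Remaining BMP characters (unpaired surrogates also land here).
                utflen += 3;
            }
        }
        return utflen;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10FFFF));
        // Compare the computed length with what the JDK encoder produces.
        System.out.println(encodedUTF8Length(s));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For well-formed input, the result can be cross-checked against `st.getBytes(StandardCharsets.UTF_8).length`, which is slower but uses the JDK's own encoder.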
> Using some test code:
> {noformat}
> import java.nio.charset.StandardCharsets;
> byte[] utf8Bytes = {(byte)244, (byte)143, (byte)191, (byte)191};
> var st = new String(utf8Bytes, StandardCharsets.UTF_8);
> System.out.println(st);
> int strlen = st.length();
> System.out.println(strlen);
> int utflen = 0;
> for (int i = 0; i < strlen; i++)
> {
>   int c = st.charAt(i);
>   if ((c >= 0x0001) && (c <= 0x007F))
>     utflen++;
>   else if (c > 0x07FF) {
>     utflen += 3;
>   }
>   else
>     utflen += 2;
> }
> System.out.println(utflen);
> byte[] encoded = st.getBytes(StandardCharsets.UTF_8);
> for (byte b : encoded) {
>   System.out.print(b & 0xFF);
>   System.out.print(" ");
> }
> {noformat}
> The 4-byte UTF-8 character is seen by st.length() as 2 chars, whose UTF-16 
> values are 56319 (0xDBFF) and 57343 (0xDFFF) respectively. Since both are 
> above 2047 (0x07FF), 3 is added to the length each time.
> The response message at the byte level does correctly contain the UTF-8 
> character as 244 143 191 191, but the incorrect length results in a buffer 
> over-read, which offsets the following reads. This produces a few different 
> possible errors, all relating to misalignment of the buffer read vs the 
> expected value at that point in the buffer.
>  
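The misalignment described above can be reproduced in isolation. The sketch below is a simplified model and not the native protocol framing: it writes a short length prefix, the UTF-8 bytes, and then two trailing ints, and reads them back the way a client trusting the prefix would. The class `LengthPrefixSketch`, its methods, and the marker value are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical model of a length-prefixed string followed by more fields.
final class LengthPrefixSketch {

    // Writes [short declaredLength][UTF-8 bytes][int marker][int 0].
    // declaredLength is passed in separately so a wrong value can be simulated.
    static ByteBuffer frame(String s, int declaredLength, int marker) {
        byte[] body = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + body.length + 4 + 4);
        buf.putShort((short) declaredLength);
        buf.put(body);
        buf.putInt(marker);
        buf.putInt(0); // trailing field so an over-read stays inside the buffer
        buf.flip();
        return buf;
    }

    // Skips the string using the declared length, then reads the next int.
    static int readMarkerAfterString(ByteBuffer buf) {
        int n = buf.getShort() & 0xFFFF;
        buf.position(buf.position() + n);
        return buf.getInt();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10FFFF)); // 4 bytes in UTF-8
        int marker = 0x01020304;
        // Correct prefix: the reader lands exactly on the marker.
        System.out.println(Integer.toHexString(readMarkerAfterString(frame(s, 4, marker))));
        // Prefix of 6: the reader skips 2 bytes too many and reads a shifted value.
        System.out.println(Integer.toHexString(readMarkerAfterString(frame(s, 6, marker))));
    }
}
```

With the correct prefix the reader recovers the marker; with a prefix of 6 it starts two bytes into the marker and returns a value that straddles two fields, which is the kind of downstream error the report describes.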
> Issue specifically found in 4.1, but appears to have existed for a while. 
> It is specifically due to characters outside the Basic Multilingual Plane 
> (BMP), in the supplementary planes, which UTF-16 represents as surrogate pairs.


