[ 
https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975536#comment-16975536
 ] 

Yifan Cai commented on CASSANDRA-15410:
---------------------------------------

Good idea! I have updated the PR accordingly. The benchmark result indicates 
the new {{CBUtil.writeAsciiString}} method has a similar performance. 


{code:java}
[java] Benchmark                                               Mode  Cnt    
Score    Error  Units
[java] StringsEncodeBench.writeLongText                        avgt    6  
570.920 ± 21.376  ns/op
[java] StringsEncodeBench.writeLongTextAsASCII                 avgt    6  
291.466 ±  9.508  ns/op
[java] StringsEncodeBench.writeLongTextWithExactSize           avgt    6  
467.222 ± 25.140  ns/op
[java] StringsEncodeBench.writeLongTextWithExactSizeSkipCalc   avgt    6  
285.320 ± 10.883  ns/op
[java] StringsEncodeBench.writeShortText                       avgt    6   
62.076 ±  2.107  ns/op
[java] StringsEncodeBench.writeShortTextAsASCII                avgt    6   
32.121 ±  0.403  ns/op
[java] StringsEncodeBench.writeShortTextWithExactSize          avgt    6   
41.929 ±  1.783  ns/op
[java] StringsEncodeBench.writeShortTextWithExactSizeSkipCalc  avgt    6   
34.638 ±  0.455  ns/op
{code}

Changes included in the latest commit:
* added {{CBUtil.writeAsciiString}}.
* updated the code to call the new method whenever applicable, i.e. the string 
is certain to only contain char from (0, 127].
* added unit test cases for the new method.

One thing to note is that {{ByteBufUtil#writeAscii()}} allows the extend 8-bits 
ASCII character, meanwhile, cassandra's ASCII data type follows the US-ASCII, 
which is 7-bits. If the input string contains any character with value > 
{{0x007F}}, {{ByteBufUtil#writeAscii()}} is going to encode it as 1 byte. 
However, cassandra's encoding considers it as 2 bytes. It leads to errors when 
reading the string. 
Client that calls {{CBUtil.writeAsciiString}} needs to be certain that the data 
type is ascii. 
I tried to add the validation step to check if the characters are US-ASCII 
compatible before writing, and add a boolean parameter to skip validation for 
the cases that I am sure the string is compatible. It ends up with all 
references of the method have the parameter set to skip, which looks slightly 
messy. 
I think for the places we want to use {{writeAsciiString}} over 
{{writeString}}, we already know the encoding of the string. So the parameter 
was omitted.

Last but not the least, the root issue is that the {{ecodeSize}} is not a 
strong constraint when encoding. The encoding method cannot safely rely on the 
information. One improvement could be automate the calculation of the message 
size. Right now, the size was calculated manually by adding up the size of each 
field and *it is really error-prone*. Surely, it is outside the scope of this 
ticket. 

> Avoid over-allocation of bytes for UTF8 string serialization 
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-15410
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Yifan Cai
>            Assignee: Yifan Cai
>            Priority: Normal
>             Fix For: 4.0
>
>
> In the current message encoding implementation, it first calculates the 
> `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case of writing UTF8 string to 
> allocate bytes, i.e. assuming each letter takes 3 bytes. 
> The over-estimation further leads to resizing the underlying array and data 
> copy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to