[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975536#comment-16975536 ]
Yifan Cai commented on CASSANDRA-15410: --------------------------------------- Good idea! I have updated the PR accordingly. The benchmark result indicates the new {{CBUtil.writeAsciiString}} method has a similar performance. {code:java} [java] Benchmark Mode Cnt Score Error Units [java] StringsEncodeBench.writeLongText avgt 6 570.920 ± 21.376 ns/op [java] StringsEncodeBench.writeLongTextAsASCII avgt 6 291.466 ± 9.508 ns/op [java] StringsEncodeBench.writeLongTextWithExactSize avgt 6 467.222 ± 25.140 ns/op [java] StringsEncodeBench.writeLongTextWithExactSizeSkipCalc avgt 6 285.320 ± 10.883 ns/op [java] StringsEncodeBench.writeShortText avgt 6 62.076 ± 2.107 ns/op [java] StringsEncodeBench.writeShortTextAsASCII avgt 6 32.121 ± 0.403 ns/op [java] StringsEncodeBench.writeShortTextWithExactSize avgt 6 41.929 ± 1.783 ns/op [java] StringsEncodeBench.writeShortTextWithExactSizeSkipCalc avgt 6 34.638 ± 0.455 ns/op {code} Changes included in the latest commit: * added {{CBUtil.writeAsciiString}}. * updated the code to call the new method whenever applicable, i.e. the string is certain to only contain char from (0, 127]. * added unit test cases for the new method. One thing to note is that {{ByteBufUtil#writeAscii()}} allows the extend 8-bits ASCII character, meanwhile, cassandra's ASCII data type follows the US-ASCII, which is 7-bits. If the input string contains any character with value > {{0x007F}}, {{ByteBufUtil#writeAscii()}} is going to encode it as 1 byte. However, cassandra's encoding considers it as 2 bytes. It leads to errors when reading the string. Client that calls {{CBUtil.writeAsciiString}} needs to be certain that the data type is ascii. I tried to add the validation step to check if the characters are US-ASCII compatible before writing, and add a boolean parameter to skip validation for the cases that I am sure the string is compatible. It ends up with all references of the method have the parameter set to skip, which looks slightly messy. I think for the places we want to use {{writeAsciiString}} over {{writeString}}, we already know the encoding of the string. So the parameter was omitted. Last but not the least, the root issue is that the {{ecodeSize}} is not a strong constraint when encoding. The encoding method cannot safely rely on the information. One improvement could be automate the calculation of the message size. Right now, the size was calculated manually by adding up the size of each field and *it is really error-prone*. Surely, it is outside the scope of this ticket. > Avoid over-allocation of bytes for UTF8 string serialization > ------------------------------------------------------------- > > Key: CASSANDRA-15410 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15410 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client > Reporter: Yifan Cai > Assignee: Yifan Cai > Priority: Normal > Fix For: 4.0 > > > In the current message encoding implementation, it first calculates the > `encodeSize` and allocates the bytebuffer with that size. > However, during encoding, it assumes the worst case of writing UTF8 string to > allocate bytes, i.e. assuming each letter takes 3 bytes. > The over-estimation further leads to resizing the underlying array and data > copy. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org