[ https://issues.apache.org/jira/browse/CASSANDRA-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570888#comment-17570888 ]
Claude Warren edited comment on CASSANDRA-8959 at 7/25/22 1:11 PM: ------------------------------------------------------------------- It has been awhile since this was looked at and I am just getting familiar with the code base. But reading this issue and looking at UserTypeSerializer it looks like the issue is that the UserType (and other collections?) can have `null` entries and the issue is that for sparse collections the serialisation is not efficient. I think that there are 3 reasonable serialisation techniques: # The current serialisation for non-sparse values # A bitmap where each bit represents the presence/absence of a specific indices. # A list of integers specifying the presence of specific indices. The bitmap could be structured as # int length # byte[length] # Standard serialised data elements The array of integers could be structured as # int length # int[length] # Standard serialised data elements In the current serialisation the first item is an integer count of the number of items. Since there can not be a negative count we could set a flag as the left most bit. so if the number is negative then the first byte (b) is an encoding descriptor where {{if (0x80 & b) == 0 then [standard processing]}} {{if b == 0x80 then [bitmap processing]}} {{if b == 0xC0 then [array processing]}} {{If we limit the number of items in the structures to 2^30-1 (1,073,741,823) then we can calculate the length for the bitmap and array as int length = access.getInt( value ) & 0x7FFFFFFF }} If n is the number of values in the collection and p is the number of values that are non-null (populated) then using array encoding is more efficient if n/32 > p was (Author: claudenw): It has been awhile since this was looked at and I am just getting familiar with the code base. But reading this issue and looking at UserTypeSerializer it looks like the issue is that the UserType (and other collections?) can have `null` entries and the issue is that for sparse collections the serialisation is not efficient. I think that there are 3 reasonable serialisation techniques: # The current serialisation for non-sparse values # A bitmap where each bit represents the presence/absence of a specific indices. # A list of integers specifying the presence of specific indices. The bitmap could be structured as # int length # byte[length] # Standard serialised data elements The array of integers could be structured as # int length # int[length] # Standard serialised data elements In the current serialisation the first item is an integer count of the number of items. Since there can not be a negative count we could set a flag as the left most bit. so if the number is negative then the first byte (b) is an encoding descriptor where {{if (0x80 & b) == 0 then [standard processing]}} {{if b == 0x80 then [bitmap processing]}} {{if b == 0xC0 then [array processing]}} {{If we limit the number of items in the structures to 2^30-1 (1,073,741,823) then we can calculate the length for the bitmap and array as int length = access.getInt( value ) & 0x7FFFFFFF }} > More efficient frozen UDT, tuple and collection serialization format > -------------------------------------------------------------------- > > Key: CASSANDRA-8959 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8959 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Core > Reporter: Aleksey Yeschenko > Priority: Normal > Labels: performance > Fix For: 4.x > > > The current serialization format for UDTs has a fixed overhead of 4 bytes per > defined field (encoding the size of the field). > It is inefficient for sparse UDTs - ones with many defined fields, but few of > them present. We could keep a bitset to indicate the missing fields, if any. > It's sub-optimal for encoding UDTs with all the values present as well. We > could use varint encoding for the field sizes of blob/text fields and encode > 'fixed' sized types directly, without the 4-bytes size prologue. > That or something more brilliant. Any improvement right now is lhf. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org