[jira] [Comment Edited] (CASSANDRA-8959) More efficient frozen UDT, tuple and collection serialization format

Claude Warren (Jira) Mon, 25 Jul 2022 06:12:10 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570888#comment-17570888
 ]


Claude Warren edited comment on CASSANDRA-8959 at 7/25/22 1:11 PM:
-------------------------------------------------------------------

It has been awhile since this was looked at and I am just getting familiar with 
the code base.  But reading this issue and looking at UserTypeSerializer it 
looks like the issue is that the UserType (and other collections?) can have 
`null` entries and the issue is that for sparse collections the serialisation 
is not efficient.

I think that there are 3 reasonable serialisation techniques:
 # The current serialisation for non-sparse values
 # A bitmap where each bit represents the presence/absence of a specific 
indices.
 # A list of integers specifying the presence of specific indices.

 

The bitmap could be structured as
 # int length
 # byte[length]
 # Standard serialised data elements

The array of integers could be structured as
 # int length
 # int[length]
 # Standard serialised data elements

In the current serialisation the first item is an integer count of the number 
of items.  Since there can not be a negative count we could set a flag as the 
left most bit.  so if the number is negative then the first byte (b) is an 
encoding descriptor where 

{{if (0x80 & b) == 0 then [standard processing]}}

{{if b == 0x80 then [bitmap processing]}}

{{if b == 0xC0 then [array processing]}}

{{If we limit the number of items in the structures to 2^30-1 (1,073,741,823) 
then we can calculate the length for the bitmap and array as int length = 
access.getInt( value ) & 0x7FFFFFFF  }}

 

If n is the number of values in the collection and p is the number of values 
that are non-null (populated) then using array encoding is more efficient if 
n/32 > p  


was (Author: claudenw):
It has been awhile since this was looked at and I am just getting familiar with 
the code base.  But reading this issue and looking at UserTypeSerializer it 
looks like the issue is that the UserType (and other collections?) can have 
`null` entries and the issue is that for sparse collections the serialisation 
is not efficient.

I think that there are 3 reasonable serialisation techniques:
 # The current serialisation for non-sparse values
 # A bitmap where each bit represents the presence/absence of a specific 
indices.
 # A list of integers specifying the presence of specific indices.

 

The bitmap could be structured as
 # int length
 # byte[length]
 # Standard serialised data elements

The array of integers could be structured as
 # int length
 # int[length]
 # Standard serialised data elements

In the current serialisation the first item is an integer count of the number 
of items.  Since there can not be a negative count we could set a flag as the 
left most bit.  so if the number is negative then the first byte (b) is an 
encoding descriptor where 

{{if (0x80 & b) == 0 then [standard processing]}}

{{if b == 0x80 then [bitmap processing]}}

{{if b == 0xC0 then [array processing]}}

{{If we limit the number of items in the structures to 2^30-1 (1,073,741,823) 
then we can calculate the length for the bitmap and array as int length = 
access.getInt( value ) & 0x7FFFFFFF  }}

 

  

> More efficient frozen UDT, tuple and collection serialization format
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-8959
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8959
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Legacy/Core
>            Reporter: Aleksey Yeschenko
>            Priority: Normal
>              Labels: performance
>             Fix For: 4.x
>
>
> The current serialization format for UDTs has a fixed overhead of 4 bytes per 
> defined field (encoding the size of the field).
> It is inefficient for sparse UDTs - ones with many defined fields, but few of 
> them present. We could keep a bitset to indicate the missing fields, if any.
> It's sub-optimal for encoding UDTs with all the values present as well. We 
> could use varint encoding for the field sizes of blob/text fields and encode 
> 'fixed' sized types directly, without the 4-bytes size prologue.
> That or something more brilliant. Any improvement right now is lhf.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Comment Edited] (CASSANDRA-8959) More efficient frozen UDT, tuple and collection serialization format

Reply via email to