I have looked through the code mentioned.  What I found in Columns.Serializer 
was the use of VInt encoding.  Are you proposing switching directly to VInt 
encoding for sizes rather than one of the other encodings, and using -2 as the 
first length to signal that the new encoding is in use, so that existing data 
can still be read unchanged?
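
If so, something like the following check is roughly what I have in mind. This 
is only an illustrative sketch: the sentinel constant and helper names are mine, 
not from the codebase, and I am assuming the legacy format writes a 4-byte 
signed length before each component (with -1 already meaning null):

    import java.io.DataInputStream;
    import java.io.IOException;

    public class SentinelSketch
    {
        // Assumption: -1 already means null in the legacy format, and -2
        // never occurs as a legacy length, so -2 can safely flag the new
        // encoding without breaking old readers of old data.
        static final int SPARSE_SENTINEL = -2;

        static void readComponent(DataInputStream in) throws IOException
        {
            int first = in.readInt();         // legacy 4-byte length prefix
            if (first == SPARSE_SENTINEL)
                readSparse(in);               // new VInt-based encoding follows
            else
                readLegacy(in, first);        // ordinary legacy length
        }

        static void readSparse(DataInputStream in)             { /* new decoder */ }
        static void readLegacy(DataInputStream in, int length) { /* old decoder */ }
    }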

On 06/09/2022 16:37, Benedict wrote:
So, looking more closely at your proposal, I realise what you are trying to do. 
The thing that threw me was your mention of lists and other collections. This 
will likely not work, as there is no index that can be defined on a list (or 
other collection) within a single sstable: a list is defined over the whole 
on-disk contents, so the index is undefined within any given sstable.

Tuples and UDTs are encoded inefficiently if there are many null fields, but 
fixing this is a very localised change, affecting just one class. You should 
take a look at Columns.Serializer for code you can lift for encoding and 
decoding sparse subsets of fields, roughly along the lines sketched below.
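
Roughly the idea, as an illustrative sketch only (the names are made up here, 
this is not the actual Columns.Serializer wire format, and real code would not 
be limited to 64 fields):

    import java.io.*;

    public class SparseFieldsSketch
    {
        // Serialize only the non-null fields, preceded by a bitmap saying
        // which fields are present. Sketch assumes at most 64 fields.
        static void serialize(byte[][] fields, DataOutputStream out) throws IOException
        {
            long present = 0;
            for (int i = 0; i < fields.length; i++)
                if (fields[i] != null)
                    present |= 1L << i;
            out.writeLong(present);          // which fields follow
            for (byte[] f : fields)
            {
                if (f == null)
                    continue;                // a null costs only its bitmap bit
                out.writeInt(f.length);
                out.write(f);
            }
        }

        static byte[][] deserialize(int fieldCount, DataInputStream in) throws IOException
        {
            long present = in.readLong();
            byte[][] fields = new byte[fieldCount][];
            for (int i = 0; i < fieldCount; i++)
            {
                if ((present & (1L << i)) == 0)
                    continue;                // field was null
                byte[] f = new byte[in.readInt()];
                in.readFully(f);
                fields[i] = f;
            }
            return fields;
        }
    }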

It might be that this can be switched on or off per sstable with a header flag 
bit, so that there is no additional cost for datasets that would not benefit. 
We can likely also migrate to vint encoding for the component sizes (and either 
one byte or zero bytes for fixed-width values), no doubt saving a lot of space 
over the status quo, even for small UDTs with few null entries. Something along 
these lines:
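
(Again just a sketch: this uses a LEB128-style varint for illustration, which 
is similar in spirit to, but not byte-identical with, the vint encoding already 
in the codebase, and the fixedWidth convention below is invented for the 
example:)

    import java.io.*;

    public class ComponentSizeSketch
    {
        // LEB128-style unsigned varint: small sizes take one byte instead
        // of the four bytes a fixed int prefix costs today.
        static void writeUnsignedVInt(long v, DataOutputStream out) throws IOException
        {
            while ((v & ~0x7FL) != 0)
            {
                out.writeByte((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.writeByte((int) v);
        }

        // fixedWidth < 0 means the type is variable-width and needs a size
        // prefix; otherwise the type itself tells the reader how many bytes
        // to consume, so the size costs zero bytes on disk.
        static void writeComponent(byte[] value, int fixedWidth, DataOutputStream out) throws IOException
        {
            if (fixedWidth < 0)
                writeUnsignedVInt(value.length, out);
            out.write(value);
        }
    }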

Essentially at this point we’re talking about pushing through to tuples and 
UDTs the storage optimisations already applied elsewhere, which is a very 
uncontroversial change.

On 6 Sep 2022, at 07:28, Benedict <benedictatapa...@icloud.com> wrote:

I agree a Jira would suffice, with a DISCUSS thread or simply a notice sent to 
the list if more visibility is needed.

While we’re here, though: I don’t have a lot of time to engage in discussion, 
but it’s unclear to me what advantage this encoding scheme brings. It might be 
worth outlining what algorithmic advantage you foresee, for which data 
distributions, in which collection types.

On 6 Sep 2022, at 07:16, Claude Warren via dev <dev@cassandra.apache.org> wrote:

I am just learning the ropes here, so perhaps it is not CEP-worthy.  That being 
said, it felt like there was a lot of information to put into and track in a 
ticket, particularly when I expected discussion about how best to encode, 
changes to the algorithms, etc.  It feels like it would be difficult to track, 
but if that is standard for this project I will move the information there.

As to the benchmarking, I had thought that usage and performance measures 
should be included.  Thank you for calling out queries that select a subset of 
data from a collection as being of particular importance.

Claude

On 06/09/2022 03:11, Abe Ratnofsky wrote:
Looking at this link: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization

Do you have any plans to include benchmarks in your test plan? It would be 
useful to include disk usage / read performance / write performance comparisons 
with the new encodings, particularly for sparse collections where a subset of 
data is selected out of a collection.

I do wonder whether this is CEP-worthy. The CEP says that the changes will not 
impact existing users, will be backwards compatible, and overall is an 
efficiency improvement. The CEP guidelines say a CEP is encouraged “for 
significant user-facing or changes that cut across multiple subsystems”. Any 
reason why a Jira isn’t sufficient?

Abe

On Sep 5, 2022, at 1:57 AM, Claude Warren via dev <dev@cassandra.apache.org> 
wrote:
I have just posted a CEP covering an Enhancement for Sparse Data Serialization. 
This is in response to CASSANDRA-8959.

I look forward to responses.

