I have looked through the code mentioned. What I found in
ColumnSerializer was the use of VInt encoding. Are you proposing
switching directly to VInt encoding for sizes rather than one of the
other encodings, and using -2 as the first length to signal that the
new encoding is in use, so that existing data can still be read unchanged?
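A sentinel length like that could be dispatched on at read time. The sketch below only illustrates the idea under discussion; the class name, the -2 constant, and the fixed 4-byte legacy length field are assumptions for illustration, not Cassandra's actual ColumnSerializer code.

```java
import java.nio.ByteBuffer;

public class SentinelDispatch {
    // Hypothetical marker: -2 cannot occur as a legacy length,
    // so it can flag that the new size encoding follows.
    static final int NEW_ENCODING_SENTINEL = -2;

    // Returns which decoding path a reader would take for this buffer.
    static String decodingPath(ByteBuffer in) {
        int first = in.getInt(in.position()); // peek at the leading length
        if (first == NEW_ENCODING_SENTINEL) {
            in.getInt();       // consume the sentinel
            return "new";      // remaining bytes use the new encoding
        }
        return "legacy";       // existing sstables are read unchanged
    }

    public static void main(String[] args) {
        ByteBuffer legacy = ByteBuffer.allocate(4);
        legacy.putInt(12);     // an ordinary non-negative length
        legacy.flip();
        System.out.println(decodingPath(legacy));   // prints "legacy"

        ByteBuffer upgraded = ByteBuffer.allocate(4);
        upgraded.putInt(NEW_ENCODING_SENTINEL);
        upgraded.flip();
        System.out.println(decodingPath(upgraded)); // prints "new"
    }
}
```

The appeal of a sentinel is that old readers never see it and new readers fall through to the legacy path when it is absent, so no rewrite of existing sstables is needed.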
On 06/09/2022 16:37, Benedict wrote:
So, looking more closely at your proposal I realise what you are trying to do.
The thing that threw me was your mention of lists and other collections. This
will likely not work, as there is no index that can be defined on a list (or
other collection) within a single sstable: a list is defined over the whole
on-disk contents, so the index is undefined within a given sstable.
Tuple and UDT are encoded inefficiently if there are many null fields, but this
is a very localised change, affecting just one class. You should take a look at
Columns.Serializer for code you can lift for encoding and decoding sparse
subsets of fields.
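The sparse-subset idea can be illustrated as a presence bitmask followed by only the non-null values. This is a hedged sketch of the general technique, not the actual Columns.Serializer code; the class name, the single-long mask (so at most 64 fields), and writeUTF for the values are all assumptions for illustration.

```java
import java.io.*;

public class SparseFields {
    // Encode up to 64 fields: a presence bitmask, then only non-null values.
    static byte[] encode(String[] fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            long mask = 0;
            for (int i = 0; i < fields.length; i++)
                if (fields[i] != null) mask |= 1L << i;
            out.writeLong(mask);                // which fields are present
            for (String f : fields)
                if (f != null) out.writeUTF(f); // values for present fields only
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String[] decode(byte[] data, int fieldCount) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            long mask = in.readLong();
            String[] fields = new String[fieldCount];
            for (int i = 0; i < fieldCount; i++)
                if ((mask & (1L << i)) != 0) fields[i] = in.readUTF();
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a UDT that is mostly null pays one bit per absent field instead of a per-field length marker, which is the saving being discussed.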
It might be that this can be switched on or off per sstable with a header flag
bit, so that there is no additional cost for datasets that would not benefit.
We can likely also migrate to vint encoding for the component sizes (and
either 1 or 0 bytes for fixed-width values), no doubt saving a lot of space
over the status quo, even for small UDTs with few null entries.
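To illustrate why vint sizes help: a variable-length integer spends bytes in proportion to the magnitude of the value rather than a fixed 4 bytes, and component sizes are usually small. The sketch below is a generic LEB128-style unsigned varint, shown only to demonstrate the space saving; Cassandra's actual VIntCoding uses a different (length-prefixed) layout.

```java
public class VarIntDemo {
    // Write an unsigned value 7 bits at a time, low bits first;
    // the high bit of each byte says whether more bytes follow.
    static int writeUnsignedVInt(long value, byte[] out) {
        int i = 0;
        while ((value & ~0x7FL) != 0) {
            out[i++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[i++] = (byte) value;
        return i; // number of bytes written
    }

    public static void main(String[] args) {
        byte[] buf = new byte[10];
        // A small component size such as 23 takes 1 byte instead of 4.
        System.out.println(writeUnsignedVInt(23, buf));  // prints 1
        System.out.println(writeUnsignedVInt(300, buf)); // prints 2
    }
}
```

Values up to 127 fit in one byte and up to 16383 in two, so a tuple of small components shrinks substantially relative to fixed 4-byte sizes.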
Essentially at this point we’re talking about pushing through storage
optimisations applied elsewhere to tuples and UDT, which is a very
uncontroversial change.
On 6 Sep 2022, at 07:28, Benedict <benedictatapa...@icloud.com> wrote:
I agree a Jira would suffice, and if visibility there required a DISCUSS
thread or simply a notice sent to the list.
While we’re here, though: I don’t have a lot of time to engage in
discussion, but it’s unclear to me what advantage this encoding scheme brings.
It might be worth outlining what algorithmic advantage you foresee, for which
data distributions and which collection types.
On 6 Sep 2022, at 07:16, Claude Warren via dev <dev@cassandra.apache.org> wrote:
I am just learning the ropes here, so perhaps it is not CEP-worthy. That being
said, it felt like there was a lot of information to put into and track in a
ticket, particularly when I expected discussion about how best to encode,
changes to the algorithms, etc. It feels like it would be difficult to track.
But if that is standard for this project I will move the information there.
As to the benchmarking, I had thought that usage and performance measures
should be included. Thank you for calling out the query that selects a subset
of a collection as being of particular importance.
Claude
On 06/09/2022 03:11, Abe Ratnofsky wrote:
Looking at this link:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-23%3A++Enhancement+for+Sparse+Data+Serialization
Do you have any plans to include benchmarks in your test plan? It would be
useful to include disk usage / read performance / write performance comparisons
with the new encodings, particularly for sparse collections where a subset of
data is selected out of a collection.
I do wonder whether this is CEP-worthy. The CEP says that the changes will not
impact existing users, will be backwards compatible, and are overall an
efficiency improvement. The CEP guidelines say a CEP is encouraged “for
significant user-facing or changes that cut across multiple subsystems”. Any
reason why a Jira isn’t sufficient?
Abe
On Sep 5, 2022, at 1:57 AM, Claude Warren via dev <dev@cassandra.apache.org>
wrote:
I have just posted a CEP covering an Enhancement for Sparse Data Serialization.
This is in response to CASSANDRA-8959.
I look forward to responses.