Re: [DISCUSS] Using ACCP or tc-native by default

2023-06-23 Thread German Eichberger via dev
+1 to ACCP - we love performance.

From: David Capwell 
Sent: Thursday, June 22, 2023 4:21 PM
To: dev 
Subject: [EXTERNAL] Re: [DISCUSS] Using ACCP or tc-native by default

+1 to ACCP

On Jun 22, 2023, at 3:05 PM, C. Scott Andreas  wrote:

+1 for ACCP and can attest to its results. ACCP also optimizes for a range of 
hash functions and other cryptographic primitives beyond TLS acceleration for 
Netty.

On Jun 22, 2023, at 2:07 PM, Jeff Jirsa  wrote:


Either would be better than today.

On Thu, Jun 22, 2023 at 1:57 PM Jordan West  wrote:
Hi,

I’m wondering if there is appetite to change the default SSL provider for 
Cassandra going forward to either ACCP [1] or tc-native in Netty? Our 
deployment as well as others I’m aware of make this change in their fork and it 
can lead to significant performance improvement. When recently qualifying 4.1 
without using ACCP (by accident) we noticed p99 latencies were 2x higher than 
3.0 w/ ACCP. Wiring up ACCP can be a bit of a pain and also requires some 
amount of customization. I think it could be great for the wider community to 
adopt it.

The biggest hurdle I foresee is licensing but ACCP is Apache 2.0 licensed. 
Anything else I am missing before opening a JIRA and submitting a patch?

Jordan


[1]
https://github.com/corretto/amazon-corretto-crypto-provider
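
For anyone who has not wired ACCP up before, the registration step itself is
small. A minimal sketch of the idea (assuming the amazon-corretto-crypto-provider
jar is on the classpath; the provider class and INSTANCE field are as documented
in the ACCP README, and this is not Cassandra's actual integration, which also
has to happen before any SSLContext/Netty setup):

    import java.security.MessageDigest;
    import java.security.Security;

    import com.amazon.corretto.crypto.provider.AmazonCorrettoCryptoProvider;

    public class AccpBootstrapSketch
    {
        public static void main(String[] args) throws Exception
        {
            // Register ACCP ahead of the bundled JCE providers so cipher, digest
            // and signature lookups resolve to the native implementations first.
            Security.insertProviderAt(AmazonCorrettoCryptoProvider.INSTANCE, 1);

            // Any JCE lookup done after this point (TLS, hashing, etc.) goes
            // through ACCP first; here we just print which provider backs SHA-256.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            System.out.println("SHA-256 provider: " + digest.getProvider().getName());
        }
    }

Most of the customization mentioned above is about doing this registration early
enough in startup and verifying that the expected provider actually won the
lookup.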




Re: [DISCUSS] Using ACCP or tc-native by default

2023-06-23 Thread Francisco Guerrero
Great addition! +1 (nb) 

On 2023/06/23 13:37:02 Josh McKenzie wrote:
>  +1 here on inclusion by default.
> 
> On Fri, Jun 23, 2023, at 2:01 AM, Dinesh Joshi wrote:
> > This would be a good addition and would make Cassandra more performant out 
> > of the box.
> > 
> > Dinesh
> > 
> >> On Jun 22, 2023, at 9:45 PM, Jordan West  wrote:
> >> 
> >> Glad to see there is support for this! I think ACCP would be a good choice 
> >> since there seems to be a lot of experience deploying it. I’ve opened 
> >> https://issues.apache.org/jira/browse/CASSANDRA-18624. I should have some 
> >> time to work on the patch soon and I will try to provide some graphs that 
> >> show the performance benefit from a recent benchmark.  
> >> 
> >> Jordan
> >> 
> >> 
> >> On Thu, Jun 22, 2023 at 19:28 Fleming, Jackson 
> >>  wrote:
> >>> We run ACCP in production on 1000s of nodes across Cassandra 3.11 and 4 
> >>> with great results.
> >>> Would love to see it baked into Cassandra.
> >>> Jackson
> >>> *From: *David Capwell 
> >>> *Date: *Friday, 23 June 2023 at 9:22 am
> >>> *To: *dev 
> >>> *Subject: *Re: [DISCUSS] Using ACCP or tc-native by default
> >>> 
> >>> 
> >>> +1 to ACCP
> >>> 
> >>> 
>  On Jun 22, 2023, at 3:05 PM, C. Scott Andreas  
>  wrote:
>  +1 for ACCP and can attest to its results. ACCP also optimizes for a 
>  range of hash functions and other cryptographic primitives beyond TLS 
>  acceleration for Netty.
> > On Jun 22, 2023, at 2:07 PM, Jeff Jirsa  wrote:
> > Either would be better than today. 
> > On Thu, Jun 22, 2023 at 1:57 PM Jordan West  
> > wrote:
> >> Hi,
> >> I’m wondering if there is appetite to change the default SSL provider 
> >> for Cassandra going forward to either ACCP [1] or tc-native in Netty? 
> >> Our deployment as well as others I’m aware of make this change in 
> >> their fork and it can lead to significant performance improvement. 
> >> When recently qualifying 4.1 without using ACCP (by accident) we 
> >> noticed p99 latencies were 2x higher than 3.0 w/ ACCP. Wiring up ACCP 
> >> can be a bit of a pain and also requires some amount of customization. 
> >> I think it could be great for the wider community to adopt it. 
> >> The biggest hurdle I foresee is licensing but ACCP is Apache 2.0 
> >> licensed. Anything else I am missing before opening a JIRA and 
> >> submitting a patch?
> >> Jordan 
> >> [1] 
> >> https://github.com/corretto/amazon-corretto-crypto-provider


Re: [DISCUSS] Being specific about JDK versions and python lib versions in CI

2023-06-23 Thread Brandon Williams
On Fri, Jun 23, 2023 at 8:36 AM Josh McKenzie  wrote:
> How frequently do we see failures in tests (straight failure, flakes, etc) 
> that are due to the versions of dependencies (JDK or python) that we're 
> running the tests with?
>
> Regardless of 1, if the answer to 2 is "almost never", then there's likely no 
> bridge to cross here.

For python, it's definitely in that category.  There aren't even that
many things that _could_ cause a failure; if the XML parser gets
upgraded it still works the same way.


Re: [DISCUSS] Being specific about JDK versions and python lib versions in CI

2023-06-23 Thread Ekaterina Dimitrova
> Sounds like you were assuming the effort would require 3, Ekaterina?

Thank you Josh.
Yes, you are totally right. I forgot for a moment that the new scripts are
layered. Valid point about 2.
Now the question is how fast 2 will be to run and how many people would
actually use it for running full CI.

On Fri, 23 Jun 2023 at 9:37, Josh McKenzie  wrote:

> I'll take engagement. Even if it's broadly disagreement. :D
>
> (Full disclosure, I have a weak opinion here held weakly):
>
> Wouldn’t we recommend people to use the test images the project CI use?
> Thus using in testing the versions we use? I would assume the repeatable CI
> will still expect test images the way we have now?
>
> Not exactly, no. Check out CASSANDRA-18137, specifically:
>
> Proposal
>
> - ...
> - CI-agnostic build and test scripts can be run with docker, and without
> any CI, on any machine.
>
>
> I've been discussing this with mck (and working with him on that patch)
> and realistically what we have are architectural layers for CI:
>
>1. ant
>2. build/test scripts
>3. dockerized build/test scripts (using scripts from 2; removes env
>and setup variability)
>4. CI integrations (using 3 or output of 2)
>5. CI lifecycle (full stack setup, run, teardown)
>
> The debate there would be whether we accept CI results from environments
> using 2 (build/test scripts), or we require environments to use 3 (docker
> images so it's all fully uniform). Originally mick and I were leaning
> towards allowing results from environments using 2 which would then
> introduce possible drift in OS, JDK, and python env, which is where my
> questions on this thread came in. Sounds like you were assuming the effort
> would require 3, Ekaterina?
>
>  My concern with this is atrophy; we'll set the version once and when
> finally forced to update
>
> *In theory* we could pretty easily mitigate this through automation. For
> instance, having a branch where we run the test suite with no dependencies
> declared (i.e. run all latest) that we run periodically, and then update
> the main branch to reflect the last good run from that "always run latest"
> branch, could give us a blending of both worlds.
>
> So technically feasible or not, I think there's a good question that's
> surfacing here of "what problem would we be trying to solve". I think that
> pivots on 2 axes:
>
>1. Do we allow folks to certify in a CI environment using build/test
>scripts but not using specific docker images we provide on the project?
>2. How frequently do we see failures in tests (straight failure,
>flakes, etc) that are due to the versions of dependencies (JDK or python)
>that we're running the tests with?
>
> Regardless of 1, if the answer to 2 is "almost never", then there's likely
> no bridge to cross here.
>
> To your point JD, our pre-reqs for C* installation in our docs explicitly
> call out "Run Latest JDK", so conforming to that recommendation in our CI
> environment makes sense to me (see here).
>
> On Thu, Jun 22, 2023, at 6:47 PM, Jeremy Hanna wrote:
>
>
> I like having the latest patch release always brought in for testing and
> CI for the JDK versions explicitly supported. So for this release, 11/17.
>
> On Jun 22, 2023, at 5:42 PM, Jeremiah Jordan 
> wrote:
>
> 
> Yes.  -1 on forcing the patch release version, and possibly minor version,
> for anything that is not absolutely necessary to do so.  Especially for
> things like Java or Python version, I would hope we just install the latest
> Java 8, Java 11, or Java 17 JDK for the platform the image is built from
> and run under them.  Otherwise we don’t find out until it’s too late when
> some JDK update breaks things.  I know I always tell users to run the
> latest JDK patch release, so we should do the same.
>
> If we want to pin the major version of something, then I have no problem
> there.
>
> -Jeremiah
>
> On Jun 22, 2023 at 5:00:36 PM, Ekaterina Dimitrova 
> wrote:
>
> Wouldn’t we recommend people to use the test images the project CI use?
> Thus using in testing the versions we use? I would assume the repeatable CI
> will still expect test images the way we have now?
> (I hope I did not misunderstand the question)
>
> I also share similar worries with Brandon
>
> On Thu, 22 Jun 2023 at 17:48, Brandon Williams  wrote:
>
> On Thu, Jun 22, 2023 at 4:23 PM Josh McKenzie 
> wrote:
> >
> > 2. Anyone concerned about us being specific about versions in
> requirements.txt in the dtest repo?
>
> My concern with this is atrophy; we'll set the version once and when
> finally forced to update, find that a lot of other things must also be
> updated in order to do so.  I think our current approach of only
> setting them on things we require at a certain version like thrift has
> been successful thus far, and I don't think having different versions
> would be very common, but also not really affect repeatability if
> encountered.  You can see what versions are used from the logs though
> and could adjust them to be the same if necessary.

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Josh McKenzie
> If we’re doing this, why don’t we delta encode a vint from some per-sstable 
> minimum value? I’d expect that to commonly compress to a single byte or so.
+1 to this approach.

> Distant future people will not be happy about this, I can already tell you 
> now.
Eh, they'll all be AI's anyway and will just rewrite the code in a background 
thread.

On Fri, Jun 23, 2023, at 9:02 AM, Berenguer Blasi wrote:
> It's a possibility. Though I haven't coded and benchmarked such an 
> approach and I don't think I would have the time before the freeze to 
> take advantage of the sstable format change opportunity.
> 
> Still it's sthg that can be explored later. If we can shave a few extra 
> % then that would always be great imo.
> 
> On 23/6/23 13:57, Benedict wrote:
> > If we’re doing this, why don’t we delta encode a vint from some per-sstable 
> > minimum value? I’d expect that to commonly compress to a single byte or so.
> >
> >> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:
> >>
> >> Distant future people will not be happy about this, I can already tell 
> >> you now.
> >>
> >> Sounds like a reasonable improvement to me however.
> >>
> >>> On 23 Jun 2023, at 07:22, Berenguer Blasi  
> >>> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
> >>> noticed that with 7 bytes we can already encode ~2284 years. We can 
> >>> either shed the 8th byte, for reduced IO and disk, or can encode some 
> >>> sentinel values (such as LIVE) as flags there. That would mean reading 
> >>> and writing 1 byte instead of 12 (8 mfda long + 4 ldts int). Yes we 
> >>> already avoid serializing DeletionTime (DT) in sstables at _row_ level 
> >>> entirely but not at _partition_ level and it is also serialized at index, 
> >>> metadata, etc.
> >>>
> >>> So here's a POC: 
> >>> https://github.com/bereng/cassandra/commits/ldtdeser-trunk and some jmh 
> >>> (1) to evaluate the impact of the new alg (2). It's tested here against a 
> >>> 70% and a 30% LIVE DTs  to see how we perform:
> >>>
> >>>  [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
> >>>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
> >>>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
> >>>
> >>> That was ByteBuffer backed to test the extra bit level operations impact. 
> >>> But what would be the impact of an end to end test against disk?
> >>>
> >>>  [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
> >>>  [java] DeletionTimeDeSerBench.testE2E

Re: [DISCUSS] Using ACCP or tc-native by default

2023-06-23 Thread Josh McKenzie
 +1 here on inclusion by default.

On Fri, Jun 23, 2023, at 2:01 AM, Dinesh Joshi wrote:
> This would be a good addition and would make Cassandra more performant out of 
> the box.
> 
> Dinesh
> 
>> On Jun 22, 2023, at 9:45 PM, Jordan West  wrote:
>> 
>> Glad to see there is support for this! I think ACCP would be a good choice 
>> since there seems to be a lot of experience deploying it. I’ve opened 
>> https://issues.apache.org/jira/browse/CASSANDRA-18624. I should have some 
>> time to work on the patch soon and I will try to provide some graphs that 
>> show the performance benefit from a recent benchmark.  
>> 
>> Jordan
>> 
>> 
>> On Thu, Jun 22, 2023 at 19:28 Fleming, Jackson  
>> wrote:
>>> We run ACCP in production on 1000s of nodes across Cassandra 3.11 and 4 
>>> with great results.
>>> Would love to see it baked into Cassandra.
>>> Jackson
>>> *From: *David Capwell 
>>> *Date: *Friday, 23 June 2023 at 9:22 am
>>> *To: *dev 
>>> *Subject: *Re: [DISCUSS] Using ACCP or tc-native by default
>>> 
>>> 
>>> +1 to ACCP
>>> 
>>> 
 On Jun 22, 2023, at 3:05 PM, C. Scott Andreas  
 wrote:
 +1 for ACCP and can attest to its results. ACCP also optimizes for a range 
 of hash functions and other cryptographic primitives beyond TLS 
 acceleration for Netty.
> On Jun 22, 2023, at 2:07 PM, Jeff Jirsa  wrote:
> Either would be better than today. 
> On Thu, Jun 22, 2023 at 1:57 PM Jordan West  wrote:
>> Hi,
>> I’m wondering if there is appetite to change the default SSL provider 
>> for Cassandra going forward to either ACCP [1] or tc-native in Netty? 
>> Our deployment as well as others I’m aware of make this change in their 
>> fork and it can lead to significant performance improvement. When 
>> recently qualifying 4.1 without using ACCP (by accident) we noticed p99 
>> latencies were 2x higher than 3.0 w/ ACCP. Wiring up ACCP can be a bit 
>> of a pain and also requires some amount of customization. I think it 
>> could be great for the wider community to adopt it. 
>> The biggest hurdle I foresee is licensing but ACCP is Apache 2.0 
>> licensed. Anything else I am missing before opening a JIRA and 
>> submitting a patch?
>> Jordan 
>> [1] 
>> https://github.com/corretto/amazon-corretto-crypto-provider

Re: [DISCUSS] Being specific about JDK versions and python lib versions in CI

2023-06-23 Thread Josh McKenzie
I'll take engagement. Even if it's broadly disagreement. :D

(Full disclosure, I have a weak opinion here held weakly):
> Wouldn’t we recommend people to use the test images the project CI use? Thus 
> using in testing the versions we use? I would assume the repeatable CI will 
> still expect test images the way we have now?
Not exactly, no. Check out CASSANDRA-18137, specifically:
> Proposal
>  • ...
>  • CI-agnostic build and test scripts can be run with docker, and without 
> any CI, on any machine.

I've been discussing this with mck (and working with him on that patch) and 
realistically what we have are architectural layers for CI:
 1. ant
 2. build/test scripts
 3. dockerized build/test scripts (using scripts from 2; removes env and setup 
variability)
 4. CI integrations (using 3 or output of 2)
 5. CI lifecycle (full stack setup, run, teardown)
The debate there would be whether we accept CI results from environments using 
2 (build/test scripts), or we require environments to use 3 (docker images so 
it's all fully uniform). Originally mick and I were leaning towards allowing 
results from environments using 2 which would then introduce possible drift in 
OS, JDK, and python env, which is where my questions on this thread came in. 
Sounds like you were assuming the effort would require 3, Ekaterina?

>  My concern with this is atrophy; we'll set the version once and when finally 
> forced to update
*In theory* we could pretty easily mitigate this through automation. For 
instance, having a branch where we run the test suite with no dependencies 
declared (i.e. run all latest) that we run periodically, and then update the 
main branch to reflect the last good run from that "always run latest" branch, 
could give us a blending of both worlds.

So technically feasible or not, I think there's a good question that's 
surfacing here of "what problem would we be trying to solve". I think that 
pivots on 2 axes:
 1. Do we allow folks to certify in a CI environment using build/test scripts 
but not using specific docker images we provide on the project?
 2. How frequently do we see failures in tests (straight failure, flakes, etc) 
that are due to the versions of dependencies (JDK or python) that we're running 
the tests with?
Regardless of 1, if the answer to 2 is "almost never", then there's likely no 
bridge to cross here.

To your point JD, our pre-reqs for C* installation in our docs explicitly call 
out "Run Latest JDK", so conforming to that recommendation in our CI 
environment makes sense to me (see here).

On Thu, Jun 22, 2023, at 6:47 PM, Jeremy Hanna wrote:
> 
> I like having the latest patch release always brought in for testing and CI 
> for the JDK versions explicitly supported. So for this release, 11/17.
> 
>> On Jun 22, 2023, at 5:42 PM, Jeremiah Jordan  
>> wrote:
>> 
>> Yes.  -1 on forcing the patch release version, and possibly minor version, 
>> for anything that is not absolutely necessary to do so.  Especially for 
>> things like Java or Python version, I would hope we just install the latest 
>> Java 8, Java 11, or Java 17 JDK for the platform the image is built from and 
>> run under them.  Otherwise we don’t find out until it’s too late when some 
>> JDK update breaks things.  I know I always tell users to run the latest JDK 
>> patch release, so we should do the same.
>> 
>> If we want to pin the major version of something, then I have no problem 
>> there.
>> 
>> -Jeremiah
>> 
>> On Jun 22, 2023 at 5:00:36 PM, Ekaterina Dimitrova  
>> wrote:
>>> Wouldn’t we recommend people to use the test images the project CI use? 
>>> Thus using in testing the versions we use? I would assume the repeatable CI 
>>> will still expect test images the way we have now? 
>>> (I hope I did not misunderstand the question)
>>> 
>>> I also share similar worries with Brandon
>>> 
>>> On Thu, 22 Jun 2023 at 17:48, Brandon Williams  wrote:
 On Thu, Jun 22, 2023 at 4:23 PM Josh McKenzie  wrote:
 >
 > 2. Anyone concerned about us being specific about versions in 
 > requirements.txt in the dtest repo?
 
 My concern with this is atrophy; we'll set the version once and when
 finally forced to update, find that a lot of other things must also be
 updated in order to do so.  I think our current approach of only
 setting them on things we require at a certain version like thrift has
 been successful thus far, and I don't think having different versions
 would be very common, but also not really affect repeatability if
 encountered.  You can see what versions are used from the logs though
 and could adjust them to be the same if necessary.


Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Berenguer Blasi
It's a possibility. Though I haven't coded and benchmarked such an 
approach and I don't think I would have the time before the freeze to 
take advantage of the sstable format change opportunity.


Still it's sthg that can be explored later. If we can shave a few extra 
% then that would always be great imo.


On 23/6/23 13:57, Benedict wrote:

If we’re doing this, why don’t we delta encode a vint from some per-sstable 
minimum value? I’d expect that to commonly compress to a single byte or so.


On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:

Distant future people will not be happy about this, I can already tell you now.

Sounds like a reasonable improvement to me however.


On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:

Hi all,

DeletionTime.markedForDeleteAt is a long holding microseconds since the Unix 
epoch. But I noticed that with 7 bytes we can already encode ~2284 years. We can 
either shed 
the 8th byte, for reduced IO and disk, or can encode some sentinel values (such 
as LIVE) as flags there. That would mean reading and writing 1 byte instead of 
12 (8 mfda long + 4 ldts int). Yes we already avoid serializing DeletionTime 
(DT) in sstables at _row_ level entirely but not at _partition_ level and it is 
also serialized at index, metadata, etc.
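
To make the flag idea concrete, here is a toy sketch of one possible layout; it 
is an illustration only, not the actual encoding in the POC branch linked below. 
The top byte of the old 8-byte markedForDeleteAt field is reused as a flags 
byte: a set LIVE bit means nothing else follows, otherwise the low 7 bytes of 
markedForDeleteAt and the 4-byte localDeletionTime follow.

    import java.nio.ByteBuffer;

    public final class DeletionTimeCodecSketch
    {
        private static final byte FLAG_LIVE = 0x01;
        private static final long SEVEN_BYTE_MASK = (1L << 56) - 1;

        public static void serialize(boolean live, long markedForDeleteAt, int localDeletionTime, ByteBuffer out)
        {
            if (live)
            {
                out.put(FLAG_LIVE);                           // 1 byte instead of 12
                return;
            }
            out.put((byte) 0);                                // flags byte, no LIVE bit
            long mfda = markedForDeleteAt & SEVEN_BYTE_MASK;
            for (int shift = 48; shift >= 0; shift -= 8)      // 7 bytes, big-endian
                out.put((byte) (mfda >>> shift));
            out.putInt(localDeletionTime);
        }

        // Returns null for LIVE, otherwise { markedForDeleteAt, localDeletionTime }.
        public static long[] deserialize(ByteBuffer in)
        {
            if ((in.get() & FLAG_LIVE) != 0)
                return null;
            long mfda = 0;
            for (int i = 0; i < 7; i++)
                mfda = (mfda << 8) | (in.get() & 0xFF);
            return new long[] { mfda, in.getInt() };
        }
    }

Note that the tombstone path in this toy version is still 12 bytes; only the 
LIVE/sentinel case shrinks to a single byte.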

So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk and 
some jmh (1) to evaluate the impact of the new alg (2). It's tested here 
against a 70% and a 30% LIVE DTs  to see how we perform:

 [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
 [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
 [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
 [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
 [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
 [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op

That was ByteBuffer backed to test the extra bit level operations impact. But 
what would be the impact of an end to end test against disk?

 [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
 [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
 [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op

We can see big improvements when backed with the disk and little impact from 
the new alg.

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Benedict
If we’re doing this, why don’t we delta encode a vint from some per-sstable 
minimum value? I’d expect that to commonly compress to a single byte or so.
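
For illustration, a self-contained sketch of the delta-plus-vint idea; it uses a
protobuf-style unsigned varint purely for readability rather than Cassandra's
own vint helpers, and assumes the per-sstable minimum is stored once in the
sstable metadata:

    import java.io.ByteArrayOutputStream;

    public final class DeltaVIntSketch
    {
        // Write markedForDeleteAt as a variable-length delta from the sstable minimum.
        public static byte[] encodeDelta(long markedForDeleteAt, long sstableMinMarkedForDeleteAt)
        {
            long delta = markedForDeleteAt - sstableMinMarkedForDeleteAt; // assumed non-negative
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while ((delta & ~0x7FL) != 0)
            {
                out.write((int) ((delta & 0x7F) | 0x80)); // low 7 bits, continuation bit set
                delta >>>= 7;
            }
            out.write((int) delta);                       // final byte, continuation bit clear
            return out.toByteArray();
        }

        public static long decodeDelta(byte[] encoded, long sstableMinMarkedForDeleteAt)
        {
            long delta = 0;
            int shift = 0;
            for (byte b : encoded)
            {
                delta |= (long) (b & 0x7F) << shift;
                shift += 7;
            }
            return sstableMinMarkedForDeleteAt + delta;
        }
    }

A delta that fits in 7 bits encodes to a single byte, and each additional 7 bits
of magnitude costs one more byte.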

> On 23 Jun 2023, at 12:55, Aleksey Yeshchenko  wrote:
> 
> Distant future people will not be happy about this, I can already tell you 
> now.
> 
> Sounds like a reasonable improvement to me however.
> 
>> On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:
>> 
>> Hi all,
>> 
>> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
>> noticed that with 7 bytes we can already encode ~2284 years. We can either 
>> shed the 8th byte, for reduced IO and disk, or can encode some sentinel 
>> values (such as LIVE) as flags there. That would mean reading and writing 1 
>> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid 
>> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at 
>> _partition_ level and it is also serialized at index, metadata, etc.
>> 
>> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk 
>> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here 
>> against a 70% and a 30% LIVE DTs  to see how we perform:
>> 
>> [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
>> [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
>> [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
>> 
>> That was ByteBuffer backed to test the extra bit level operations impact. 
>> But what would be the impact of an end to end test against disk?
>> 
>> [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>> [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>> [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
>> 
>> We can see big improvements when backed with the disk and little impact from 
>> the new alg.
>> 
>> Given we're already introducing a new sstable format (OA) in 5.0 I would like 
>> to try to get this in before the freeze.

Re: Improved DeletionTime serialization to reduce disk size

2023-06-23 Thread Aleksey Yeshchenko
Distant future people will not be happy about this, I can already tell you now.

Sounds like a reasonable improvement to me however.

> On 23 Jun 2023, at 07:22, Berenguer Blasi  wrote:
> 
> Hi all,
> 
> DeletionTime.markedForDeleteAt is a long useconds since Unix Epoch. But I 
> noticed that with 7 bytes we can already encode ~2284 years. We can either 
> shed the 8th byte, for reduced IO and disk, or can encode some sentinel 
> values (such as LIVE) as flags there. That would mean reading and writing 1 
> byte instead of 12 (8 mfda long + 4 ldts int). Yes we already avoid 
> serializing DeletionTime (DT) in sstables at _row_ level entirely but not at 
> _partition_ level and it is also serialized at index, metadata, etc.
> 
> So here's a POC: https://github.com/bereng/cassandra/commits/ldtdeser-trunk 
> and some jmh (1) to evaluate the impact of the new alg (2). It's tested here 
> against a 70% and a 30% LIVE DTs  to see how we perform:
> 
>  [java] Benchmark                                (liveDTPcParam)  (sstableParam)  Mode  Cnt  Score   Error  Units
>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              NC  avgt   15  0.331 ± 0.001  ns/op
>  [java] DeletionTimeDeSerBench.testRawAlgReads          70PcLive              OA  avgt   15  0.335 ± 0.004  ns/op
>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              NC  avgt   15  0.334 ± 0.002  ns/op
>  [java] DeletionTimeDeSerBench.testRawAlgReads          30PcLive              OA  avgt   15  0.340 ± 0.008  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              NC  avgt   15  0.337 ± 0.006  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         70PcLive              OA  avgt   15  0.340 ± 0.004  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              NC  avgt   15  0.339 ± 0.004  ns/op
>  [java] DeletionTimeDeSerBench.testNewAlgWrites         30PcLive              OA  avgt   15  0.343 ± 0.016  ns/op
> 
> That was ByteBuffer backed to test the extra bit level operations impact. But 
> what would be the impact of an end to end test against disk?
> 
>  [java] Benchmark                                    (diskRAMParam)  (liveDTPcParam)  (sstableParam)  Mode  Cnt        Score        Error  Units
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              NC  avgt   15   605236.515 ±  19929.058  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         70PcLive              OA  avgt   15   586477.039 ±   7384.632  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              NC  avgt   15   937580.311 ±  30669.647  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT             RAM         30PcLive              OA  avgt   15   914097.770 ±   9865.070  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              NC  avgt   15  1314417.207 ±  37879.012  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         70PcLive              OA  avgt   15   805256.345 ±  15471.587  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              NC  avgt   15  1583239.011 ±  50104.245  ns/op
>  [java] DeletionTimeDeSerBench.testE2EDeSerializeDT            Disk         30PcLive              OA  avgt   15  1439605.006 ±  64342.510  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              NC  avgt   15   295711.217 ±   5432.507  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         70PcLive              OA  avgt   15   305282.827 ±   1906.841  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              NC  avgt   15   446029.899 ±   4038.938  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT               RAM         30PcLive              OA  avgt   15   479085.875 ±  10032.804  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              NC  avgt   15  1789434.838 ± 206455.771  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         70PcLive              OA  avgt   15   589752.861 ±  31676.265  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              NC  avgt   15  1754862.122 ± 164903.051  ns/op
>  [java] DeletionTimeDeSerBench.testE2ESerializeDT              Disk         30PcLive              OA  avgt   15  1252162.253 ± 121626.818  ns/op
> 
> We can see big improvements when backed with the disk and little impact from 
> the new alg.
> 
> Given we're already introducing a new sstable format (OA) in 5.0 I would like 
> to try to get this in before the freeze. The point being that sstables with 
> lots of small partitions would benefit from a smaller DT at partition level. 
> My tests show a 3%-4% size reduction on disk.
> 
> Before proceeding though I'd like to bounce the idea agains

Re: [DISCUSS] Allow UPDATE on settings virtual table to change running configuration

2023-06-23 Thread Maxim Muzafarov
Hello everyone,


As there has been little feedback on which option to go with, and having
weighed the pros and cons of each, I tend to agree with the approach to
this problem proposed by David :-) After a lot of discussion on Slack, we
arrived at the @ValidatedBy annotation, which points to a validation
method for a property; this should address all our concerns and issues
with validation.

I'd like to raise the visibility of these changes and try to find one
more committer to look at them:
https://issues.apache.org/jira/browse/CASSANDRA-15254
https://github.com/apache/cassandra/pull/2334/files

I'd really appreciate any kind of review in advance.


Despite the size of the change (+2,043 −302), most of the additions are
tests. I would like to highlight the crucial design points required to
make the SettingsTable virtual table updatable. Some of these have
already been discussed in this thread; a brief outline of them should
help facilitate the PR review.

So, what are the problems that have been solved to make the
SettingsTable updatable?

1. Input validation.

Currently, JMX, Yaml loading and DatabaseDescriptor#apply each validate
user input for the same property in their own way. Fortunately this
usually results in a consistent configuration state, but not always;
CASSANDRA-17734 is a good example of where it does not.

The @ValidatedBy annotation, which points to a validation method, has
been added to address this particular problem. No matter which API is
triggered, the method will be called to validate the input; it also
works even when cassandra.yaml is loaded by the yaml engine in a
pre-parse state, as we now do when checking input properties for
deprecation and nullability.

There are two types of validation worth mentioning:
- stateless - properties do not depend on any other configuration;
- stateful - properties that require a fully-constructed Config
instance to be validated and those values depend on other properties;

For the sake of simplicity, the scope of this task will be limited to
dealing with stateless properties only, but stateful validations are
also supported in the initial PR using property change listeners.
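
For reviewers who have not opened the PR yet, the general shape is roughly the
following. This is a hedged sketch only: the names, signatures and the example
field are illustrative, and the authoritative versions are in the PR linked
above.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    public class ValidatedConfigSketch
    {
        @Retention(RetentionPolicy.RUNTIME)
        @Target(ElementType.FIELD)
        public @interface ValidatedBy
        {
            String useClass() default "";   // class holding the validation method
            String useMethod();             // validation method to invoke
        }

        // A stateless check: it needs only the candidate value, not a full Config.
        public static void validatePositiveMillis(String name, long valueInMillis)
        {
            if (valueInMillis <= 0)
                throw new IllegalArgumentException(name + " must be a positive number of milliseconds, got " + valueInMillis);
        }

        // Whichever API performs the update (yaml load, JMX, or an update of the
        // settings virtual table), the framework reflectively invokes the method
        // named by the annotation before the new value is applied.
        @ValidatedBy(useMethod = "validatePositiveMillis")
        public long example_timeout_in_ms = 10_000;
    }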

2. Property mutability.

There is currently no way of distinguishing which properties are
mutable at runtime and which are not. This meta-information must be
available at runtime, and as we discussed earlier the @Mutable
annotation has been added to handle this.

3. Listening for property changes.

Some internal components, e.g. CommitLog, may perform operations
and/or calculations just before or just after a property change. As
long as JMX is the only API used to update configuration properties
this is not a problem, but any new update path must trigger the same
hooks. To address this, the observer pattern has been used to maintain
the same behaviour.
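
A minimal sketch of that listener idea (the names here are made up; the real
interfaces live in the PR):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public final class PropertyChangeSketch
    {
        public interface PropertyChangeListener<T>
        {
            default void before(String name, T oldValue, T newValue) {}
            default void after(String name, T oldValue, T newValue) {}
        }

        private final List<PropertyChangeListener<Object>> listeners = new CopyOnWriteArrayList<>();

        public void register(PropertyChangeListener<Object> listener)
        {
            listeners.add(listener);
        }

        // Called by the config update path once the new value has passed validation,
        // regardless of whether the change came from JMX or the virtual table.
        public void apply(String name, Object oldValue, Object newValue, Runnable doUpdate)
        {
            for (PropertyChangeListener<Object> l : listeners)
                l.before(name, oldValue, newValue);
            doUpdate.run();                              // e.g. assign the field on Config
            for (PropertyChangeListener<Object> l : listeners)
                l.after(name, oldValue, newValue);
        }
    }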

4. SettingsTable input/output format.

JMX, SettingsTable and Yaml accept values in different formats, which
may not be compatible in some cases, especially when representing
composite objects: JMX and SettingsTable rely on toString() for output,
while Yaml uses a human-readable format.

So, in order to see the same properties in the same format through
different APIs, the Yaml representation is reused to display the
values and to parse a user input in case of update/set operations.

Although the output format between APIs matches in the vast majority
of cases here is the list of configuration properties that do not
match:
- memtable.configurations
- sstable_formats
- seed_provider.parameters
- data_file_directories

The test illustrates the problem:
https://github.com/apache/cassandra/pull/2334/files#diff-e94bb80f12622412fff9d87b58733e0549ba3024a54714516adc8bc70709933bR319

This could be a regression as the output format is changed, but this
seems to be the correct change to go with. We can keep it as is, or
create SettingsTableV2, which seems to be unlikely.

The examples of the format change:
-
Property 'seed_provider.parameters' expected:
 {seeds=127.0.0.1:7012}
Property 'seed_provider.parameters' actual:
 seeds: 127.0.0.1:7012
-
Property 'data_file_directories' expected:
 [Ljava.lang.String;@436813f3
Property 'data_file_directories' actual:
 [build/test/cassandra/data]
-
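
For what it's worth, the round-trip itself is straightforward with the SnakeYAML
engine Cassandra already bundles; a small standalone illustration of the idea
(not the code in the PR):

    import java.util.List;
    import java.util.Map;

    import org.yaml.snakeyaml.Yaml;

    public final class YamlRoundTripSketch
    {
        public static void main(String[] args)
        {
            Yaml yaml = new Yaml();

            // toString() on an array gives "[Ljava.lang.String;@..." as in the
            // "expected" example above, whereas dumping through the yaml engine
            // gives a human-readable form like the "actual" output above.
            List<String> dataFileDirectories = List.of("build/test/cassandra/data");
            System.out.println("data_file_directories: " + yaml.dump(dataFileDirectories).trim());

            // User input typed in that same yaml form can be parsed back into a value.
            Map<String, Object> parsed = yaml.load("seeds: 127.0.0.1:7012");
            System.out.println("parsed seed parameters: " + parsed);
        }
    }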

On Mon, 1 May 2023 at 15:11, Maxim Muzafarov  wrote:
>
> Hello everyone,
>
>
> I want to continue this topic and share another properties validation
> option/solution that emerged from my investigation of Cassandra and
> Accord configuration that could be used to make the virtual table
> SettingsTable updatable, as each update must move Config from one
> consistent state to another. The solution is based on a few
> assumptions: we don't frequently update the running configuration, and
> we want to base a solution on established Cassandra validation
> approaches to minimise the impact on