Re: [DISCUSS] Should separate snapshots with the same name be allowed in different tables?

2023-03-07 Thread Miklosovic, Stefan
I agree too. Given that the method checking the uniqueness of a snapshot name was 
implemented first, it seems to me that the second method, which does not check it, 
simply forgot to do so rather than skipping the check intentionally.

Is it right to assume that we need to address this first in order to implement 
CASSANDRA-18102? If we leave it as it is but design the vtable with "globally 
unique snapshot names" in mind, then until we fix this issue snapshots with the 
same name would be overwritten on top of each other.

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L4266
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L4317
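
(For illustration, a minimal sketch of what a node-wide uniqueness check could 
look like. Keyspace.all(), getColumnFamilyStores() and snapshotExists() are 
existing APIs, but the method itself and its wiring are hypothetical and are not 
the code behind the links above.)

    // Hypothetical helper: reject a snapshot tag that is already used by any
    // table on this node, so a tag always identifies one logical snapshot.
    private static void verifyTagIsUnusedOnThisNode(String tag)
    {
        for (Keyspace keyspace : Keyspace.all())
            for (ColumnFamilyStore cfs : keyspace.getColumnFamilyStores())
                if (cfs.snapshotExists(tag))
                    throw new IllegalArgumentException("Snapshot " + tag + " already exists in "
                                                       + keyspace.getName() + '.' + cfs.name);
    }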


From: Paulo Motta 
Sent: Monday, March 6, 2023 5:59
To: Cassandra DEV
Subject: [DISCUSS] Should separate snapshots with the same name be allowed in 
different tables?


Hi,

It's possible to create a snapshot on a set of tables with the same name/tag with:
$ nodetool snapshot --kt-list ks.tb,system.local -t mysnapshot
$ nodetool listsnapshots
[..]
mysnapshot  system  local  1.16 KiB   21.47 KiB  2023-03-02T13:19:13.757Z
mysnapshot  ks      tb     1.02 KiB   6.08 KiB   2023-03-02T13:19:13.757Z

It's also possible to create a snapshot with an existing name, as long as it's 
in a different table:
$ nodetool snapshot --kt-list ks.tb2 -t mysnapshot
$ nodetool listsnapshots
[..]
mysnapshot  ks      tb2    1.16 KiB   21.47 KiB  2023-03-02T13:19:42.140Z
mysnapshot  ks      tb     1.02 KiB   6.08 KiB   2023-03-02T13:19:13.757Z
mysnapshot  system  local  107 bytes  6.98 KiB   2023-03-02T13:19:13.757Z

I found this a bit surprising, since I would expect a snapshot with the same 
name on different tables to represent the same logical snapshot taken at the 
same point-in-time.

This affects the design of the snapshots virtual table schema on 
CASSANDRA-18102[1] because the timestamp needs to be included in the primary 
key to allow distinguishing between different snapshots with the same name, 
which seems a bit counter-intuitive.

I think we should disallow creating a snapshot if another snapshot already 
exists with the same name in any other table, so snapshots with the same name 
on different tables are guaranteed to belong to the same logical point-in-time 
snapshot.

Does anyone think it's useful to be able to create separate snapshots with the 
same names across different tables and we should retain this support?

Let me know what you think,

Paulo

[1] - https://issues.apache.org/jira/browse/CASSANDRA-18102



Re: [EXTERNAL] [DISCUSS] Next release date

2023-03-07 Thread Sam Tunnicliffe
Currently, we anticipate CEP-21 being in a mergeable state around late 
July/August. We intend to push a feature branch with an implementation covering 
much of the core functionality to the ASF repo this week. Doing so will help us 
get a better idea of the remaining work as we incorporate feedback from the 
community, and will let us document that outstanding effort more clearly. 


> On 6 Mar 2023, at 09:24, Benjamin Lerer  wrote:
> 
> Sorry, I realized that when I started the discussion I probably did not frame 
> it enough as I see that it is now going into different directions.
> The concerns I am seeing are:
> 1) A too small amount of time between releases  is inefficient from a 
> development perspective and from a user perspective. From a development point 
> of view because we are missing time to deliver some features. From a user 
> perspective because they cannot follow with the upgrade.
> 2) Some features are so anticipated (Accord being the one mentioned) that 
> people would prefer to delay the release to make sure that it is available as 
> soon as possible.
> 3) We do not know how long we need to go from the freeze to GA. We hope for 2 
> months but our last experience was 6 months. So delaying the release could 
> mean not releasing this year.
> 4) For people doing marketing it is really hard to promote a product when you 
> do not know when the release will come and what features might be there.
> 
> All those concerns are probably even made worse by the fact that we do not 
> have a clear visibility on where we are.
> 
> Should we clarify that part first by getting an idea of the status of the 
> different CEPs and other big pieces of work? From there we could agree on 
> some timeline for the freeze. We could then discuss how to make predictable 
> the time from freeze to GA.  
> 
> 
> 
> On Sat, 4 Mar 2023 at 18:14, Josh McKenzie  wrote:
>> (for convenience sake, I'm referring to both Major and Minor semver releases 
>> as "major" in this email)
>> 
>>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I would 
>>> advocate to delay until this has sufficient quality to be in production. 
>> This approach can be pretty unpredictable in this domain; often unforeseen 
>> things come up in implementation that can give you a long tail on something 
>> being production ready. For the record - I don't intend to single Accord out 
>> at all on this front, quite the opposite given how much rigor's gone into 
>> the design and implementation. I'm just thinking from my personal 
>> experience: everything I've worked on, overseen, or followed closely on this 
>> codebase always has a few tricks up its sleeve along the way to having 
>> edge-cases stabilized.
>> 
>> Much like on some other recent topics, I think there's a nuanced middle 
>> ground where we take things on a case-by-case basis. Some factors that have 
>> come up in this thread that resonated with me:
>> 
>> For a given potential release date 'X':
>> 1. How long has it been since the last release?
>> 2. How long do we expect qualification to take from a "freeze" (i.e. no new 
>> improvement or features, branch) point?
>> 3. What body of merged production ready work is available?
>> 4. What body of new work do we have high confidence will be ready within Y 
>> time?
>> 
>> I think it's worth defining a loose "minimum bound and upper bound" on 
>> release cycles we want to try and stick with barring extenuating 
>> circumstances. For instance: try not to release sooner than maybe 10 months 
>> out from a prior major, and try not to release later than 18 months out from 
>> a prior major. Make exceptions if truly exceptional things land, are about 
>> to land, or bugs are discovered around those boundaries.
>> 
>> Applying the above framework to what we have in flight, our last release 
>> date, expectations on CI, etc - targeting an early fall freeze (pending CEP 
>> status) and mid to late fall or December release "feels right" to me.
>> 
>> With the exception, of course, that if something merges earlier, is stable, 
>> and we feel is valuable enough to cut a major based on that, we do it.
>> 
>> ~Josh
>> 
>> On Fri, Mar 3, 2023, at 7:37 PM, German Eichberger via dev wrote:
>>> Hi,
>>> 
>>> We shouldn't release just for releases sake. Are there enough new features 
>>> and are they working well enough (quality!).
>>> 
>>> The big feature from our perspective for 5.0 is ACCORD (CEP-15) and I would 
>>> advocate to delay until this has sufficient quality to be in production. 
>>> 
>>> Just because something is released doesn't mean anyone is gonna use it. To 
>>> add some operator perspective: Every time there is a new release we need to 
>>> decide 
>>> 1) are we supporting it 
>>> 2) which other release can we deprecate 
>>> 
>>> and potentially migrate people - which is also a tough sell if there are no 
>>> significant features and/or breaking changes.  So from my perspectiv

Re: [EXTERNAL] [DISCUSS] Next release date

2023-03-07 Thread Mick Semb Wever
On Tue, 7 Mar 2023 at 11:20, Sam Tunnicliffe  wrote:

> Currently, we anticipate CEP-21 being in a mergeable state around late
> July/August.
>


For me this is a more important reason to delay the branch date than
CEP-15, it being such a foundational change. Also, this is the first
feedback we've had that any CEP won't land by May.

Thank you Sam (and German) for the directness in your posts.

My remaining concern is the unknown branch-to-GA time, and the real risk of not
seeing a GA release (with highly anticipated features) land this year. I hope
that delaying the branch date is accompanied by broad commitments to fixing
flaky tests, improving QA/CI, and so on, so that our hope of a two-month GA
journey and a more stable trunk is realised.


Re: [DISCUSS] Should separate snapshots with the same name be allowed in different tables?

2023-03-07 Thread Paulo Motta
> Is it right to assume that we need to address this first in order to
implement 18102? If we leave it as it is and we implement vtable with
"unique snapshot name globally" in mind and we design that vtable like
that, until we fix this issue, snapshots would be overwritten on top of
each other if their names are same.

The vtable will just not include the timestamp in the primary key, since
all table snapshots will have the same timestamp. Currently it's not
possible to overwrite an existing snapshot if it already exists in a table.
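
(For illustration, a rough sketch of what that could look like with Cassandra's 
virtual-table API; the column names and types below are guesses for the sake of 
the example, not the actual CASSANDRA-18102 schema.)

    // Illustrative only: a snapshots vtable keyed by (name, keyspace_name,
    // table_name), with no timestamp in the primary key because a snapshot name
    // identifies a single point in time across tables.
    TableMetadata metadata =
        TableMetadata.builder("system_views", "snapshots")
                     .kind(TableMetadata.Kind.VIRTUAL)
                     .partitioner(new LocalPartitioner(UTF8Type.instance))
                     .addPartitionKeyColumn("name", UTF8Type.instance)
                     .addClusteringColumn("keyspace_name", UTF8Type.instance)
                     .addClusteringColumn("table_name", UTF8Type.instance)
                     .addRegularColumn("true_size", LongType.instance)
                     .addRegularColumn("size_on_disk", LongType.instance)
                     .addRegularColumn("created_at", TimestampType.instance)
                     .build();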

On Tue, Mar 7, 2023 at 4:30 AM Miklosovic, Stefan <
stefan.mikloso...@netapp.com> wrote:

> I agree too. Given the fact that the method checking the uniqueness of a
> snapshot name was implemented first, it seems to me that the second method
> which is not checking it just forgot to do that rather than intentionally
> doing it like that.
>
> Is it right to assume that we need to address this first in order to
> implement 18102? If we leave it as it is and we implement vtable with
> "unique snapshot name globally" in mind and we design that vtable like
> that, until we fix this issue, snapshots would be overwritten on top of
> each other if their names are same.
>
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L4266
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L4317
>
> 
> From: Paulo Motta 
> Sent: Monday, March 6, 2023 5:59
> To: Cassandra DEV
> Subject: [DISCUSS] Should separate snapshots with the same name be allowed
> in different tables?
>
>
> Hi,
>
> It's possible to create a snapshot on a set of tables with the same name
> name/tag with:
> $ nodetool snapshot --kt-list ks.tb,system.local -t mysnapshot
> $ nodetool listsnapshots
> [..]
> mysnapshot  system  local  1.16 KiB   21.47 KiB  2023-03-02T13:19:13.757Z
> mysnapshot  ks      tb     1.02 KiB   6.08 KiB   2023-03-02T13:19:13.757Z
>
> It's also possible to create a snapshot with an existing name, as long as
> it's in a different table:
> $ nodetool snapshot --kt-list ks.tb2 -t mysnapshot
> $ nodetool listsnapshots
> [..]
> mysnapshot  ks      tb2    1.16 KiB   21.47 KiB  2023-03-02T13:19:42.140Z
> mysnapshot  ks      tb     1.02 KiB   6.08 KiB   2023-03-02T13:19:13.757Z
> mysnapshot  system  local  107 bytes  6.98 KiB   2023-03-02T13:19:13.757Z
>
> I found this a bit surprising, since I would expect a snapshot with the
> same name on different tables to represent the same logical snapshot taken
> at the same point-in-time.
>
> This affects the design of the snapshots virtual table schema on
> CASSANDRA-18102[1] because the timestamp needs to be included in the
> primary key to allow distinguishing between different snapshots with the
> same name, which seems a bit counter-intuitive.
>
> I think we should disallow creating a snapshot if another snapshot already
> exists with the same name in any other table, so snapshots with the same
> name on different tables are guaranteed to belong to the same logical
> point-in-time snapshot.
>
> Does anyone think it's useful to be able to create separate snapshots with
> the same names across different tables and we should retain this support?
>
> Let me know what do you think,
>
> Paulo
>
> [1] - https://issues.apache.org/jira/browse/CASSANDRA-18102
>
>


Removal of DateTieredCompactionStrategy in 5.0

2023-03-07 Thread Miklosovic, Stefan
Hi list,

I want to highlight this ticket (1). I was waiting for trunk to officially be on 
version "5.0" so we can get rid of this compaction strategy, which was deprecated 
in 3.8. Since it stayed deprecated for a whole major (4.0), I think it is eligible 
for actual removal in the next major, which is 5.0.

Are people OK with the removal of DTCS in 5.0?

(1) https://issues.apache.org/jira/browse/CASSANDRA-18043
(2) https://github.com/apache/cassandra/pull/2015

Regards

Re: Removal of DateTieredCompactionStrategy in 5.0

2023-03-07 Thread Mick Semb Wever
>
> Are people OK with the removal of DTCS in 5.0?
>


Yes.


Re: Removal of DateTieredCompactionStrategy in 5.0

2023-03-07 Thread Brandon Williams
Yes.

Kind Regards,
Brandon

On Tue, Mar 7, 2023 at 9:14 AM Miklosovic, Stefan
 wrote:
>
> Hi list,
>
> I want to highlight this ticket (1). I was waiting for trunk being on version 
> "5.0" officially so we can get rid of this compaction strategy which was 
> deprecated in 3.8. If we waited one major with deprecated strategy (4.0), I 
> think it is eligible for the actual deletion in the next major which is 5.0.
>
> Are people OK with the removal of DTCS in 5.0?
>
> (1) https://issues.apache.org/jira/browse/CASSANDRA-18043
> (2) https://github.com/apache/cassandra/pull/2015
>
> Regards


Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Miklosovic, Stefan
Thanks everybody for the feedback.

I think that emitting a warning upon keyspace creation (and alteration) should be 
enough for starters. If somebody cannot live without a 100% bullet-proof solution, 
over time we might choose one of the offered approaches; as the saying goes, there 
is no silver bullet. If we decide to implement that new strategy, we would probably 
emit warnings on NTS anyway, but that part would already be done, so only the new 
strategy would remain to be provided.
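
(For illustration, a minimal sketch of the kind of check such a warning could hook 
into on CREATE/ALTER KEYSPACE. ClientWarn is an existing class, but the surrounding 
variables and the exact wording are hypothetical.)

    // Hypothetical guardrail-style check: warn when a DC's replication factor
    // exceeds its rack count, since a single rack may then hold a quorum.
    if (rackCount > 1 && replicationFactor > rackCount)
        ClientWarn.instance.warn(String.format("Keyspace %s uses RF %d in datacenter %s which has only %d racks; " +
                                               "losing a single rack may make QUORUM unavailable.",
                                               keyspaceName, replicationFactor, datacenter, rackCount));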


From: Paulo Motta 
Sent: Monday, March 6, 2023 17:48
To: dev@cassandra.apache.org
Subject: Re: Degradation of availability when using NTS and RF > number of racks


It's a bit unfortunate that NTS does not maintain the ability to lose a rack 
without loss of quorum for RF > #racks > 2, since this can be easily achieved 
by evenly placing replicas across all racks.

Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, can't 
we just use the new correct placement logic for newly created keyspaces instead 
of having a new strategy?

The placement logic would be backwards-compatible for RF <= #racks. On upgrade, 
we could mark existing keyspaces with RF > #racks with 
use_legacy_replica_placement=true to maintain backwards compatibility and log a 
warning that the rack loss guarantee is not maintained for keyspaces created 
before the fix. Old keyspaces with RF <=#racks would still work with the new 
replica placement. The downside is that we would need to keep the old NTS logic 
around, or we could eventually deprecate it and require users to migrate 
keyspaces using the legacy placement strategy.
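
(For illustration, a minimal sketch of how such an option could gate the two 
behaviours inside NTS; "use_legacy_replica_placement" and both helper methods are 
hypothetical names taken from this proposal, not existing Cassandra code.)

    // Hypothetical: pick the placement algorithm per keyspace from a replication
    // option, defaulting to the corrected behaviour.
    boolean legacy = Boolean.parseBoolean(configOptions.getOrDefault("use_legacy_replica_placement", "false"));
    return legacy
           ? legacyCalculateNaturalReplicas(searchToken, tokenMetadata)    // old NTS placement
           : rackEvenCalculateNaturalReplicas(searchToken, tokenMetadata); // new, rack-even placement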

Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace 
creation for RF > #racks and indicate users to use RackAwareTopologyStrategy to 
maintain the quorum guarantee on rack loss or set an override flag 
"support_quorum_on_rack_loss=false". This feels a bit iffy though since it 
could potentially confuse users about when to use each strategy.

On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan <stefan.mikloso...@netapp.com> wrote:
Hi all,

some time ago we identified an issue with NetworkTopologyStrategy. The problem 
is that when RF > number of racks, it may happen that NTS places replicas in 
such a way that when whole rack is lost, we lose QUORUM and data are not 
available anymore if QUORUM CL is used.

To illustrate this problem, lets have this setup:

9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place 
replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in 
rack3. Hence, when rack1 is lost, we do not have QUORUM.

It seems to us that there is already some logic around this scenario (1) but 
the implementation is not entirely correct. This solution is not computing the 
replica placement correctly so the above problem would be addressed.

We created a draft here (2, 3) which fixes it.

There is also a test which simulates this scenario. When I assign 256 tokens to 
each node randomly (by same mean as generatetokens command uses) and I try to 
compute natural replicas for 1 billion random tokens and I compute how many 
cases there will be when 3 replicas out of 5 are inserted in the same rack (so 
by losing it we would lose quorum), for above setup I get around 6%.

For 12 nodes, 3 racks, 4 nodes per rack, rf = 5, this happens in 10% cases.

To interpret this number, it basically means that with such topology, RF and 
CL, when a random rack fails completely, when doing a random read, there is 6% 
chance that data will not be available (or 10%, respectively).

One caveat here is that NTS is not compatible with this new strategy anymore 
because it will place replicas differently. So I guess that fixing this in NTS 
will not be possible because of upgrades. I think people would need to setup 
completely new keyspace and somehow migrate data if they wish or they just 
start from scratch with this strategy.

Questions:

1) do you think this is meaningful to fix and it might end up in trunk?

2) should not we just ban this scenario entirely? It might be possible to check 
the configuration upon keyspace creation (rf > num of racks) and if we see this 
is problematic we would just fail that query? Guardrail maybe?

3) people in the ticket mention writing "CEP" for this but I do not see any 
reason to do so. It is just a strategy as any other. What would that CEP would 
even be about? Is this necessary?

Regards

(1) 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L126-L128
(2) https://github.com/apache/cassandra/pull/2191
(3) https://issues.apache.org/jira/browse/CASSANDRA-16203
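
(For illustration, a minimal, self-contained sketch of the "spread replicas evenly 
across racks" idea discussed above, using plain strings instead of Cassandra's 
locator types. It is not the CASSANDRA-16203 patch, and a real implementation must 
also handle racks that have fewer nodes than the per-rack cap, which is omitted 
here.)

    import java.util.*;

    final class RackEvenPlacementSketch
    {
        // Walk the ring order for a token and take at most ceil(rf / racks)
        // replicas from any single rack. For the 9-node / 3-rack / RF=5 example
        // above this yields a 2-2-1 split, so no single rack holds 3 replicas.
        static List<String> pickReplicas(List<String> ringOrder, Map<String, String> rackOf, int rf)
        {
            int rackCount = new HashSet<>(rackOf.values()).size();
            int perRackCap = (rf + rackCount - 1) / rackCount; // ceil(rf / rackCount)
            Map<String, Integer> used = new HashMap<>();
            List<String> replicas = new ArrayList<>();
            for (String node : ringOrder)
            {
                if (replicas.size() == rf)
                    break;
                String rack = rackOf.get(node);
                int count = used.getOrDefault(rack, 0);
                if (count < perRackCap)
                {
                    replicas.add(node);
                    used.put(rack, count + 1);
                }
            }
            return replicas;
        }
    }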


Re: Removal of DateTieredCompactionStrategy in 5.0

2023-03-07 Thread Ekaterina Dimitrova
Yes and thank you!

On Tue, 7 Mar 2023 at 10:28, Brandon Williams  wrote:

> Yes.
>
> Kind Regards,
> Brandon
>
> On Tue, Mar 7, 2023 at 9:14 AM Miklosovic, Stefan
>  wrote:
> >
> > Hi list,
> >
> > I want to highlight this ticket (1). I was waiting for trunk being on
> version "5.0" officially so we can get rid of this compaction strategy
> which was deprecated in 3.8. If we waited one major with deprecated
> strategy (4.0), I think it is eligible for the actual deletion in the next
> major which is 5.0.
> >
> > Are people OK with the removal of DTCS in 5.0?
> >
> > (1) https://issues.apache.org/jira/browse/CASSANDRA-18043
> > (2) https://github.com/apache/cassandra/pull/2015
> >
> > Regards
>


Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Benedict
My view is that this is a pretty serious bug. I wonder if transactional metadata 
will make it possible to safely fix this for users without rebuilding (only via 
opt-in, of course).

> On 7 Mar 2023, at 15:54, Miklosovic, Stefan  
> wrote:
> 
> Thanks everybody for the feedback.
> 
> I think that emitting a warning upon keyspace creation (and alteration) 
> should be enough for starters. If somebody can not live without 100% bullet 
> proof solution over time we might choose some approach from the offered ones. 
> As the saying goes there is no silver bullet. If we decide to implement that 
> new strategy, we would probably emit warnings anyway on NTS but it would be 
> already done so just new strategy would be provided.
> 
> 
> From: Paulo Motta 
> Sent: Monday, March 6, 2023 17:48
> To: dev@cassandra.apache.org
> Subject: Re: Degradation of availability when using NTS and RF > number of 
> racks
> 
> 
> It's a bit unfortunate that NTS does not maintain the ability to lose a rack 
> without loss of quorum for RF > #racks > 2, since this can be easily achieved 
> by evenly placing replicas across all racks.
> 
> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, 
> can't we just use the new correct placement logic for newly created keyspaces 
> instead of having a new strategy?
> 
> The placement logic would be backwards-compatible for RF <= #racks. On 
> upgrade, we could mark existing keyspaces with RF > #racks with 
> use_legacy_replica_placement=true to maintain backwards compatibility and log 
> a warning that the rack loss guarantee is not maintained for keyspaces 
> created before the fix. Old keyspaces with RF <=#racks would still work with 
> the new replica placement. The downside is that we would need to keep the old 
> NTS logic around, or we could eventually deprecate it and require users to 
> migrate keyspaces using the legacy placement strategy.
> 
> Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace 
> creation for RF > #racks and indicate users to use RackAwareTopologyStrategy 
> to maintain the quorum guarantee on rack loss or set an override flag 
> "support_quorum_on_rack_loss=false". This feels a bit iffy though since it 
> could potentially confuse users about when to use each strategy.
> 
> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan 
> mailto:stefan.mikloso...@netapp.com>> wrote:
> Hi all,
> 
> some time ago we identified an issue with NetworkTopologyStrategy. The 
> problem is that when RF > number of racks, it may happen that NTS places 
> replicas in such a way that when whole rack is lost, we lose QUORUM and data 
> are not available anymore if QUORUM CL is used.
> 
> To illustrate this problem, lets have this setup:
> 
> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place 
> replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in 
> rack3. Hence, when rack1 is lost, we do not have QUORUM.
> 
> It seems to us that there is already some logic around this scenario (1) but 
> the implementation is not entirely correct. This solution is not computing 
> the replica placement correctly so the above problem would be addressed.
> 
> We created a draft here (2, 3) which fixes it.
> 
> There is also a test which simulates this scenario. When I assign 256 tokens 
> to each node randomly (by same mean as generatetokens command uses) and I try 
> to compute natural replicas for 1 billion random tokens and I compute how 
> many cases there will be when 3 replicas out of 5 are inserted in the same 
> rack (so by losing it we would lose quorum), for above setup I get around 6%.
> 
> For 12 nodes, 3 racks, 4 nodes per rack, rf = 5, this happens in 10% cases.
> 
> To interpret this number, it basically means that with such topology, RF and 
> CL, when a random rack fails completely, when doing a random read, there is 
> 6% chance that data will not be available (or 10%, respectively).
> 
> One caveat here is that NTS is not compatible with this new strategy anymore 
> because it will place replicas differently. So I guess that fixing this in 
> NTS will not be possible because of upgrades. I think people would need to 
> setup completely new keyspace and somehow migrate data if they wish or they 
> just start from scratch with this strategy.
> 
> Questions:
> 
> 1) do you think this is meaningful to fix and it might end up in trunk?
> 
> 2) should not we just ban this scenario entirely? It might be possible to 
> check the configuration upon keyspace creation (rf > num of racks) and if we 
> see this is problematic we would just fail that query? Guardrail maybe?
> 
> 3) people in the ticket mention writing "CEP" for this but I do not see any 
> reason to do so. It is just a strategy as any other.

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Jeremiah D Jordan
I agree with Paulo; it would be nice if we could figure out some way to make the 
new NTS work correctly, with a parameter to fall back to the “bad” behavior, so 
that people restoring backups to a new cluster can get the right behavior to 
match their backups.
The problem with only fixing this in a new strategy is that we have a ton of 
tutorials and docs out there which tell people to use NTS, so it would be great 
if we could keep “use NTS” as the recommendation.  Throwing a warning when 
someone uses NTS is kind of user hostile.  If someone just read some tutorial 
or doc which told them “make your keyspace this way”, and then when they do 
that the database yells at them telling them they did it wrong, it is not a 
great experience.

-Jeremiah

> On Mar 7, 2023, at 10:16 AM, Benedict  wrote:
> 
> My view is that if this is a pretty serious bug. I wonder if transactional 
> metadata will make it possible to safely fix this for users without 
> rebuilding (only via opt-in, of course).
> 
>> On 7 Mar 2023, at 15:54, Miklosovic, Stefan  
>> wrote:
>> 
>> Thanks everybody for the feedback.
>> 
>> I think that emitting a warning upon keyspace creation (and alteration) 
>> should be enough for starters. If somebody can not live without 100% bullet 
>> proof solution over time we might choose some approach from the offered 
>> ones. As the saying goes there is no silver bullet. If we decide to 
>> implement that new strategy, we would probably emit warnings anyway on NTS 
>> but it would be already done so just new strategy would be provided.
>> 
>> 
>> From: Paulo Motta 
>> Sent: Monday, March 6, 2023 17:48
>> To: dev@cassandra.apache.org
>> Subject: Re: Degradation of availability when using NTS and RF > number of 
>> racks
>> 
>> 
>> It's a bit unfortunate that NTS does not maintain the ability to lose a rack 
>> without loss of quorum for RF > #racks > 2, since this can be easily 
>> achieved by evenly placing replicas across all racks.
>> 
>> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, 
>> can't we just use the new correct placement logic for newly created 
>> keyspaces instead of having a new strategy?
>> 
>> The placement logic would be backwards-compatible for RF <= #racks. On 
>> upgrade, we could mark existing keyspaces with RF > #racks with 
>> use_legacy_replica_placement=true to maintain backwards compatibility and 
>> log a warning that the rack loss guarantee is not maintained for keyspaces 
>> created before the fix. Old keyspaces with RF <=#racks would still work with 
>> the new replica placement. The downside is that we would need to keep the 
>> old NTS logic around, or we could eventually deprecate it and require users 
>> to migrate keyspaces using the legacy placement strategy.
>> 
>> Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace 
>> creation for RF > #racks and indicate users to use RackAwareTopologyStrategy 
>> to maintain the quorum guarantee on rack loss or set an override flag 
>> "support_quorum_on_rack_loss=false". This feels a bit iffy though since it 
>> could potentially confuse users about when to use each strategy.
>> 
>> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> wrote:
>> Hi all,
>> 
>> some time ago we identified an issue with NetworkTopologyStrategy. The 
>> problem is that when RF > number of racks, it may happen that NTS places 
>> replicas in such a way that when whole rack is lost, we lose QUORUM and data 
>> are not available anymore if QUORUM CL is used.
>> 
>> To illustrate this problem, lets have this setup:
>> 
>> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place 
>> replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in 
>> rack3. Hence, when rack1 is lost, we do not have QUORUM.
>> 
>> It seems to us that there is already some logic around this scenario (1) but 
>> the implementation is not entirely correct. This solution is not computing 
>> the replica placement correctly so the above problem would be addressed.
>> 
>> We created a draft here (2, 3) which fixes it.
>> 
>> There is also a test which simulates this scenario. When I assign 256 tokens 
>> to each node randomly (by same mean as generatetokens command uses) and I 
>> try to compute natural replicas for 1 billion random tokens and I compute 
>> how many cases there will be when 3 replicas out of 5 are inserted in the 
>> same rack (so by losing it we would lose quorum), for above setup I get 
>> around 6%.
>> 
>> For 12 nodes, 3 racks, 4 nodes per rack, rf = 5, this happens in 10% cases.
>> 
>> To interpret this number, it basically means that with such topology, RF and 
>> CL, when a random rack fails completely, when doing a random read, there is 

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Derek Chen-Becker
I think that the warning would only be thrown in the case where a
potentially QUORUM-busting configuration is used. I think it would be a
worse experience to not warn and let the user discover later when they
can't write at QUORUM.

Cheers,

Derek

On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan 
wrote:

> I agree with Paulo, it would be nice if we could figure out some way to
> make new NTS work correctly, with a parameter to fall back to the “bad”
> behavior, so that people restoring backups to a new cluster can get the
> right behavior to match their backups.
> The problem with only fixing this in a new strategy is we have a ton of
> tutorials and docs out there which tell people to use NTS, so it would be
> great if we could keep “use NTS” as the recommendation.  Throwing a warning
> when someone uses NTS is kind of user hostile.  If someone just read some
> tutorial or doc which told them “make your key space this way” and then
> when they do that the database yells at them telling them they did it
> wrong, it is not a great experience.
>
> -Jeremiah
>
> > On Mar 7, 2023, at 10:16 AM, Benedict  wrote:
> >
> > My view is that if this is a pretty serious bug. I wonder if
> transactional metadata will make it possible to safely fix this for users
> without rebuilding (only via opt-in, of course).
> >
> >> On 7 Mar 2023, at 15:54, Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >>
> >> Thanks everybody for the feedback.
> >>
> >> I think that emitting a warning upon keyspace creation (and alteration)
> should be enough for starters. If somebody can not live without 100% bullet
> proof solution over time we might choose some approach from the offered
> ones. As the saying goes there is no silver bullet. If we decide to
> implement that new strategy, we would probably emit warnings anyway on NTS
> but it would be already done so just new strategy would be provided.
> >>
> >> 
> >> From: Paulo Motta 
> >> Sent: Monday, March 6, 2023 17:48
> >> To: dev@cassandra.apache.org
> >> Subject: Re: Degradation of availability when using NTS and RF > number
> of racks
> >>
> >>
> >> It's a bit unfortunate that NTS does not maintain the ability to lose a
> rack without loss of quorum for RF > #racks > 2, since this can be easily
> achieved by evenly placing replicas across all racks.
> >>
> >> Since RackAwareTopologyStrategy is a superset of
> NetworkTopologyStrategy, can't we just use the new correct placement logic
> for newly created keyspaces instead of having a new strategy?
> >>
> >> The placement logic would be backwards-compatible for RF <= #racks. On
> upgrade, we could mark existing keyspaces with RF > #racks with
> use_legacy_replica_placement=true to maintain backwards compatibility and
> log a warning that the rack loss guarantee is not maintained for keyspaces
> created before the fix. Old keyspaces with RF <=#racks would still work
> with the new replica placement. The downside is that we would need to keep
> the old NTS logic around, or we could eventually deprecate it and require
> users to migrate keyspaces using the legacy placement strategy.
> >>
> >> Alternatively we could have RackAwareTopologyStrategy and fail NTS
> keyspace creation for RF > #racks and indicate users to use
> RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss or
> set an override flag "support_quorum_on_rack_loss=false". This feels a bit
> iffy though since it could potentially confuse users about when to use each
> strategy.
> >>
> >> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan <
> stefan.mikloso...@netapp.com> wrote:
> >> Hi all,
> >>
> >> some time ago we identified an issue with NetworkTopologyStrategy. The
> problem is that when RF > number of racks, it may happen that NTS places
> replicas in such a way that when whole rack is lost, we lose QUORUM and
> data are not available anymore if QUORUM CL is used.
> >>
> >> To illustrate this problem, lets have this setup:
> >>
> >> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could
> place replicas like this: 3 replicas in rack1, 1 replica in rack2, 1
> replica in rack3. Hence, when rack1 is lost, we do not have QUORUM.
> >>
> >> It seems to us that there is already some logic around this scenario
> (1) but the implementation is not entirely correct. This solution is not
> computing the replica placement correctly so the above problem would be
> addressed.
> >>
> >> We created a draft here (2, 3) which fixes it.
> >>
> >> There is also a test which simulates this scenario. When I assign 256
> tokens to each node randomly (by same mean as generatetokens command uses)
> and I try to compute natural replicas for 1 billion random tokens and I
> compute how many cases there will be 

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Aaron Ploetz
> I think it would be a worse experience to not warn and let the user
discover later when they can't write at QUORUM.

Agree.

Should we add a note in the cassandra.yaml comments as well?  Maybe in the
spot where default_keyspace_rf is defined?  On the other hand, that section
is pretty "wordy" already.  But calling it out in the yaml might not be a
bad idea.

Thanks,

Aaron


On Tue, Mar 7, 2023 at 11:12 AM Derek Chen-Becker 
wrote:

> I think that the warning would only be thrown in the case where a
> potentially QUORUM-busting configuration is used. I think it would be a
> worse experience to not warn and let the user discover later when they
> can't write at QUORUM.
>
> Cheers,
>
> Derek
>
> On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan <
> jeremiah.jor...@gmail.com> wrote:
>
>> I agree with Paulo, it would be nice if we could figure out some way to
>> make new NTS work correctly, with a parameter to fall back to the “bad”
>> behavior, so that people restoring backups to a new cluster can get the
>> right behavior to match their backups.
>> The problem with only fixing this in a new strategy is we have a ton of
>> tutorials and docs out there which tell people to use NTS, so it would be
>> great if we could keep “use NTS” as the recommendation.  Throwing a warning
>> when someone uses NTS is kind of user hostile.  If someone just read some
>> tutorial or doc which told them “make your key space this way” and then
>> when they do that the database yells at them telling them they did it
>> wrong, it is not a great experience.
>>
>> -Jeremiah
>>
>> > On Mar 7, 2023, at 10:16 AM, Benedict  wrote:
>> >
>> > My view is that if this is a pretty serious bug. I wonder if
>> transactional metadata will make it possible to safely fix this for users
>> without rebuilding (only via opt-in, of course).
>> >
>> >> On 7 Mar 2023, at 15:54, Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> >>
>> >> Thanks everybody for the feedback.
>> >>
>> >> I think that emitting a warning upon keyspace creation (and
>> alteration) should be enough for starters. If somebody can not live without
>> 100% bullet proof solution over time we might choose some approach from the
>> offered ones. As the saying goes there is no silver bullet. If we decide to
>> implement that new strategy, we would probably emit warnings anyway on NTS
>> but it would be already done so just new strategy would be provided.
>> >>
>> >> 
>> >> From: Paulo Motta 
>> >> Sent: Monday, March 6, 2023 17:48
>> >> To: dev@cassandra.apache.org
>> >> Subject: Re: Degradation of availability when using NTS and RF >
>> number of racks
>> >>
>> >>
>> >> It's a bit unfortunate that NTS does not maintain the ability to lose
>> a rack without loss of quorum for RF > #racks > 2, since this can be easily
>> achieved by evenly placing replicas across all racks.
>> >>
>> >> Since RackAwareTopologyStrategy is a superset of
>> NetworkTopologyStrategy, can't we just use the new correct placement logic
>> for newly created keyspaces instead of having a new strategy?
>> >>
>> >> The placement logic would be backwards-compatible for RF <= #racks. On
>> upgrade, we could mark existing keyspaces with RF > #racks with
>> use_legacy_replica_placement=true to maintain backwards compatibility and
>> log a warning that the rack loss guarantee is not maintained for keyspaces
>> created before the fix. Old keyspaces with RF <=#racks would still work
>> with the new replica placement. The downside is that we would need to keep
>> the old NTS logic around, or we could eventually deprecate it and require
>> users to migrate keyspaces using the legacy placement strategy.
>> >>
>> >> Alternatively we could have RackAwareTopologyStrategy and fail NTS
>> keyspace creation for RF > #racks and indicate users to use
>> RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss or
>> set an override flag "support_quorum_on_rack_loss=false". This feels a bit
>> iffy though since it could potentially confuse users about when to use each
>> strategy.
>> >>
>> >> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> >> Hi all,
>> >>
>> >> some time ago we identified an issue with NetworkTopologyStrategy. The
>> problem is that when RF > number of racks, it may happen that NTS places
>> replicas in such a way that when whole rack is lost, we lose QUORUM and
>> data are not available anymore if QUORUM CL is used.
>> >>
>> >> To illustrate this problem, lets have this setup:
>> >>
>> >> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could
>> place replicas like this: 3 replicas in rack1, 1 replica in rack2, 1
>> replica in rack3. Hence, when rack1 is lost, we do not have QUORUM

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Jeremiah D Jordan
Right, that's why I said we should make NTS do the right thing, rather than 
throwing a warning. Doing the right thing, and not getting a warning, is the 
best behavior.

> On Mar 7, 2023, at 11:12 AM, Derek Chen-Becker  wrote:
> 
> I think that the warning would only be thrown in the case where a potentially 
> QUORUM-busting configuration is used. I think it would be a worse experience 
> to not warn and let the user discover later when they can't write at QUORUM.
> 
> Cheers,
> 
> Derek
> 
> On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan  > wrote:
>> I agree with Paulo, it would be nice if we could figure out some way to make 
>> new NTS work correctly, with a parameter to fall back to the “bad” behavior, 
>> so that people restoring backups to a new cluster can get the right behavior 
>> to match their backups.
>> The problem with only fixing this in a new strategy is we have a ton of 
>> tutorials and docs out there which tell people to use NTS, so it would be 
>> great if we could keep “use NTS” as the recommendation.  Throwing a warning 
>> when someone uses NTS is kind of user hostile.  If someone just read some 
>> tutorial or doc which told them “make your key space this way” and then when 
>> they do that the database yells at them telling them they did it wrong, it 
>> is not a great experience.
>> 
>> -Jeremiah
>> 
>> > On Mar 7, 2023, at 10:16 AM, Benedict > > > wrote:
>> > 
>> > My view is that if this is a pretty serious bug. I wonder if transactional 
>> > metadata will make it possible to safely fix this for users without 
>> > rebuilding (only via opt-in, of course).
>> > 
>> >> On 7 Mar 2023, at 15:54, Miklosovic, Stefan > >> > wrote:
>> >> 
>> >> Thanks everybody for the feedback.
>> >> 
>> >> I think that emitting a warning upon keyspace creation (and alteration) 
>> >> should be enough for starters. If somebody can not live without 100% 
>> >> bullet proof solution over time we might choose some approach from the 
>> >> offered ones. As the saying goes there is no silver bullet. If we decide 
>> >> to implement that new strategy, we would probably emit warnings anyway on 
>> >> NTS but it would be already done so just new strategy would be provided.
>> >> 
>> >> 
>> >> From: Paulo Motta > >> >
>> >> Sent: Monday, March 6, 2023 17:48
>> >> To: dev@cassandra.apache.org 
>> >> Subject: Re: Degradation of availability when using NTS and RF > number 
>> >> of racks
>> >> 
>> >> 
>> >> It's a bit unfortunate that NTS does not maintain the ability to lose a 
>> >> rack without loss of quorum for RF > #racks > 2, since this can be easily 
>> >> achieved by evenly placing replicas across all racks.
>> >> 
>> >> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, 
>> >> can't we just use the new correct placement logic for newly created 
>> >> keyspaces instead of having a new strategy?
>> >> 
>> >> The placement logic would be backwards-compatible for RF <= #racks. On 
>> >> upgrade, we could mark existing keyspaces with RF > #racks with 
>> >> use_legacy_replica_placement=true to maintain backwards compatibility and 
>> >> log a warning that the rack loss guarantee is not maintained for 
>> >> keyspaces created before the fix. Old keyspaces with RF <=#racks would 
>> >> still work with the new replica placement. The downside is that we would 
>> >> need to keep the old NTS logic around, or we could eventually deprecate 
>> >> it and require users to migrate keyspaces using the legacy placement 
>> >> strategy.
>> >> 
>> >> Alternatively we could have RackAwareTopologyStrategy and fail NTS 
>> >> keyspace creation for RF > #racks and indicate users to use 
>> >> RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss 
>> >> or set an override flag "support_quorum_on_rack_loss=false". This feels a 
>> >> bit iffy though since it could potentially confuse users about when to 
>> >> use each strategy.
>> >> 
>> >> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan 
>> >> > >> > >> >> wrote:
>> >> Hi all,
>> >> 
>> >> some time ago we identified an issue with NetworkTopologyStrategy. The 
>> >> problem is that when RF > number of racks, it may happen that NTS places 
>> >> replicas in such a way that when whole rack is lost, we lose QUORUM and 
>> >> data are not available anymore if QUORUM CL is used.
>> >> 
>> >> To illustrate this problem, lets have this setup:
>> >> 
>> >> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place 
>> >> replicas li

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Miklosovic, Stefan
I am glad more people joined and expressed their opinions after my last e-mail. 
It seems to me that there is a consensus on fixing this directly in NTS and making 
it a little bit smarter about replica placement, while still keeping a way to do 
it "the old way".

There is a lot of time until 5.0, so I would say let's explore this "all logic in 
NTS" approach. I agree that having a new strategy and trying to explain the 
difference to people would be quite confusing, since everybody is pretty much used 
to NTS already.



From: Jeremiah D Jordan 
Sent: Tuesday, March 7, 2023 19:31
To: dev@cassandra.apache.org
Subject: Re: Degradation of availability when using NTS and RF > number of racks


Right, why I said we should make NTS do the right thing, rather than throwing a 
warning.  Doing the right thing, and not getting a warning, is the best 
behavior.

On Mar 7, 2023, at 11:12 AM, Derek Chen-Becker  wrote:

I think that the warning would only be thrown in the case where a potentially 
QUORUM-busting configuration is used. I think it would be a worse experience to 
not warn and let the user discover later when they can't write at QUORUM.

Cheers,

Derek

On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan 
mailto:jeremiah.jor...@gmail.com>> wrote:
I agree with Paulo, it would be nice if we could figure out some way to make 
new NTS work correctly, with a parameter to fall back to the “bad” behavior, so 
that people restoring backups to a new cluster can get the right behavior to 
match their backups.
The problem with only fixing this in a new strategy is we have a ton of 
tutorials and docs out there which tell people to use NTS, so it would be great 
if we could keep “use NTS” as the recommendation.  Throwing a warning when 
someone uses NTS is kind of user hostile.  If someone just read some tutorial 
or doc which told them “make your key space this way” and then when they do 
that the database yells at them telling them they did it wrong, it is not a 
great experience.

-Jeremiah

> On Mar 7, 2023, at 10:16 AM, Benedict 
> mailto:bened...@apache.org>> wrote:
>
> My view is that if this is a pretty serious bug. I wonder if transactional 
> metadata will make it possible to safely fix this for users without 
> rebuilding (only via opt-in, of course).
>
>> On 7 Mar 2023, at 15:54, Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> wrote:
>>
>> Thanks everybody for the feedback.
>>
>> I think that emitting a warning upon keyspace creation (and alteration) 
>> should be enough for starters. If somebody can not live without 100% bullet 
>> proof solution over time we might choose some approach from the offered 
>> ones. As the saying goes there is no silver bullet. If we decide to 
>> implement that new strategy, we would probably emit warnings anyway on NTS 
>> but it would be already done so just new strategy would be provided.
>>
>> 
>> From: Paulo Motta mailto:pauloricard...@gmail.com>>
>> Sent: Monday, March 6, 2023 17:48
>> To: dev@cassandra.apache.org
>> Subject: Re: Degradation of availability when using NTS and RF > number of 
>> racks
>>
>>
>> It's a bit unfortunate that NTS does not maintain the ability to lose a rack 
>> without loss of quorum for RF > #racks > 2, since this can be easily 
>> achieved by evenly placing replicas across all racks.
>>
>> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, 
>> can't we just use the new correct placement logic for newly created 
>> keyspaces instead of having a new strategy?
>>
>> The placement logic would be backwards-compatible for RF <= #racks. On 
>> upgrade, we could mark existing keyspaces with RF > #racks with 
>> use_legacy_replica_placement=true to maintain backwards compatibility and 
>> log a warning that the rack loss guarantee is not maintained for keyspaces 
>> created before the fix. Old keyspaces with RF <=#racks would still work with 
>> the new replica placement. The downside is that we would need to keep the 
>> old NTS logic around, or we could eventually deprecate it and require users 
>> to migrate keyspaces using the legacy placement strategy.
>>
>> Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace 
>> creation for RF > #racks and indicate users to use RackAwareTopologyStrategy 
>> to maintain the quorum guarantee on rack loss or set an override flag 
>> "support_quorum_on_rack_loss=false". This feels a bit iffy though since it 
>> could potentially confuse users about when to use each strategy.
>>
>> On Mon, Mar 6, 2023 at 5:51 AM Miklo

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Jeff Jirsa
Anyone have stats on how many people use RF > 3 per dc? (I know what it looks like in my day job but I don’t want to pretend it’s representative of the larger community) I’m a fan of fixing this but I do wonder how common this is in the wild.

On Mar 7, 2023, at 9:12 AM, Derek Chen-Becker  wrote:

I think that the warning would only be thrown in the case where a potentially QUORUM-busting configuration is used. I think it would be a worse experience to not warn and let the user discover later when they can't write at QUORUM.

Cheers,

Derek

On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan  wrote:

I agree with Paulo, it would be nice if we could figure out some way to make new NTS work correctly, with a parameter to fall back to the “bad” behavior, so that people restoring backups to a new cluster can get the right behavior to match their backups.
The problem with only fixing this in a new strategy is we have a ton of tutorials and docs out there which tell people to use NTS, so it would be great if we could keep “use NTS” as the recommendation.  Throwing a warning when someone uses NTS is kind of user hostile.  If someone just read some tutorial or doc which told them “make your key space this way” and then when they do that the database yells at them telling them they did it wrong, it is not a great experience.

-Jeremiah

> On Mar 7, 2023, at 10:16 AM, Benedict  wrote:
> 
> My view is that if this is a pretty serious bug. I wonder if transactional metadata will make it possible to safely fix this for users without rebuilding (only via opt-in, of course).
> 
>> On 7 Mar 2023, at 15:54, Miklosovic, Stefan  wrote:
>> 
>> Thanks everybody for the feedback.
>> 
>> I think that emitting a warning upon keyspace creation (and alteration) should be enough for starters. If somebody can not live without 100% bullet proof solution over time we might choose some approach from the offered ones. As the saying goes there is no silver bullet. If we decide to implement that new strategy, we would probably emit warnings anyway on NTS but it would be already done so just new strategy would be provided.
>> 
>> 
>> From: Paulo Motta 
>> Sent: Monday, March 6, 2023 17:48
>> To: dev@cassandra.apache.org
>> Subject: Re: Degradation of availability when using NTS and RF > number of racks
>> 
>> 
>> It's a bit unfortunate that NTS does not maintain the ability to lose a rack without loss of quorum for RF > #racks > 2, since this can be easily achieved by evenly placing replicas across all racks.
>> 
>> Since RackAwareTopologyStrategy is a superset of NetworkTopologyStrategy, can't we just use the new correct placement logic for newly created keyspaces instead of having a new strategy?
>> 
>> The placement logic would be backwards-compatible for RF <= #racks. On upgrade, we could mark existing keyspaces with RF > #racks with use_legacy_replica_placement=true to maintain backwards compatibility and log a warning that the rack loss guarantee is not maintained for keyspaces created before the fix. Old keyspaces with RF <=#racks would still work with the new replica placement. The downside is that we would need to keep the old NTS logic around, or we could eventually deprecate it and require users to migrate keyspaces using the legacy placement strategy.
>> 
>> Alternatively we could have RackAwareTopologyStrategy and fail NTS keyspace creation for RF > #racks and indicate users to use RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss or set an override flag "support_quorum_on_rack_loss=false". This feels a bit iffy though since it could potentially confuse users about when to use each strategy.
>> 
>> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan > wrote:
>> Hi all,
>> 
>> some time ago we identified an issue with NetworkTopologyStrategy. The problem is that when RF > number of racks, it may happen that NTS places replicas in such a way that when whole rack is lost, we lose QUORUM and data are not available anymore if QUORUM CL is used.
>> 
>> To illustrate this problem, lets have this setup:
>> 
>> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. RF = 5. Then, NTS could place replicas like this: 3 replicas in rack1, 1 replica in rack2, 1 replica in rack3. Hence, when rack1 is lost, we do not have QUORUM.
>> 
>> It seems to us that there is already some logic around this scenario (1) but the implementation is not entirely correct. This solution is not computing the replica placement correctly so the above problem would be addressed.
>> 
>> We created a draft here (2, 3) which fixes it.
>> 
>> There is also a test which simulates this scenario. When I assign 256 

Re: [DISCUSSION] Cassandra + Java 17

2023-03-07 Thread Ekaterina Dimitrova
Thanks Benjamin, please, find below my comments.

"It is not necessarily a problem as long as we do get an issue with the
Modules boundaries and their access. For me it needs to be looked at on a
case by case basis."

We can still use --add-opens/--add-exports with Java 17 (I mentioned I added some
as part of introducing experimental JDK17 support - CASSANDRA-18258), so the
concern is rather that we should do it as little as possible, since things can
break at any time: JDK internals do not guarantee backward compatibility, and
when something will break is unknown, so we need to be careful. We also had
agreement on that in [1].
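
(For illustration, a minimal, self-contained example of the failure mode described 
above; it is not Cassandra code. On JDK 17 the setAccessible call below throws 
InaccessibleObjectException unless the JVM is started with 
--add-opens java.base/java.lang=ALL-UNNAMED.)

    import java.lang.reflect.Field;

    public class AddOpensDemo
    {
        public static void main(String[] args) throws Exception
        {
            // java.lang.String.value lives in a JDK-internal package that is not
            // opened to the unnamed module by default under strong encapsulation.
            Field value = String.class.getDeclaredField("value");
            value.setAccessible(true); // InaccessibleObjectException without --add-opens
            System.out.println("reflective access granted");
        }
    }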

"It was used mainly to get around the fact that Java did not offer other
means to do certain things. Outside of trying to anticipate some of the
restrictions of that API and make sure that the JDK offers a suitable
replacement for us. I am not sure that there is much that we can do. But I
might misunderstand your question."
I think in some cases it was used for performance, too.
Some people in our community might have more history around Unsafe. I know there
were various conversations in the past between members of our community and the
JDK community regarding an eventual replacement for Unsafe, but I cannot find any
final outcome, so it is an honest question whether anyone has something more to
share here.

Now, in some of the cases where we use internals, I found there is a fallback,
which can still have implications such as being the slower option. So there are
nuances and a lot of history behind decisions taken around the Cassandra codebase,
as usual. Regarding Unsafe, with JDK17 the only concern so far is with Jamm,
because we do not use objectFieldOffset with lambdas (implemented internally as
hidden classes) in Cassandra code, or at least I haven't found such a place so far.
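
(For illustration, a minimal, self-contained example of the objectFieldOffset 
restriction mentioned above; it is not Cassandra or Jamm code. On JDK 17, 
Unsafe.objectFieldOffset rejects fields of hidden classes, which is how capturing 
lambdas are implemented.)

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class UnsafeHiddenClassDemo
    {
        static class Plain { long x; }

        public static void main(String[] args) throws Exception
        {
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            Unsafe unsafe = (Unsafe) theUnsafe.get(null);

            // Works: the offset of a field declared by an ordinary class.
            System.out.println(unsafe.objectFieldOffset(Plain.class.getDeclaredField("x")));

            // A capturing lambda is a hidden class; asking for the offset of its
            // captured field throws UnsupportedOperationException on JDK 17.
            long captured = 42L;
            Runnable r = () -> System.out.println(captured);
            for (Field f : r.getClass().getDeclaredFields())
                System.out.println(unsafe.objectFieldOffset(f));
        }
    }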

Even if we decide to go with "let's keep things as-is and look into breakages as
they come", there needs to be visibility, awareness and consideration at least
when new code is added. I've heard different opinions on this topic around the
community, honestly - whether the code should stay as-is and breakages be
addressed on a per-case basis, or not. I do not see us having exact guidance.
Thoughts?

[1] https://lists.apache.org/thread/33dt0c3kgskrzqtp4h8y411tqv2d6qvh

On Thu, 2 Mar 2023 at 7:48, Benjamin Lerer  wrote:

> Hey Ekaterina,
> Thanks for the update and all the work.
>
>
>> -- we also use setAccessible at numerous places.
>
>
> It is not necessarily a problem as long as we do not get an issue with the
> module boundaries and their access. For me it needs to be looked at on a
> case-by-case basis.
>
> - thoughts around the usage/future of Unsafe? History around the choice of
>> using it in C* and future plans I might not know of?
>>
>
> It was used mainly to get around the fact that Java did not offer other
> means to do certain things.
> Outside of trying to anticipate some of the restrictions of that API and
> making sure that the JDK offers a suitable replacement for us, I am not sure
> that there is much that we can do. But I might misunderstand your question.
>
> Le mer. 1 mars 2023 à 21:16, Ekaterina Dimitrova 
> a écrit :
>
>> Hi everyone,
>> Some updates and questions around JDK 17 below.
>> First of all, I wanted to let people know that Cassandra trunk can
>> already be compiled and run with J8 + J11 + J17. This is the product of
>> the realization that the feature branch makes it harder to work on
>> JDK17-related tickets due to too many moving parts. Agreement was reached
>> in [1] that the new JDK can be introduced incrementally. Scripted UDFs
>> were removed; hooks are to be added in a follow-up ticket.
>> What does this mean?
>> - Currently you can compile and run Cassandra trunk with JDK 17 (in
>> addition to 8+11). You can already run unit and Java distributed tests
>> with JDK 17.
>> - CASSANDRA-18106 is in progress, enabling CCM to handle JDK 8, 11 and 17
>> with trunk; when that is ready we will also be able to run the Python
>> tests. After that one lands comes CASSANDRA-18247, whose goal is to add a
>> CircleCI config (separate from the one we have for 8+11) for 11+17, which
>> can be used by people who work on JDK17-related issues. A patch proposal
>> is already in the ticket. The final version will come when we switch from
>> 8+11 to 11+17; things will evolve along the way.
>>
>> In practice, anyone who is interested in testing or helping with the
>> JDK17 effort can easily do it directly from trunk. Jenkins and CircleCI
>> will not be switched from 8+11 to 11+17 until we are ready; only an
>> experimental additional CircleCI config will be added, temporarily, to
>> make testing easier.
>>
>> As a reminder, the umbrella ticket for the JDK17 builds is
>> CASSANDRA-16895.
>> A good outstanding candidate that is still unassigned is CASSANDRA-18180;
>> if anyone has cycles, please take a look at it. CASSANDRA-18263 might also
>> be of interest to someone.
>>
>> In other news, I added already to the JDK17 jvm option

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Paulo Motta
I'm not sure if this recommendation is still valid (or ever was) but it's
not uncommon to have higher RF on system_auth keyspaces, where it would be
quite dramatic to hit this bug on the loss of a properly configured rack
for RF=3.
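
For concreteness, the kind of setup I mean is the common advice to run
something like the statement below against a DC that only has three racks,
which is exactly the RF > #racks situation this thread is about. The
statement is an illustration only, and the DC name is a placeholder:

ALTER KEYSPACE system_auth
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5};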

On Tue, Mar 7, 2023 at 2:40 PM Jeff Jirsa  wrote:

> Anyone have stats on how many people use RF > 3 per dc? (I know what it
> looks like in my day job but I don’t want to pretend it’s representative of
> the larger community)
>
> I’m a fan of fixing this but I do wonder how common this is in the wild.
>
> On Mar 7, 2023, at 9:12 AM, Derek Chen-Becker 
> wrote:
>
> 
> I think that the warning would only be thrown in the case where a
> potentially QUORUM-busting configuration is used. I think it would be a
> worse experience to not warn and let the user discover later when they
> can't write at QUORUM.
>
> Cheers,
>
> Derek
>
> On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan <
> jeremiah.jor...@gmail.com> wrote:
>
>> I agree with Paulo, it would be nice if we could figure out some way to
>> make new NTS work correctly, with a parameter to fall back to the “bad”
>> behavior, so that people restoring backups to a new cluster can get the
>> right behavior to match their backups.
>> The problem with only fixing this in a new strategy is we have a ton of
>> tutorials and docs out there which tell people to use NTS, so it would be
>> great if we could keep “use NTS” as the recommendation.  Throwing a warning
>> when someone uses NTS is kind of user hostile.  If someone just read some
>> tutorial or doc which told them “make your key space this way” and then
>> when they do that the database yells at them telling them they did it
>> wrong, it is not a great experience.
>>
>> -Jeremiah
>>
>> > On Mar 7, 2023, at 10:16 AM, Benedict  wrote:
>> >
>> > My view is that this is a pretty serious bug. I wonder if
>> transactional metadata will make it possible to safely fix this for users
>> without rebuilding (only via opt-in, of course).
>> >
>> >> On 7 Mar 2023, at 15:54, Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> >>
>> >> Thanks everybody for the feedback.
>> >>
>> >> I think that emitting a warning upon keyspace creation (and
>> alteration) should be enough for starters. If somebody can not live without
>> 100% bullet proof solution over time we might choose some approach from the
>> offered ones. As the saying goes there is no silver bullet. If we decide to
>> implement that new strategy, we would probably emit warnings anyway on NTS
>> but it would be already done so just new strategy would be provided.
>> >>
>> >> 
>> >> From: Paulo Motta 
>> >> Sent: Monday, March 6, 2023 17:48
>> >> To: dev@cassandra.apache.org
>> >> Subject: Re: Degradation of availability when using NTS and RF >
>> number of racks
>> >>
>> >> NetApp Security WARNING: This is an external email. Do not click links
>> or open attachments unless you recognize the sender and know the content is
>> safe.
>> >>
>> >>
>> >>
>> >> It's a bit unfortunate that NTS does not maintain the ability to lose
>> a rack without loss of quorum for RF > #racks > 2, since this can be easily
>> achieved by evenly placing replicas across all racks.
>> >>
>> >> Since RackAwareTopologyStrategy is a superset of
>> NetworkTopologyStrategy, can't we just use the new correct placement logic
>> for newly created keyspaces instead of having a new strategy?
>> >>
>> >> The placement logic would be backwards-compatible for RF <= #racks. On
>> upgrade, we could mark existing keyspaces with RF > #racks with
>> use_legacy_replica_placement=true to maintain backwards compatibility and
>> log a warning that the rack loss guarantee is not maintained for keyspaces
>> created before the fix. Old keyspaces with RF <=#racks would still work
>> with the new replica placement. The downside is that we would need to keep
>> the old NTS logic around, or we could eventually deprecate it and require
>> users to migrate keyspaces using the legacy placement strategy.
>> >>
>> >> Alternatively we could have RackAwareTopologyStrategy and fail NTS
>> keyspace creation for RF > #racks and indicate users to use
>> RackAwareTopologyStrategy to maintain the quorum guarantee on rack loss or
>> set an override flag "support_quorum_on_rack_loss=false". This feels a bit
>> iffy though since it could potentially confuse users about when to use each
>> strategy.
>> >>
>> >> On Mon, Mar 6, 2023 at 5:51 AM Miklosovic, Stefan <
>> stefan.mikloso...@netapp.com> wrote:
>> >> Hi all,
>> >>
>> >> some time ago we identified an issue with NetworkTopologyStrategy. The
>> problem is that when RF > number of racks, it may happen that NTS places
>> replicas in such a way that when whole rack is lost, we lose QUORUM and
>> data are not available anymore if QUORUM CL is used.
>> >>
>> >> To illustrate this problem, lets have this setup:
>> >>
>> >> 9 nodes in 1 DC, 3 racks, 3 nodes per rack. R

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Miklosovic, Stefan
I am forwarding a message Ben Slater wrote to me personally and asked me to 
post; he is having some problems subscribing to this mailing list with his 
email address.

Very uncommon in my experience – my guess would be at most 2 to 3 clusters out 
of the few hundred that we manage.

Also picking up on one of your comments earlier in the thread – " like anyone 
running RF=3 in AWS us-west-1 (or any other region with only 2 accessible 
AZs)": in the situation with RF3 and two racks/AZs the current NTS behaviour is 
no worse than the best logical case of 2 replicas in 1 rack. This issue is 
really only a problem with RF5 and 3 AZs where you can end up with 3 replicas 
in one AZ and then lose quorum with failure of that AZ.

We currently work around this issue by ensuring we define 5 racks if targeting 
RF5 in a region with less than 5 AZs but having multiple logical racks point to 
the same AZs (we also do the same with RF3 and 2 AZs because it makes some 
management ops simpler to have consistency).
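
To make that workaround concrete: with a rack-aware snitch such as
GossipingPropertyFileSnitch (my assumption about how this is wired, not a
description of our actual setup), every node pins its logical rack in
cassandra-rackdc.properties, so five logical racks can be spread round-robin
across the three AZs. The names below are placeholders for illustration only.

# cassandra-rackdc.properties, one file per node:
#   nodes in AZ us-east-1a -> rack=rack1 or rack=rack4
#   nodes in AZ us-east-1b -> rack=rack2 or rack=rack5
#   nodes in AZ us-east-1c -> rack=rack3
dc=us-east
rack=rack1

NTS then sees five racks and places one of the five replicas in each logical
rack, so an AZ failure costs at most two replicas and quorum survives, at the
cost of the operator keeping the logical-to-physical mapping balanced.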

Paulo made a good point about the system_auth case, although now that I think 
about it, it probably doesn’t have an impact there, because I think system_auth 
is queried at LOCAL_ONE and a high RF on system_auth was more about replicating 
to lots of nodes than about distributing across racks.

Cheers
Ben


From: Paulo Motta 
Sent: Tuesday, March 7, 2023 21:43
To: dev@cassandra.apache.org
Subject: Re: Degradation of availability when using NTS and RF > number of racks

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



I'm not sure if this recommendation is still valid (or ever was) but it's not 
uncommon to have higher RF on system_auth keyspaces, where it would be quite 
dramatic to hit this bug on the loss of a properly configured rack for RF=3.

On Tue, Mar 7, 2023 at 2:40 PM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
Anyone have stats on how many people use RF > 3 per dc? (I know what it looks 
like in my day job but I don’t want to pretend it’s representative of the 
larger community)

I’m a fan of fixing this but I do wonder how common this is in the wild.

On Mar 7, 2023, at 9:12 AM, Derek Chen-Becker 
mailto:de...@chen-becker.org>> wrote:


I think that the warning would only be thrown in the case where a potentially 
QUORUM-busting configuration is used. I think it would be a worse experience to 
not warn and let the user discover later when they can't write at QUORUM.

Cheers,

Derek

On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan 
mailto:jeremiah.jor...@gmail.com>> wrote:
I agree with Paulo, it would be nice if we could figure out some way to make 
new NTS work correctly, with a parameter to fall back to the “bad” behavior, so 
that people restoring backups to a new cluster can get the right behavior to 
match their backups.
The problem with only fixing this in a new strategy is we have a ton of 
tutorials and docs out there which tell people to use NTS, so it would be great 
if we could keep “use NTS” as the recommendation.  Throwing a warning when 
someone uses NTS is kind of user hostile.  If someone just read some tutorial 
or doc which told them “make your key space this way” and then when they do 
that the database yells at them telling them they did it wrong, it is not a 
great experience.

-Jeremiah

> On Mar 7, 2023, at 10:16 AM, Benedict 
> mailto:bened...@apache.org>> wrote:
>
> My view is that this is a pretty serious bug. I wonder if transactional 
> metadata will make it possible to safely fix this for users without 
> rebuilding (only via opt-in, of course).
>
>> On 7 Mar 2023, at 15:54, Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> wrote:
>>
>> Thanks everybody for the feedback.
>>
>> I think that emitting a warning upon keyspace creation (and alteration) 
>> should be enough for starters. If somebody can not live without 100% bullet 
>> proof solution over time we might choose some approach from the offered 
>> ones. As the saying goes there is no silver bullet. If we decide to 
>> implement that new strategy, we would probably emit warnings anyway on NTS 
>> but it would be already done so just new strategy would be provided.
>>
>> 
>> From: Paulo Motta mailto:pauloricard...@gmail.com>>
>> Sent: Monday, March 6, 2023 17:48
>> To: dev@cassandra.apache.org
>> Subject: Re: Degradation of availability when using NTS and RF > number of 
>> racks
>>
>> NetApp Security WARNING: This is an external email. Do not click links or 
>> open attachments unless you recognize the sender and know the content is 
>> safe.
>>
>>
>>
>> It's a bit unfortunate that NTS does not maintain the ability to lose a rack 
>> without loss of quorum for RF > #racks > 2, since this can be easily 
>> achieved by evenly placing replicas across all racks.
>>
>> Since RackAwareTopologyStrategy is a supers

Re: Degradation of availability when using NTS and RF > number of racks

2023-03-07 Thread Miklosovic, Stefan
I forgot to remove the last paragraph of the forwarded message. We really do 
run some queries at QUORUM on system_auth:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L277-L291


From: Miklosovic, Stefan 
Sent: Tuesday, March 7, 2023 22:37
To: dev@cassandra.apache.org
Subject: Re: Degradation of availability when using NTS and RF > number of racks

I am forwarding a message Ben Slater wrote to me personally and asked me to 
post; he is having some problems subscribing to this mailing list with his 
email address.

Very uncommon in my experience – my guess would be at most 2 to 3 clusters out 
of the few hundred that we manage.

Also picking up on one of your comments earlier in the thread – " like anyone 
running RF=3 in AWS us-west-1 (or any other region with only 2 accessible 
AZs)": in the situation with RF3 and two racks/AZs the current NTS behaviour is 
no worse than the best logical case of 2 replicas in 1 rack. This issue is 
really only a problem with RF5 and 3 AZs where you can end up with 3 replicas 
in one AZ and then lose quorum with failure of that AZ.

We currently work around this issue by ensuring we define 5 racks if targeting 
RF5 in a region with less than 5 AZs but having multiple logical racks point to 
the same AZs (we also do the same with RF3 and 2 AZs because it makes some 
management ops simpler to have consistency).

Paulo made a good point about the system_auth case, although now that I think 
about it, it probably doesn’t have an impact there, because I think system_auth 
is queried at LOCAL_ONE and a high RF on system_auth was more about replicating 
to lots of nodes than about distributing across racks.

Cheers
Ben


From: Paulo Motta 
Sent: Tuesday, March 7, 2023 21:43
To: dev@cassandra.apache.org
Subject: Re: Degradation of availability when using NTS and RF > number of racks

NetApp Security WARNING: This is an external email. Do not click links or open 
attachments unless you recognize the sender and know the content is safe.



I'm not sure if this recommendation is still valid (or ever was) but it's not 
uncommon to have higher RF on system_auth keyspaces, where it would be quite 
dramatic to hit this bug on the loss of a properly configured rack for RF=3.

On Tue, Mar 7, 2023 at 2:40 PM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
Anyone have stats on how many people use RF > 3 per dc? (I know what it looks 
like in my day job but I don’t want to pretend it’s representative of the 
larger community)

I’m a fan of fixing this but I do wonder how common this is in the wild.

On Mar 7, 2023, at 9:12 AM, Derek Chen-Becker 
mailto:de...@chen-becker.org>> wrote:


I think that the warning would only be thrown in the case where a potentially 
QUORUM-busting configuration is used. I think it would be a worse experience to 
not warn and let the user discover later when they can't write at QUORUM.

Cheers,

Derek

On Tue, Mar 7, 2023 at 9:32 AM Jeremiah D Jordan 
mailto:jeremiah.jor...@gmail.com>> wrote:
I agree with Paulo, it would be nice if we could figure out some way to make 
new NTS work correctly, with a parameter to fall back to the “bad” behavior, so 
that people restoring backups to a new cluster can get the right behavior to 
match their backups.
The problem with only fixing this in a new strategy is we have a ton of 
tutorials and docs out there which tell people to use NTS, so it would be great 
if we could keep “use NTS” as the recommendation.  Throwing a warning when 
someone uses NTS is kind of user hostile.  If someone just read some tutorial 
or doc which told them “make your key space this way” and then when they do 
that the database yells at them telling them they did it wrong, it is not a 
great experience.

-Jeremiah

> On Mar 7, 2023, at 10:16 AM, Benedict 
> mailto:bened...@apache.org>> wrote:
>
> My view is that this is a pretty serious bug. I wonder if transactional 
> metadata will make it possible to safely fix this for users without 
> rebuilding (only via opt-in, of course).
>
>> On 7 Mar 2023, at 15:54, Miklosovic, Stefan 
>> mailto:stefan.mikloso...@netapp.com>> wrote:
>>
>> Thanks everybody for the feedback.
>>
>> I think that emitting a warning upon keyspace creation (and alteration) 
>> should be enough for starters. If somebody can not live without 100% bullet 
>> proof solution over time we might choose some approach from the offered 
>> ones. As the saying goes there is no silver bullet. If we decide to 
>> implement that new strategy, we would probably emit warnings anyway on NTS 
>> but it would be already done so just new strategy would be provided.
>>
>> 
>> From: Paulo Motta mailto:pauloricard...@gmail.com>>
>> Sent: Monday, March 6, 2023 17:48
>> To: dev@cassandra.apache.org
>> Subject: Re: Degradation of availability when using NTS and RF > number of 
>> racks
>>
>> Net

Re: [jira] [Updated] (CASSANDRA-18305) Enhance nodetool compactionstats with existing MBean metrics

2023-03-07 Thread guo Maxwell
I had a similar need earlier and have been trying to solve it. Looking forward
to this patch.
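
In the meantime, a quick way to eyeball those rates is to read the meter
straight over JMX. A minimal sketch, assuming the standard metric name
org.apache.cassandra.metrics:type=Compaction,name=TotalCompactionsCompleted
and the default JMX port 7199 (adjust both for your cluster):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionRates
{
    public static void main(String[] args) throws Exception
    {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName meter = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=TotalCompactionsCompleted");
            // The meter exposes a Count plus 1/5/15-minute and mean rates in events/second.
            System.out.println("Count:             " + mbs.getAttribute(meter, "Count"));
            System.out.println("OneMinuteRate:     " + mbs.getAttribute(meter, "OneMinuteRate"));
            System.out.println("FiveMinuteRate:    " + mbs.getAttribute(meter, "FiveMinuteRate"));
            System.out.println("FifteenMinuteRate: " + mbs.getAttribute(meter, "FifteenMinuteRate"));
        }
    }
}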

Brad Schoening (Jira) wrote on Wednesday, March 8, 2023 at 12:15 PM:

>
>  [
> https://issues.apache.org/jira/browse/CASSANDRA-18305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> ]
>
> Brad Schoening updated CASSANDRA-18305:
> ---
> Description:
> Nodetool compactionstats reports only on active compactions, if nothing is
> active, you see only:
> {quote}$nodetool compactionstats
>
> pending tasks: 0
> {quote}
> but in the MBean Compaction/TotalCompactionsCompleted there are recent
> statistics in events/second for:
>  * Count
>  * FifteenMinuteRate
>  * FiveMinuteRate
>  * MeanRate
>  * OneMinuteRate
>
> It would be useful to see in addition:
> {quote}pending tasks: 0
>
> compactions completed: 20
>
> 1 minute rate:0/second
>
>5 minute rate:2.3/second
>
>   15 minute rate:   4.6/second
> {quote}
> Also, since compaction_throughput_mb_per_sec is a throttling parameter in
> cassandra.yaml, it would be nice to show the actual compaction throughput
> and be able to observe if you're close to the limit.  I.e.,
> {quote}compaction throughput 13.2 MBps / 16 MBps (82.5%)
> {quote}
>
>   was:
> Nodetool compactionstats reports only on active compactions, if nothing is
> active, you see only:
> {quote}$nodetool compactionstats
>
> pending tasks: 0
> {quote}
> but in the MBean Compaction/TotalCompactionsCompleted there are recent
> statistics in events/second for:
>  * Count
>  * FifteenMinuteRate
>  * FiveMinuteRate
>  * MeanRate
>  * OneMinuteRate
>
> It would be useful to see in addition:
> {quote}pending tasks: 0
>
> compactions completed: 20
>
> 1 minute rate:0/second
>
>5 minute rate:2.3/second
>
>   15 minute rate:   4.6/second
> {quote}
> Also, since compaction_throughput_mb_per_sec is a throttling parameter in
> cassandra.yaml, it would be nice to report the actual compaction throughput
> and be able to observe if you're close to the limit.  I.e.,
>
>   compaction throughput 13.2 MBps / 16 MBps (82.5%)
>
>
> > Enhance nodetool compactionstats with existing MBean metrics
> > 
> >
> > Key: CASSANDRA-18305
> > URL:
> https://issues.apache.org/jira/browse/CASSANDRA-18305
> > Project: Cassandra
> >  Issue Type: Improvement
> >  Components: Tool/nodetool
> >Reporter: Brad Schoening
> >Assignee: Stefan Miklosovic
> >Priority: Normal
> >
> > Nodetool compactionstats reports only on active compactions, if nothing
> is active, you see only:
> > {quote}$nodetool compactionstats
> > pending tasks: 0
> > {quote}
> > but in the MBean Compaction/TotalCompactionsCompleted there are recent
> statistics in events/second for:
> >  * Count
> >  * FifteenMinuteRate
>  * FiveMinuteRate
> >  * MeanRate
> >  * OneMinuteRate
> > It would be useful to see in addition:
> > {quote}pending tasks: 0
> > compactions completed: 20
> > 1 minute rate:0/second
> >5 minute rate:2.3/second
> >   15 minute rate:   4.6/second
> > {quote}
> > Also, since compaction_throughput_mb_per_sec is a throttling parameter
> in cassandra.yaml, it would be nice to show the actual compaction
> throughput and be able to observe if you're close to the limit.  I.e.,
> > {quote}compaction throughput 13.2 MBps / 16 MBps (82.5%)
> > {quote}
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.20.10#820010)
>
> -
> To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: commits-h...@cassandra.apache.org
>
> --
you are the apple of my eye !