Re: Potential issues during 4.0 upgrade

2021-08-23 Thread Scott Andreas
Thank you for raising this, Sam!

Agreed this is a bug that warrants releasing 4.0.1 and notifying user@.

To elaborate on impact, this issue can produce a state in rolling 3.x -> 4.0 
upgrades in which 4.0 nodes fail to serialize gossip state during the shadow 
round once the size of this state exceeds 128kb. This prevents new instances 
from coming up. Once in this state, it is also not possible for new instances 
to start up and join the ring. If existing 4.0 instances restart, they will 
also be unable to gossip and remain down.

It's a pretty serious situation without an obvious way out aside from deploying 
this patch. We should get a new release out quickly.

– Scott


From: Sam Tunnicliffe 
Sent: Monday, August 23, 2021 11:27 AM
To: dev@cassandra.apache.org
Subject: Potential issues during 4.0 upgrade

Hi all,

I just opened a JIRA which is relevant to those running large clusters (around 
the 400 node range) who plan to upgrade to 4.0 soon.

https://issues.apache.org/jira/browse/CASSANDRA-16877 


The issue is that in large clusters, the size of gossip messages sent when a 
node (re)starts may exceed the hard limit of the urgent message channel. This 
causes an error on the sender and ultimately the message is dropped. This in 
turn can cause startup failures and/or partial loss of availability.

Fortunately, the fix is quite simple and I’ve submitted a patch that I and 
other contributors have been running since discovering this issue and can 
confirm resolves the problem. It would be great to get it reviewed and merged 
ASAP and then cut a 4.0.1 release. In the meantime, it may be wise to suggest 
that operators of large clusters hold off on any planned 4.0 upgrades.

Thanks,
Sam


-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org



Potential issues during 4.0 upgrade

2021-08-23 Thread Sam Tunnicliffe
Hi all,

I just opened a JIRA which is relevant to those running large clusters (around 
the 400 node range) who plan to upgrade to 4.0 soon.

https://issues.apache.org/jira/browse/CASSANDRA-16877 
 

The issue is that in large clusters, the size of gossip messages sent when a 
node (re)starts may exceed the hard limit of the urgent message channel. This 
causes an error on the sender and ultimately the message is dropped. This in 
turn can cause startup failures and/or partial loss of availability.  
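
As a rough illustration of the failure mode described above (not taken from the
patch or from Cassandra's code), the sketch below models a sender that refuses
to serialize a gossip payload larger than a hard channel limit. The names
serialize_gossip_state, max_endpoints and MessageTooLargeError are
hypothetical; the limit is assumed to be 128 KiB, based on the "128kb" figure
quoted in the reply to this thread; and the 330 bytes of state per endpoint in
the usage note is a made-up number chosen only for the arithmetic.

```python
# Illustrative only: names and numbers are assumptions, not Cassandra code.
URGENT_CHANNEL_LIMIT_BYTES = 128 * 1024  # assumed hard limit (128 KiB)


class MessageTooLargeError(Exception):
    """Raised when a serialized message exceeds the channel's hard limit."""


def serialize_gossip_state(endpoint_states, limit=URGENT_CHANNEL_LIMIT_BYTES):
    """Concatenate per-endpoint state and enforce the hard size limit.

    endpoint_states maps an endpoint name to its already-encoded state bytes.
    """
    payload = b"".join(endpoint_states[name] for name in sorted(endpoint_states))
    if len(payload) > limit:
        # In the scenario from the thread, this is where the sender errors
        # out and the message is ultimately dropped.
        raise MessageTooLargeError(
            "payload of %d bytes exceeds limit of %d" % (len(payload), limit)
        )
    return payload


def max_endpoints(bytes_per_endpoint, limit=URGENT_CHANNEL_LIMIT_BYTES):
    """Largest number of endpoints whose combined state still fits the limit."""
    return limit // bytes_per_endpoint
```

Under those assumed numbers, max_endpoints(330) comes out just under 400, which
is in line with the cluster sizes mentioned above; the real per-endpoint size
will vary with the application state each node gossips.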

Fortunately, the fix is quite simple and I’ve submitted a patch that I and 
other contributors have been running since discovering this issue and can 
confirm resolves the problem. It would be great to get it reviewed and merged 
ASAP and then cut a 4.0.1 release. In the meantime, it may be wise to suggest 
that operators of large clusters hold off on any planned 4.0 upgrades.

Thanks,
Sam



Re: [VOTE] CEP-11: Pluggable memtable implementations

2021-08-23 Thread Stefania Alborghetti
+1

On Fri, Aug 20, 2021 at 3:41 AM Sam Tunnicliffe  wrote:

> +1
>
> > On 19 Aug 2021, at 17:10, Branimir Lambov  wrote:
> >
> > Hello everyone,
> >
> > I am proposing CEP-11 (Pluggable memtable implementations) for
> > adoption.
> >
> > Discussion thread:
> >
> > https://lists.apache.org/thread.html/rb5e950f882196764744c31bc3c13dfbf0603cb9f8bc2f6cfb976d285%40%3Cdev.cassandra.apache.org%3E
> >
> >
> > The vote will be open for 72 hours.
> > Votes by PMC members are considered binding.
> > A vote passes if there are at least three binding +1s and no binding
> > vetoes.

-- 
*Stefania Alborghetti*
Cloud engineering

+1 650 389 6000 | datastax.com


Re: Google Summer of Code Wrap Up: Add TTL support to nodetool snapshots

2021-08-23 Thread Benjamin Lerer
Great work Abi, Paulo and Stefan!

On Sat, Aug 21, 2021 at 20:10, Ekaterina Dimitrova wrote:

> Thank you for all your time and efforts Abi, Paulo and Stefan!
>
> Abi, I hope you also had some fun with the work and that this is only
> the beginning of a continuing collaboration with the community.
>
> On Fri, 20 Aug 2021 at 11:50, Jonathan Ellis  wrote:
>
> > Thank you, Abi!  And thanks to Stefan and Paulo for mentoring!
> >
> > On Fri, Aug 20, 2021 at 10:43 AM Paulo Motta  wrote:
> >
> > > Hi everyone,
> > >
> > > > Just a heads up to the community that we're wrapping up this year's
> > > > Google Summer of Code project.
> > >
> > > > Abi Palagashvili worked with us over the last couple of months to add
> > > > TTL support to nodetool snapshots in CASSANDRA-16789 <
> > > > https://issues.apache.org/jira/browse/CASSANDRA-16789>, under my and
> > > > Stefan Miklosovic's mentorship. We're in the final round of review
> > > > before merging the feature and welcome anyone who wants to take a look
> > > > at the final patch and give feedback.
> > >
> > > > After this change ships in the next major release, clients can supply
> > > > an optional --ttl parameter during nodetool snapshot creation and
> > > > Cassandra will automatically clean up expired snapshots, avoiding the
> > > > need for external management of snapshot cleanup.
> > >
> > > > During the process of adding this feature we identified several
> > > > areas for improvement and started an effort to modernize the snapshot
> > > > module by centralizing snapshot lifecycle management in a
> > > > SnapshotManager class, which is responsible for keeping track of active
> > > > snapshots in memory and periodically cleaning them up when they expire.
> > > > Right now we're only managing "expiring" snapshots in this class, but we
> > > > plan to migrate the legacy snapshot lifecycle management to this class in
> > > > follow-up tickets to decouple it from the keyspace and table management
> > > > classes. We significantly increased the test coverage of the snapshot
> > > > lifecycle and added in-jvm tests to verify the feature.
> > >
> > > > We plan to extend this feature before it's released in 4.1 by adding
> > > > support for pausing/resuming snapshot cleanup and by allowing clients to
> > > > supply a TTL for auto snapshots (those optionally created during
> > > > truncation, table drop or compaction), as well as integrating it with
> > > > the ability to clear snapshots created since a specific date <
> > > > https://issues.apache.org/jira/browse/CASSANDRA-16860>. The parent task
> > > > to track future improvements in this area is CASSANDRA-16451 <
> > > > https://issues.apache.org/jira/browse/CASSANDRA-16451>.
> > >
> > > > We thank Abi very much for his effort during the project and hope he
> > > > stays around in the community!
> > >
> > > Kind Regards,
> > >
> > > Paulo
> > >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
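
The quoted message describes a SnapshotManager that keeps track of expiring
snapshots in memory and periodically cleans them up. A minimal sketch of that
idea follows. It is written in Python for brevity and is not the Cassandra
implementation: the class name SnapshotManagerSketch, its methods, and the
delete_fn callback are all hypothetical, and time is passed in explicitly
rather than read from a clock.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ExpiringSnapshot:
    # Ordering uses expires_at only, so the heap pops the soonest expiry first.
    expires_at: float
    tag: str = field(compare=False)


class SnapshotManagerSketch:
    """Hypothetical in-memory tracker for snapshots created with a TTL."""

    def __init__(self, delete_fn):
        self._heap = []              # min-heap keyed on expiry time
        self._delete_fn = delete_fn  # assumed callback that removes the files

    def add(self, tag, created_at, ttl_seconds):
        """Register a snapshot that should expire ttl_seconds after creation."""
        heapq.heappush(self._heap, ExpiringSnapshot(created_at + ttl_seconds, tag))

    def clear_expired(self, now):
        """Delete every snapshot whose expiry is at or before `now`.

        Intended to be called periodically; returns the tags it removed.
        """
        removed = []
        while self._heap and self._heap[0].expires_at <= now:
            snapshot = heapq.heappop(self._heap)
            self._delete_fn(snapshot.tag)
            removed.append(snapshot.tag)
        return removed
```

A caller would register each snapshot taken with a TTL via add() and invoke
clear_expired() from a scheduled task. Keeping the entries in a heap means each
periodic pass only touches snapshots that have actually expired.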