Nothing obvious from the logs you posted.

Generally speaking, replaying commit log is often the culprit when a node takes a long time to start. I have seen many nodes with large memtable and commit log size limit spending over half an hour replaying the commit log. I usually do a "nodetool flush" before shutting down the node to help speed up the start time if the shutdown was planned. There isn't much you can do about unexpected shutdown, such as server crashes. When that happens, the only reasonable thing to do is wait for the commit log replay to finish. You should see log entries related to replaying commit logs if this is the case.

However, if you don't find any logs related to replaying commit logs, the cause may be completely different.


On 19/01/2022 11:54, Paul Chandler wrote:
Hi all,

We have upgraded a couple of clusters from 3.11.6, now we are having issues 
when we restart the nodes.

The node will either hang or take 10-30 minute to restart, these are the last 
messages we have in the system.log:

INFO  [NonPeriodicTasks:1] 2022-01-19 10:08:23,267  FileUtils.java:545 - 
Deleting file during startup: 
/var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-184-big-Summary.db
INFO  [NonPeriodicTasks:1] 2022-01-19 10:08:23,268  LogTransaction.java:240 - 
Unfinished transaction log, deleting 
/var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-185-big-Data.db
INFO  [NonPeriodicTasks:1] 2022-01-19 10:08:23,268  FileUtils.java:545 - 
Deleting file during startup: 
/var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-185-big-Summary.db
INFO  [NonPeriodicTasks:1] 2022-01-19 10:08:23,269  LogTransaction.java:240 - 
Unfinished transaction log, deleting 
/var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-186-big-Data.db
INFO  [NonPeriodicTasks:1] 2022-01-19 10:08:23,270  FileUtils.java:545 - 
Deleting file during startup: 
/var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb-186-big-Summary.db
INFO  [NonPeriodicTasks:1] 2022-01-19 10:08:23,272  LogTransaction.java:240 - 
Unfinished transaction log, deleting 
/var/lib/cassandra/data/system/table_estimates-176c39cdb93d33a5a2188eb06a56f66e/nb_txn_unknowncompactiontype_bc501d00-790f-11ec-9f80-85
8854746758.log
INFO  [MemtableFlushWriter:2] 2022-01-19 10:08:23,289  LogTransaction.java:240 
- Unfinished transaction log, deleting 
/var/lib/cassandra/data/system/local-7ad54392bcdd35a684174e047860b377/nb_txn_flush_bc52dc20-790f-11ec-9f80-858854746758.log

The debug log has messages from DiskBoundaryManager.java at the same time, then 
it just has the following messages:||

DEBUG [ScheduledTasks:1] 2022-01-19 10:28:09,430  SSLFactory.java:354 - 
Checking whether certificates have been updated []
DEBUG [ScheduledTasks:1] 2022-01-19 10:38:09,431  SSLFactory.java:354 - 
Checking whether certificates have been updated []
DEBUG [ScheduledTasks:1] 2022-01-19 10:48:09,431  SSLFactory.java:354 - 
Checking whether certificates have been updated []
DEBUG [ScheduledTasks:1] 2022-01-19 10:58:09,431  SSLFactory.java:354 - 
Checking whether certificates have been updated []


It seems to get worse after each restart, and then it gets to the state where 
it just hangs, then the only thing to do is to re bootstrap the node.

Once I had re bootstrapped all the nodes in the cluster, I thought the cluster 
was stable, but I have now got the case where the one of the nodes is hanging 
again.

Does anyone have an ideas what is causing the problems ?


Thanks

Paul Chandler

Reply via email to