I haven't made any schema modifications for a year or more. This problem came up during a "normal day of work" for Cassandra.
On Mon, 1 Mar 2021 at 16:25, Bowen Song <bo...@bso.ng.invalid> wrote:

> Your missing keyspace problem has nothing to do with that bug.
>
> In that case, the same table was created twice within a very short period
> of time, and I suspect that was done concurrently on two different nodes.
> The evidence lies in the two CF IDs - bd7200a0156711e88974855d74ee356f and
> bd750de0156711e8bdc54f7bcdcb851f - which were created at
> 2018-02-19T11:26:33.898 and 2018-02-19T11:26:33.918 respectively, a gap of
> merely 20 milliseconds.
>
> TBH, it doesn't sound like a bug to me. Cassandra is eventually consistent
> by design, and two conflicting schema changes on two different nodes at
> nearly the same time will likely result in schema disagreement. Cassandra
> will eventually reach agreement again, possibly discarding one of the
> conflicting schema changes together with all data written to the discarded
> table/columns. To make sure this doesn't happen to your data, you should
> avoid making multiple schema changes to the same keyspace (for
> create/alter/... keyspace) or the same table (for create/alter/... table)
> on two or more Cassandra coordinator nodes within a very short period of
> time. Instead, send all your schema change queries to the same coordinator
> node, or if that's not possible, wait at least 30 seconds between two
> schema changes and make sure you aren't restarting any node at the same
> time.
>
> On 01/03/2021 14:04, Marco Gasparini wrote:
> > Actually, I found a lot of .db files in the following directory:
> > /var/lib/cassandra/data/mykespace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
> >
> > I also found this:
> > 2021-03-01 06:08:08,864 INFO [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'
> >
> > So I think that you, @erick and @bowen, are right. Something dropped the
> > keyspace.
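(Editor's note: the creation times Bowen quotes can be read straight out of the CF IDs, which are version-1, time-based UUIDs. A minimal sketch in Python — the helper name is mine; the CF IDs are the two from the message above:)

```python
import uuid
from datetime import datetime, timedelta, timezone

def cfid_created_at(cfid_hex: str) -> datetime:
    """Recover the creation time embedded in a version-1 (time-based) UUID.

    UUID.time counts 100-nanosecond ticks since the UUID epoch,
    1582-10-15 00:00:00 UTC.
    """
    ticks = uuid.UUID(cfid_hex).time
    return datetime(1582, 10, 15, tzinfo=timezone.utc) + timedelta(microseconds=ticks // 10)

a = cfid_created_at("bd7200a0156711e88974855d74ee356f")
b = cfid_created_at("bd750de0156711e8bdc54f7bcdcb851f")
print(a.isoformat())                   # 2018-02-19T11:26:33.898000+00:00
print((b - a).total_seconds() * 1000)  # 20.0 (milliseconds apart)
```

This is how one can tell, from the IDs alone, that two CREATE TABLE statements raced each other 20 ms apart.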
> > I will try to follow your procedure @bowen, thank you very much!
> >
> > Do you know what could cause this issue? It seems like a big issue. I
> > found this bug
> > https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel,
> > maybe they are correlated...
> >
> > Thank you @Bowen and @Erick
> >
> > On Mon, 1 Mar 2021 at 13:39, Bowen Song <bo...@bso.ng.invalid> wrote:
> >
>> The warning message indicates the node y.y.y.y went down (or became
>> unreachable via the network) before 2021-02-28 05:17:33. Is there any
>> chance you can find the log file on that node at around or before that
>> time? It may show why that node went down. The reason might be irrelevant
>> to the missing keyspace, but it's still worth a look in order to prevent
>> the same thing from happening again.
>>
>> As Erick said, the table's CF ID isn't new, so it's unlikely to be a
>> schema synchronization issue. Therefore I also suspect the keyspace was
>> accidentally dropped. Cassandra only logs "Drop Keyspace 'keyspace_name'"
>> on the node that received the "DROP KEYSPACE ..." query, so you may have
>> to search for it in the log files from all nodes.
>>
>> Assuming the keyspace was dropped but you still have the SSTable files,
>> you can recover the data by re-creating the keyspace and tables with an
>> identical replication strategy and schema, then copying the SSTable files
>> to the corresponding new table directories (with different CF ID suffixes)
>> on the same node, and finally running "nodetool refresh ..." or restarting
>> the node. Since you don't yet have a full backup, I strongly recommend you
>> make a backup, and ideally test restoring it to a different cluster,
>> before attempting this.
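(Editor's note: the copy step of the recovery procedure above can be sketched as follows. This is only an illustration under assumed paths — the helper name and directory names are hypothetical; the real table directories carry the old and new CF ID suffixes, and "nodetool refresh" or a node restart is still needed afterwards, as Bowen says.)

```python
import shutil
from pathlib import Path

def copy_sstables(old_table_dir: str, new_table_dir: str) -> list[str]:
    """Copy SSTable component files (*-Data.db, *-Index.db, *-TOC.txt, etc.)
    from the dropped table's directory (or its 'dropped-...' snapshot) into
    the directory of the re-created table, which has a new CF ID suffix."""
    src, dst = Path(old_table_dir), Path(new_table_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in sorted(src.iterdir()):
        if f.is_file():
            shutil.copy2(f, dst / f.name)  # copy2 preserves timestamps
            copied.append(f.name)
    return copied
```

After copying the files on each node, run "nodetool refresh <keyspace> <table>" (or restart the node) so Cassandra picks them up — and, per the advice above, take and test a backup before attempting any of this.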
>> On 01/03/2021 11:48, Marco Gasparini wrote:
>>
>> Here is the previous error:
>>
>> 2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165 validateAndConnectIfNeeded failed to connect to node {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
>> org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
>>     at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
>>     at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
>>     at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
>>     at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
>>     at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
>>     at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
>>     at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
>>     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>>
>> Yes, this node (y.y.y.y) stopped because it ran out of disk space.
>>
>> I said "deleted" because I'm not a native English speaker :)
>> I usually "remove" snapshots via 'nodetool clearsnapshot' or the
>> cassandra-reaper user interface.
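(Editor's note: the snapshots/dropped-1614575293790-mytable directory mentioned earlier in the thread also encodes when the drop happened — in Cassandra 3.x the auto-snapshot taken on a table drop is named dropped-&lt;epoch-millis&gt;-&lt;table&gt;. A quick decode, with a helper name of my choosing:)

```python
from datetime import datetime, timedelta, timezone

def parse_dropped_snapshot(dirname: str):
    """Split a 'dropped-<epoch-millis>-<table>' snapshot directory name
    into the drop time (UTC) and the table name."""
    _, millis, table = dirname.split("-", 2)
    when = datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=int(millis))
    return when, table

when, table = parse_dropped_snapshot("dropped-1614575293790-mytable")
print(when.isoformat(), table)  # 2021-03-01T05:08:13.790000+00:00 mytable
```

That is a few seconds after the 06:08:08 "Drop Keyspace" log line quoted above, if (an assumption) that node logs in a UTC+1 local time.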
>> On Mon, 1 Mar 2021 at 12:39, Bowen Song <bo...@bso.ng.invalid> wrote:
>>
>>> What was the warning? Is it related to the disk failure policy? Could
>>> you please share the relevant log? You can edit it and redact the
>>> sensitive information before sharing it.
>>>
>>> Also, I can't help noticing that you used the word "delete" (instead of
>>> "clear") to describe the process of removing snapshots. May I ask how
>>> you deleted the snapshots? Was it "nodetool clearsnapshot ...",
>>> "rm -rf ..." or something else?
>>>
>>> On 01/03/2021 11:27, Marco Gasparini wrote:
>>>
>>> Thanks Bowen for answering.
>>>
>>> Actually, I checked the server log and the only warning was that a node
>>> went offline. No, I have no backups or snapshots.
>>>
>>> In the meantime I found that Cassandra probably moved all the files from
>>> a directory to the snapshot directory. I am pretty sure of that because
>>> I recently deleted all the snapshots I had made (the node was running
>>> out of disk space), and I found this very directory full of files whose
>>> modification timestamp matches the first error in the log.
>>>
>>> On Mon, 1 Mar 2021 at 12:13, Bowen Song <bo...@bso.ng.invalid> wrote:
>>>
>>>> The first thing I'd check is the server log. It may contain vital
>>>> information about the cause, and there may be different ways to
>>>> recover depending on the cause.
>>>>
>>>> Also, please allow me to ask a seemingly obvious question: do you have
>>>> a backup?
>>>>
>>>> On 01/03/2021 09:34, Marco Gasparini wrote:
>>>>
>>>> Hello everybody,
>>>>
>>>> This morning, Monday!!!, I was checking on the Cassandra cluster and I
>>>> noticed that all data was missing.
>>>> I noticed the following error on each node (9 nodes in the cluster):
>>>>
>>>> 2021-03-01 09:05:52,984 WARN [MessagingService-Incoming-/x.x.x.x] IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading from socket; closing
>>>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
>>>>     at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
>>>>     at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
>>>>     at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
>>>>     at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
>>>>     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
>>>>
>>>> I tried to query the keyspace and got this:
>>>>
>>>> node1# cqlsh
>>>> Connected to Cassandra Cluster at x.x.x.x:9042.
>>>> [cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native protocol v4]
>>>> Use HELP for help.
>>>> cqlsh> select * from mykeyspace.mytable where id = 123935;
>>>> InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace mykeyspace does not exist"
>>>>
>>>> Investigating on each node, I found that all the SSTables exist, so I
>>>> think the data is still there but the keyspace vanished, "magically".
>>>> Other facts I can tell you:
>>>>
>>>> - I have been getting anticompaction errors from 2 nodes because the
>>>>   disk was almost full.
>>>> - The cluster was online on Friday.
>>>> - This morning, Monday, the whole cluster was offline and I noticed
>>>>   the "missing keyspace" problem.
>>>> - During the weekend the cluster was subject to inserts and deletes.
>>>> - I have a 9-node (HDD) Cassandra 3.11 cluster.
>>>>
>>>> I really need help on this. How can I restore the cluster?
>>>>
>>>> Thank you very much
>>>> Marco