The warning message indicates the node y.y.y.y went down (or became
unreachable over the network) before 2021-02-28 05:17:33. Is there any
chance you can find the log file on that node at around or before that
time? It may show why that node went down. The reason may turn out to be
unrelated to the missing keyspace, but it is still worth a look in order
to prevent the same thing from happening again.
As Erick said, the table's CF ID isn't new, so a schema synchronisation
issue is unlikely. Therefore I also suspect the keyspace was
accidentally dropped. Cassandra only logs "Drop Keyspace
'keyspace_name'" on the node that received the "DROP KEYSPACE ..."
query, so you may have to search the log files on all nodes to find it.
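To save some manual searching, something like the sketch below could be run against a copy of each node's system.log. The log path and keyspace name in the comments are placeholders, not values from your cluster:

```python
# Sketch: scan one node's Cassandra system.log for the "Drop Keyspace"
# entry. Run it on a copy of each node's log, e.g.
# find_drop_events("/var/log/cassandra/system.log", "mykeyspace").
def find_drop_events(log_path, keyspace):
    """Return the log lines that record a DROP of the given keyspace."""
    needle = f"Drop Keyspace '{keyspace}'"
    hits = []
    with open(log_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if needle in line:
                hits.append(line.rstrip("\n"))
    return hits
```

Remember to check any rotated logs (system.log.*) as well, since the drop may have happened days ago.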
Assuming the keyspace was dropped but you still have the SSTable files,
you can recover the data by re-creating the keyspace and tables with an
identical replication strategy and schema, copying the SSTable files
into the corresponding new table directories (which will have different
CF ID suffixes) on the same node, and finally running "nodetool refresh
..." or restarting the node. Since you don't yet have a full backup, I
strongly recommend making one, and ideally test restoring it to a
different cluster, before attempting this.
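As a rough illustration of the copy step only (all directory names below are hypothetical; the CF ID suffixes will differ on your cluster), the idea is to copy every SSTable component file from the old <table>-<cfid> directory into the newly created one, while skipping the snapshots/ and backups/ subdirectories:

```python
# Sketch of the SSTable copy step during recovery of a dropped table.
# The data layout is <data_dir>/<keyspace>/<table>-<cfid>/; after the
# table is re-created, a new <table>-<newcfid> directory appears next
# to the old one. Paths here are examples only.
import shutil
from pathlib import Path

def copy_sstables(old_table_dir, new_table_dir):
    """Copy SSTable component files from the old CF-ID directory into
    the re-created table's directory. Subdirectories (snapshots/,
    backups/) are intentionally skipped."""
    old_dir, new_dir = Path(old_table_dir), Path(new_table_dir)
    copied = []
    for item in old_dir.iterdir():
        if item.is_file():
            shutil.copy2(item, new_dir / item.name)
            copied.append(item.name)
    return copied

# After copying on every node, load the files with
#   nodetool refresh <keyspace> <table>
# or perform a rolling restart.
```

Do this on each node, and only after you have a backup of the current state, since a mistake here (e.g. copying into the wrong CF-ID directory) is hard to undo.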
On 01/03/2021 11:48, Marco Gasparini wrote:
here the previous error:
2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165
validateAndConnectIfNeeded failed to connect to node
{y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
    at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Yes, this node (y.y.y.y) stopped because it ran out of disk space.
I said "deleted" because I'm not a native English speaker :)
I usually remove snapshots via 'nodetool clearsnapshot' or the
cassandra-reaper user interface.
On Mon, 1 Mar 2021 at 12:39, Bowen Song
<[email protected]> wrote:
What was the warning? Is it related to the disk failure policy?
Could you please share the relevant log? You can edit it and
redact the sensitive information before sharing.
Also, I can't help noticing that you used the word "delete"
(instead of "clear") to describe the process of removing
snapshots. May I ask how you deleted the snapshots? Was it
"nodetool clearsnapshot ...", "rm -rf ..." or something else?
On 01/03/2021 11:27, Marco Gasparini wrote:
thanks Bowen for answering
Actually, I checked the server log and the only warning was that
a node went offline.
No, I have no backups or snapshots.
In the meantime I found that Cassandra probably moved all the files
from one directory into the snapshot directory. I am fairly sure of
that because I recently deleted all the snapshots I had made (the
node was running out of disk space), and I found this very directory
full of files whose modification timestamp matches the first error I
got in the log.
On Mon, 1 Mar 2021 at 12:13, Bowen Song
<[email protected]> wrote:
The first thing I'd check is the server log. It may contain
vital information about the cause, and there may be different
ways to recover depending on that cause.
Also, please allow me to ask a seemingly obvious question: do
you have a backup?
On 01/03/2021 09:34, Marco Gasparini wrote:
hello everybody,
This morning, Monday!!!, I was checking on the Cassandra cluster
and I noticed that all data was missing. I saw the following
error on each node (9 nodes in the cluster):
2021-03-01 09:05:52,984 WARN
[MessagingService-Incoming-/x.x.x.x]
IncomingTcpConnection.java:103 run
UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException:
Couldn't find table for cfId
cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just
created, this is likely due to the schema not being fully
propagated. Please wait for schema agreement on table creation.
    at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
    at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
    at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
    at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
I tried to query the keyspace and got this:
node1# cqlsh
Connected to Cassandra Cluster at x.x.x.x:9042.
[cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native
protocol v4]
Use HELP for help.
cqlsh> select * from mykeyspace.mytable where id = 123935;
InvalidRequest: Error from server: code=2200 [Invalid
query] message="Keyspace mykeyspace does not exist"
Investigating on each node I found that all the SSTables
exist, so I think the data is still there but the keyspace
vanished, "magically".
Other facts I can tell you:
* I have been getting anticompaction errors from 2 nodes
because the disk was almost full.
* the cluster was online on Friday
* this morning, Monday, the whole cluster was offline and
I noticed the "missing keyspace" problem
* during the weekend the cluster was subject to
inserts and deletes
* I have a 9 node (HDD) Cassandra 3.11 cluster.
I really need help on this, how can I restore the cluster?
Thank you very much
Marco