The warning message indicates the node y.y.y.y went down (or became unreachable over the network) before 2021-02-28 05:17:33. Is there any chance you can find the log file on that node at around or before that time? It may show why that node went down. The cause may turn out to be unrelated to the missing keyspace, but it is still worth a look in order to prevent the same thing from happening again.
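In case it helps, here is one way to pull out the log entries at or before that timestamp. This is a minimal sketch against a simulated log excerpt; on the node itself the file is usually something like /var/log/cassandra/system.log (the path and log format depend on your install):

```shell
# Simulated excerpt; substitute the node's real system.log for "$log".
log=$(mktemp)
cat > "$log" <<'EOF'
WARN  2021-02-28 05:10:01 DiskUsageMonitor - disk usage exceeds threshold
ERROR 2021-02-28 05:17:20 StorageService - shutting down
INFO  2021-02-28 06:02:11 StorageService - node restarted
EOF

# Print entries at or before the incident time. ISO timestamps compare
# correctly as strings, so a lexical comparison on fields 2 and 3 is enough:
awk '($2 " " $3) <= "2021-02-28 05:17:33"' "$log"
```

The last entries before the node dropped off are usually the interesting ones.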

As Erick said, the table's CF ID isn't new, so it's unlikely to be a schema synchronisation issue. Therefore I also suspect the keyspace was accidentally dropped. Cassandra only logs "Drop Keyspace 'keyspace_name'" on the node that received the "DROP KEYSPACE ..." query, so you may have to search the log files on all nodes to find it.
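To illustrate, the search on each node would look roughly like this. The log line and path below are simulated, and the exact wording of the drop record can vary between Cassandra versions:

```shell
# Simulate one line of a node's system.log; on a real node you would
# grep /var/log/cassandra/system.log* instead (path depends on your install).
log=$(mktemp)
printf '%s\n' \
  'INFO  [MigrationStage:1] 2021-02-27 23:59:59,000 - Drop Keyspace mykeyspace' \
  > "$log"

# Only the coordinator that received the DROP query logs this, so run
# the grep on every node until you find a match:
grep -i "drop keyspace" "$log"
```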

Assuming the keyspace was dropped but you still have the SSTable files, you can recover the data by re-creating the keyspace and tables with an identical replication strategy and schema, then copying the SSTable files into the corresponding new table directories (which will have different CF ID suffixes) on the same node, and finally running "nodetool refresh ..." or restarting the node. Since you don't yet have a full backup, I strongly recommend making a backup, and ideally test-restoring it to a different cluster, before attempting this.
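A rough sketch of the copy step. The directory names, cfId suffixes and md-* file names below are placeholders, and the data directory layout is simulated with temp dirs; on a real node you would work under the actual data directory, typically /var/lib/cassandra/data:

```shell
# Simulated layout: the old table directory (holding the surviving
# SSTables) and the new directory created by re-running the CREATE
# statements. All names here are illustrative only.
DATA=$(mktemp -d)
OLD="$DATA/mykeyspace/mytable-cba90a705c4611e99e36f54fe3235e69"
NEW="$DATA/mykeyspace/mytable-0123456789abcdef0123456789abcdef"
mkdir -p "$OLD" "$NEW"
touch "$OLD/md-1-big-Data.db" "$OLD/md-1-big-Index.db"

# 1) Copy the surviving SSTable files into the new table's directory
#    on the same node:
cp "$OLD"/* "$NEW"/

# 2) Then, on the real node, load them without a full restart:
#    nodetool refresh mykeyspace mytable
ls "$NEW"
```

Repeat this on every node, since each node only holds its own replicas.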


On 01/03/2021 11:48, Marco Gasparini wrote:
here the previous error:

2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165 validateAndConnectIfNeeded failed to connect to node {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
    at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Yes, this node (y.y.y.y) stopped because it ran out of disk space.


I said "deleted" because I'm not a native English speaker :)
I usually "remove" snapshots via 'nodetool clearsnapshot' or the cassandra-reaper user interface.




Il giorno lun 1 mar 2021 alle ore 12:39 Bowen Song <[email protected]> ha scritto:

    What was the warning? Is it related to the disk failure policy?
    Could you please share the relevant log? You can edit it and
    redact the sensitive information before sharing it.

    Also, I can't help noticing that you used the word "delete"
    (instead of "clear") to describe the process of removing
    snapshots. May I ask how you deleted the snapshots? Was it
    "nodetool clearsnapshot ...", "rm -rf ..." or something else?


    On 01/03/2021 11:27, Marco Gasparini wrote:
    thanks Bowen for answering

    Actually, I checked the server log and the only warning was that
    a node went offline.
    No, I have no backups or snapshots.

    In the meantime I found that Cassandra probably moved all the
    files from a table directory into the snapshot directory. I am
    pretty sure of that because I recently deleted all the snapshots
    I had made, since the node was running out of disk space, and I
    then found this very directory full of files whose modification
    timestamp matched the first error in the log.



    Il giorno lun 1 mar 2021 alle ore 12:13 Bowen Song
    <[email protected]> <mailto:[email protected]> ha scritto:

        The first thing I'd check is the server log. The log may
        contain vital information about the cause of it, and that
        there may be different ways to recover from it depending on
        the cause.

        Also, please allow me to ask a seemingly obvious question, do
        you have a backup?


        On 01/03/2021 09:34, Marco Gasparini wrote:
        hello everybody,

        This morning, Monday!!!, I was checking on the Cassandra
        cluster and I noticed that all data was missing. I saw the
        following error on each node (9 nodes in the cluster):

        2021-03-01 09:05:52,984 WARN [MessagingService-Incoming-/x.x.x.x] IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading from socket; closing
        org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
                at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
                at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
                at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
                at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
                at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
                at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
                at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
                at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
        I tried to query the keyspace and got this:

        node1# cqlsh
        Connected to Cassandra Cluster at x.x.x.x:9042.
        [cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native
        protocol v4]
        Use HELP for help.
        cqlsh> select * from mykeyspace.mytable where id = 123935;
        InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace mykeyspace does not exist"
        Investigating each node, I found that all the SSTable files
        still exist, so I think the data is still there but the
        keyspace vanished, "magically".

        Other facts I can tell you are:

          * I have been getting anticompaction errors from 2 nodes
            because their disks were almost full.
          * The cluster was online on Friday.
          * This morning, Monday, the whole cluster was offline and I
            noticed the "missing keyspace" problem.
          * During the weekend the cluster was subject to inserts and
            deletes.
          * It is a 9-node (HDD) Cassandra 3.11 cluster.

        I really need help on this, how can I restore the cluster?

        Thank you very much
        Marco


