Actually, I found a lot of .db files in the following directory:

/var/lib/cassandra/data/mykeyspace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
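
That "dropped-<timestamp>-<table>" name is consistent with Cassandra's
auto_snapshot behaviour: when a keyspace or table is dropped while
auto_snapshot is enabled (the default), the SSTables are snapshotted under
the table's snapshots/ directory instead of being deleted. A quick way to
check that every node still has the files (assuming the default data
directory):

    ls -l /var/lib/cassandra/data/mykeyspace/mytable-*/snapshots/dropped-*/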

I also found this:
2021-03-01 06:08:08,864 INFO  [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'

So I think you are right, @erick and @bowen: something dropped the keyspace.

I will try to follow your procedure, @bowen. Thank you very much!

Do you know what could cause this? It seems like a serious issue. I found
this bug, which may be related:
https://issues.apache.org/jira/browse/CASSANDRA-14957

Thank you @Bowen and @Erick

On Mon, 1 Mar 2021 at 13:39, Bowen Song <bo...@bso.ng.invalid> wrote:

> The warning message indicates the node y.y.y.y went down (or became
> unreachable over the network) before 2021-02-28 05:17:33. Is there any
> chance you can find the log file on that node from around or before that
> time? It may show why that node went down. The reason might be irrelevant
> to the missing keyspace, but it is still worth a look in order to prevent
> the same thing from happening again.
>
> As Erick said, the table's CF ID isn't new, so it's unlikely to be a
> schema synchronization issue. Therefore I also suspect the keyspace was
> accidentally dropped. Cassandra only logs "Drop Keyspace 'keyspace_name'"
> on the node that received the "DROP KEYSPACE ..." query, so you may have to
> search the log files on all nodes to find it.
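>
> A minimal sketch of that search, assuming SSH access to each node and logs
> in the default location (the host names and log path are placeholders):
>
>     # run the same grep on every node; list all 9 node names here
>     for host in node1 node2 node3; do
>         ssh "$host" 'grep -H "Drop Keyspace" /var/log/cassandra/system.log*'
>     done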
>
> Assuming the keyspace was dropped but you still have the SSTable files,
> you can recover the data by re-creating the keyspace and tables with an
> identical replication strategy and schema, then copying the SSTable files
> to the corresponding new table directories (which have different CF ID
> suffixes) on the same node, and finally running "nodetool refresh ..." or
> restarting the node. Since you don't yet have a full backup, I strongly
> recommend you make a backup, and ideally test restoring it to a different
> cluster, before attempting any of this.
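>
> A minimal sketch of that recovery, assuming the original schema is known;
> the replication settings, columns, CF IDs and snapshot timestamp below are
> placeholders, not your real values:
>
>     # 1. re-create the keyspace and table with a schema identical to the original
>     cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};"
>     cqlsh -e "CREATE TABLE mykeyspace.mytable (id bigint PRIMARY KEY);"  # plus the original columns
>
>     # 2. on each node, copy the SSTables from the dropped snapshot into the new
>     #    table directory (note the new CF ID suffix), then load them
>     cp /var/lib/cassandra/data/mykeyspace/mytable-<old_cf_id>/snapshots/dropped-<timestamp>-mytable/* \
>        /var/lib/cassandra/data/mykeyspace/mytable-<new_cf_id>/
>     nodetool refresh mykeyspace mytable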
>
>
> On 01/03/2021 11:48, Marco Gasparini wrote:
>
> Here is the previous error:
>
> 2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165 validateAndConnectIfNeeded failed to connect to node {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
> org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
>         at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
>         at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
>         at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
>         at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
>         at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
>         at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
>         at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
>         at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
>         at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> Yes, this node (y.y.y.y) stopped because it ran out of disk space.
>
>
> I said "deleted" because I'm not a native English speaker :)
> I usually "remove" snapshots via 'nodetool clearsnapshot' or the
> cassandra-reaper user interface.
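>
> For reference, a sketch of those commands (the snapshot tag and keyspace
> name are placeholders):
>
>     # list the snapshots present on this node, then clear one by its tag
>     nodetool listsnapshots
>     nodetool clearsnapshot -t <snapshot_tag> -- mykeyspace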
>
> On Mon, 1 Mar 2021 at 12:39, Bowen Song <bo...@bso.ng.invalid> wrote:
>
>> What was the warning? Is it related to the disk failure policy? Could you
>> please share the relevant log? You can edit it and redact the sensitive
>> information before sharing it.
>>
>> Also, I can't help but notice that you used the word "delete" (instead of
>> "clear") to describe the process of removing snapshots. May I ask how you
>> deleted the snapshots? Was it "nodetool clearsnapshot ...", "rm -rf ..."
>> or something else?
>>
>>
>> On 01/03/2021 11:27, Marco Gasparini wrote:
>>
>> Thanks, Bowen, for answering.
>>
>> Actually, I checked the server log and the only warning was that a node
>> went offline.
>> No, I have no backups or snapshots.
>>
>> In the meantime I found that Cassandra probably moved all the files from
>> the table directory to the snapshot directory. I am fairly sure of that
>> because I recently deleted all the snapshots I had made (the node was
>> running out of disk space), and I found this very directory full of files
>> whose modification timestamp matched the first error I got in the log.
>>
>> On Mon, 1 Mar 2021 at 12:13, Bowen Song <bo...@bso.ng.invalid> wrote:
>>
>>> The first thing I'd check is the server log. It may contain vital
>>> information about the cause, and there may be different ways to recover
>>> depending on what that cause is.
>>>
>>> Also, please allow me to ask a seemingly obvious question: do you have a
>>> backup?
>>>
>>>
>>> On 01/03/2021 09:34, Marco Gasparini wrote:
>>>
>>> Hello everybody,
>>>
>>> This morning (Monday!) I was checking on the Cassandra cluster and
>>> noticed that all data was missing. I saw the following error on each
>>> node (9 nodes in the cluster):
>>>
>>> 2021-03-01 09:05:52,984 WARN  [MessagingService-Incoming-/x.x.x.x] IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading from socket; closing
>>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created, this is likely due to the schema not being fully propagated.  Please wait for schema agreement on table creation.
>>>         at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
>>>         at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
>>>         at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
>>>         at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
>>>         at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
>>>         at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
>>>         at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
>>>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
>>>
>>> I tried to query the keyspace and got this:
>>>
>>> node1# cqlsh
>>> Connected to Cassandra Cluster at x.x.x.x:9042.
>>> [cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native protocol v4]
>>> Use HELP for help.
>>> cqlsh> select * from mykeyspace.mytable where id = 123935;
>>> InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace mykeyspace does not exist"
>>>
>>> Investigating on each node, I found that all the SSTables exist, so I
>>> think the data is still there but the keyspace vanished, "magically".
>>>
>>> Other facts I can tell you are:
>>>
>>>    - I have been getting anticompaction errors from 2 nodes because the
>>>    disk was almost full.
>>>    - The cluster was online on Friday.
>>>    - This morning, Monday, the whole cluster was offline and I noticed
>>>    the "missing keyspace" problem.
>>>    - Over the weekend the cluster was subject to inserts and deletes.
>>>    - It is a 9-node (HDD) Cassandra 3.11 cluster.
>>>
>>> I really need help on this: how can I restore the cluster?
>>>
>>> Thank you very much
>>> Marco
>>>
>>>
