I haven't made any schema modifications for a year or more.
This problem came up during a "normal day of work" for Cassandra.


On Mon, 1 Mar 2021 at 16:25, Bowen Song <bo...@bso.ng.invalid>
wrote:

> Your missing keyspace problem has nothing to do with that bug.
>
> In that case, the same table was created twice in a very short period of
> time, and I suspect that was done concurrently on two different nodes. The
> evidence lies in the two CF IDs - bd7200a0156711e88974855d74ee356f and
> bd750de0156711e8bdc54f7bcdcb851f, which were created at
> 2018-02-19T11:26:33.898 and 2018-02-19T11:26:33.918 respectively, a mere
> 20-millisecond gap between them.
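Those creation times can be read straight out of the CF IDs, which are version-1 (time-based) UUIDs. A minimal Python sketch (the helper name is mine, not from the thread):

```python
import uuid
from datetime import datetime, timedelta, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch,
# expressed in 100-nanosecond intervals (RFC 4122).
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def cfid_timestamp(cfid: str) -> datetime:
    """Extract the creation time embedded in a version-1 UUID,
    such as a Cassandra CF ID (with or without hyphens)."""
    u = uuid.UUID(cfid)
    # u.time is the 60-bit count of 100-ns intervals since 1582-10-15.
    micros = (u.time - UUID_EPOCH_OFFSET) // 10
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=micros)

print(cfid_timestamp("bd7200a0156711e88974855d74ee356f"))
# 2018-02-19 11:26:33.898000+00:00
```

Running it on both CF IDs from the thread reproduces the two timestamps and the 20 ms gap above.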
>
> TBH, it doesn't sound like a bug to me. Cassandra is eventually consistent
> by design. Two conflicting schema changes on two different nodes at
> nearly the same time will likely result in schema disagreement; Cassandra
> will eventually reach agreement again, possibly discarding one of the
> conflicting schema changes together with all data written to the
> discarded table/columns. To make sure this doesn't happen to your data,
> avoid making multiple schema changes to the same keyspace (for
> create/alter/... keyspace) or the same table (for create/alter/... table) on
> two or more Cassandra coordinator nodes in a very short period of time.
> Instead, send all your schema change queries to the same coordinator node,
> or, if that's not possible, wait at least 30 seconds between two schema
> changes and make sure you aren't restarting any node at the same time.
>
> On 01/03/2021 14:04, Marco Gasparini wrote:
>
> actually I found a lot of .db files in the following directory:
>
> /var/lib/cassandra/data/mykespace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
>
> I also found this:
>              2021-03-01 06:08:08,864 INFO  [Native-Transport-Requests-1]
> MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'
>
> so I think that you, @erick and @bowen, are right. Something dropped the
> keyspace.
>
> I will try to follow your procedure @bowen, thank you very much!
>
> Do you know what could cause this issue?
> It seems like a big issue. I found this bug
> https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel,
> maybe they are related...
>
> Thank you @Bowen and @Erick
>
>
> On Mon, 1 Mar 2021 at 13:39, Bowen Song <bo...@bso.ng.invalid>
> wrote:
>
>> The warning message indicates the node y.y.y.y went down (or was
>> unreachable via the network) before 2021-02-28 05:17:33. Is there any chance
>> you can find the log file on that node at around or before that time? It
>> may show why that node went down. The reason might turn out to be unrelated
>> to the missing keyspace, but it's still worth a look in order to prevent
>> the same thing from happening again.
>>
>> As Erick said, the table's CF ID isn't new, so it's unlikely to be a
>> schema synchronization issue. Therefore I also suspect the keyspace was
>> accidentally dropped. Cassandra only logs "Drop Keyspace 'keyspace_name'"
>> on the node that received the "DROP KEYSPACE ..." query, so you may have to
>> search for it in the log files on all nodes to find it.
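Searching every node's logs for that line can be scripted. A minimal sketch, matching the MigrationManager line format quoted later in this thread (the function name and log directory are assumptions; adjust the path to your install):

```python
import re
from pathlib import Path

# Matches lines like:
# "2021-03-01 06:08:08,864 INFO ... announceKeyspaceDrop Drop Keyspace 'mykeyspace'"
DROP_RE = re.compile(r"^(\S+ \S+).*Drop Keyspace '([^']+)'")

def find_keyspace_drops(log_dir):
    """Scan every system.log* file under log_dir and return
    (timestamp, keyspace, filename) tuples for keyspace drops."""
    hits = []
    for path in sorted(Path(log_dir).glob("system.log*")):
        for line in path.read_text(errors="replace").splitlines():
            m = DROP_RE.search(line)
            if m:
                hits.append((m.group(1), m.group(2), path.name))
    return hits

# Run on each node; the log directory below is an assumption:
# print(find_keyspace_drops("/var/log/cassandra"))
```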
>>
>> Assuming the keyspace was dropped but you still have the SSTable files,
>> you can recover the data by re-creating the keyspace and tables with
>> identical replication strategy and schema, then copy the SSTable files to
>> the corresponding new table directories (with different CF ID suffixes) on
>> the same node, and finally run "nodetool refresh ..." or restart the node.
>> Since you don't yet have a full backup, I strongly recommend you make a
>> backup, and ideally test restoring it to a different cluster, before
>> attempting this.
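The copy step of that procedure can be sketched as follows. This is a minimal illustration under the assumptions above (keyspace and tables already re-created with identical schema); the function name is mine and the example paths are illustrative, not from the thread. After copying on each node, "nodetool refresh <keyspace> <table>" or a node restart is still needed so Cassandra picks the files up.

```python
import shutil
from pathlib import Path

def copy_sstables(snapshot_dir, new_table_dir):
    """Copy SSTable component files from a snapshot directory into the
    re-created table's directory (which has a new CF ID suffix).
    Returns the sorted list of copied file names."""
    src = Path(snapshot_dir)
    dst = Path(new_table_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for f in src.iterdir():
        # SSTable components: Data.db, Index.db, Statistics.db, TOC.txt, etc.
        if f.is_file():
            shutil.copy2(f, dst / f.name)
            copied.append(f.name)
    return sorted(copied)

# Illustrative paths only -- substitute your real keyspace/table/CF IDs:
# copy_sstables(
#     "/var/lib/cassandra/data/mykeyspace/mytable-<old_cfid>/snapshots/dropped-.../",
#     "/var/lib/cassandra/data/mykeyspace/mytable-<new_cfid>/")
```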
>>
>>
>> On 01/03/2021 11:48, Marco Gasparini wrote:
>>
>> here the previous error:
>>
>> 2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165
>> validateAndConnectIfNeeded failed to connect to node
>> {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
>> org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
>>     at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
>>     at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
>>     at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
>>     at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
>>     at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
>>     at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
>>     at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
>>     at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
>>     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>>
>> Yes, this node (y.y.y.y) stopped because it ran out of disk space.
>>
>>
>> I said "deleted" because I'm not a native English speaker :)
>> I usually remove snapshots via 'nodetool clearsnapshot' or the
>> cassandra-reaper user interface.
>>
>>
>>
>>
>> On Mon, 1 Mar 2021 at 12:39, Bowen Song <bo...@bso.ng.invalid>
>> wrote:
>>
>>> What was the warning? Is it related to the disk failure policy? Could
>>> you please share the relevant log? You can edit it and redact the sensitive
>>> information before sharing it.
>>>
>>> Also, I can't help noticing that you used the word "delete" (instead of
>>> "clear") to describe the process of removing snapshots. May I ask how you
>>> deleted the snapshots? Was it "nodetool clearsnapshot ...", "rm -rf ..."
>>> or something else?
>>>
>>>
>>> On 01/03/2021 11:27, Marco Gasparini wrote:
>>>
>>> thanks Bowen for answering
>>>
>>> Actually, I checked the server log and the only warning was that a node
>>> went offline.
>>> No, I have no backups or snapshots.
>>>
>>> In the meantime I found that Cassandra probably moved all the files from the
>>> table directory to the snapshot directory. I am fairly sure of that because I
>>> recently deleted all the snapshots I had made (the node was running out of
>>> disk space), and I found this very directory full of files whose
>>> modification timestamp matched the first error I got in the log.
>>>
>>>
>>>
>>> On Mon, 1 Mar 2021 at 12:13, Bowen Song
>>> <bo...@bso.ng.invalid> wrote:
>>>
>>>> The first thing I'd check is the server log. It may contain vital
>>>> information about the cause, and there may be different ways to
>>>> recover depending on that cause.
>>>>
>>>> Also, please allow me to ask a seemingly obvious question, do you have
>>>> a backup?
>>>>
>>>>
>>>> On 01/03/2021 09:34, Marco Gasparini wrote:
>>>>
>>>> hello everybody,
>>>>
>>>> This morning, Monday!!!, I was checking on the Cassandra cluster and I
>>>> noticed that all the data was missing. I saw the following error on each
>>>> node (9 nodes in the cluster):
>>>>
>>>>
>>>> 2021-03-01 09:05:52,984 WARN  [MessagingService-Incoming-/x.x.x.x]
>>>> IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading
>>>> from socket; closing
>>>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table
>>>> for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created,
>>>> this is likely due to the schema not being fully propagated. Please wait
>>>> for schema agreement on table creation.
>>>>     at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
>>>>     at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
>>>>     at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
>>>>     at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
>>>>     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
>>>>
>>>> I tried to query the keyspace and got this:
>>>>
>>>> node1# cqlsh
>>>> Connected to Cassandra Cluster at x.x.x.x:9042.
>>>> [cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 | Native protocol v4]
>>>> Use HELP for help.
>>>> cqlsh> select * from mykeyspace.mytable  where id = 123935;
>>>> InvalidRequest: Error from server: code=2200 [Invalid query]
>>>> message="Keyspace mykeyspace does not exist"
>>>>
>>>> Investigating on each node, I found that all the SSTable files exist, so I
>>>> think the data is still there but the keyspace vanished, "magically".
>>>>
>>>> Other facts I can tell you are:
>>>>
>>>>    - I have been getting anticompaction errors from 2 nodes because
>>>>    their disks were almost full.
>>>>    - The cluster was online on Friday.
>>>>    - This morning, Monday, the whole cluster was offline and I noticed
>>>>    the missing-keyspace problem.
>>>>    - During the weekend the cluster handled both inserts and deletes.
>>>>    - It is a 9-node (HDD) Cassandra 3.11 cluster.
>>>>
>>>> I really need help with this. How can I restore the cluster?
>>>>
>>>> Thank you very much
>>>> Marco
>>>>
