Avraham Kalvo created CASSANDRA-14957:
-----------------------------------------

             Summary: Rolling Restart Of Nodes Cause Dataloss Due To Schema Collision
                 Key: CASSANDRA-14957
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14957
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Schema
            Reporter: Avraham Kalvo


We were performing a rolling restart on a mission-critical five-node C* cluster.
The first node to be restarted logged the following messages in its system.log:


```
January 2nd 2019, 12:06:37.310 - INFO 12:06:35 Initializing tasks_scheduler_external.tasks
```
```
WARN 12:06:39 UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId bd7200a0-1567-11e8-8974-855d74ee356f. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
    at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1336) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:660) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:635) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:349) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:286) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:98) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:201) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:178) ~[apache-cassandra-3.0.10.jar:3.0.10]
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:92) ~[apache-cassandra-3.0.10.jar:3.0.10]
```

The warning above was then repeated several times across the cluster.
We later found that the table in question,
`tasks_scheduler_external.tasks`, had been created with a new schema version
once every node in the cluster had been restarted in turn and schema agreement
had settled. The new version started taking requests, leaving the previous
version of the schema unavailable to any request and causing data loss for our
online system.
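
To gauge how often the warning occurred and which table IDs it named, a log
scan along the following lines may help. This is a minimal sketch, not part of
the original investigation: the function name `count_unknown_cfids` is
hypothetical, and the pattern assumes the message text appears on a single line
in system.log, worded as in the excerpt above.

```python
import re
from collections import Counter

# Matches the message text quoted in the log excerpt and captures the cfId (a UUID).
CFID_RE = re.compile(
    r"Couldn't find table for cfId "
    r"([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def count_unknown_cfids(lines):
    """Return a Counter mapping each unresolved cfId to its number of occurrences.

    `lines` is any iterable of log lines, e.g. an open system.log file.
    """
    counts = Counter()
    for line in lines:
        match = CFID_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts
```

Passing an open log file to `count_unknown_cfids` and printing
`most_common()` on the result lists the affected table IDs by frequency.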

We recovered the data by manually copying the SSTables from the table's
previous-version directory to the new one, then running `nodetool refresh` on
the relevant table.
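
The copy step could be sketched roughly as follows. This is an illustration
under stated assumptions, not the exact procedure we ran: the helper names
`copy_sstables` and `refresh_command` are hypothetical, both directories are
assumed to be on the local node, and the sketch refuses to overwrite anything
because files with the same name in both directories would need manual renaming
first.

```python
import shutil
from pathlib import Path


def copy_sstables(old_dir, new_dir):
    """Copy SSTable component files from old_dir into new_dir.

    Raises FileExistsError instead of overwriting a file already in new_dir.
    Returns the names of the files that were copied.
    """
    old_dir, new_dir = Path(old_dir), Path(new_dir)
    # Only top-level regular files: subdirectories such as snapshots/ are skipped.
    sources = sorted(p for p in old_dir.iterdir() if p.is_file())
    clashes = [p.name for p in sources if (new_dir / p.name).exists()]
    if clashes:
        raise FileExistsError(f"would overwrite in {new_dir}: {clashes}")
    for src in sources:
        shutil.copy2(src, new_dir / src.name)  # copy2 also preserves timestamps
    return [p.name for p in sources]


def refresh_command(keyspace, table):
    """Build the nodetool invocation that loads newly placed SSTables."""
    return ["nodetool", "refresh", keyspace, table]
```

After the copy, the list returned by
`refresh_command("tasks_scheduler_external", "tasks")` can be handed to
`subprocess.run` on the node that owns the files.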

The same thing happened to several tables across various keyspaces.
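
One way to list the affected tables is to look for table names that have more
than one directory on disk. The sketch below assumes the Cassandra 3.x layout
`<data root>/<keyspace>/<table>-<table id as 32 hex digits>`; the function name
`find_duplicated_tables` is hypothetical, and the data root must be supplied by
the caller.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed directory naming: "<table>-<table id as 32 lowercase hex digits>".
TABLE_DIR_RE = re.compile(r"^(?P<table>.+)-(?P<table_id>[0-9a-f]{32})$")


def find_duplicated_tables(data_root):
    """Map (keyspace, table) to its directories when more than one exists."""
    found = defaultdict(list)
    for keyspace_dir in sorted(Path(data_root).iterdir()):
        if not keyspace_dir.is_dir():
            continue
        for table_dir in sorted(keyspace_dir.iterdir()):
            match = TABLE_DIR_RE.match(table_dir.name)
            if table_dir.is_dir() and match:
                key = (keyspace_dir.name, match.group("table"))
                found[key].append(table_dir)
    # Keep only tables that exist under more than one table ID.
    return {key: dirs for key, dirs in found.items() if len(dirs) > 1}
```

A table that appears in the result is only a candidate: a table that was
deliberately dropped and re-created would also leave two directories behind.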

One other thing to mention is that a repair was running on the first node to
be restarted. It was obviously stopped when the daemon was shut down, but at
first glance this doesn't seem to have anything to do with the above.

Seems somewhat related to:
https://issues.apache.org/jira/browse/CASSANDRA-13559



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)