[ https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Avraham Kalvo updated CASSANDRA-14957:
--------------------------------------
    Summary: Rolling Restart Of Nodes Causes Dataloss Due To Schema Collision  (was: Rolling Restart Of Nodes Cause Dataloss Due To Schema Collision)

> Rolling Restart Of Nodes Causes Dataloss Due To Schema Collision
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-14957
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14957
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Schema
>            Reporter: Avraham Kalvo
>            Priority: Major
>
> We were issuing a rolling restart on a mission-critical five node C* cluster.
> The first node which was restarted got the following messages in its
> system.log:
> ```
> January 2nd 2019, 12:06:37.310 - INFO 12:06:35 Initializing
> tasks_scheduler_external.tasks
> ```
> ```
> WARN 12:06:39 UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for
> cfId bd7200a0-1567-11e8-8974-855d74ee356f. If a table was just created, this
> is likely due to the schema not being fully propagated. Please wait for
> schema agreement on table creation.
>     at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1336) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:660) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:635) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:349) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:286) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:98) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:201) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:178) ~[apache-cassandra-3.0.10.jar:3.0.10]
>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:92) ~[apache-cassandra-3.0.10.jar:3.0.10]
> ```
> The latter warning was then repeated several times across the cluster.
> We then found that the table in question, `tasks_scheduler_external.tasks`,
> was created with a new schema version at some point during the rolling
> restart of the cluster. Once schema agreement settled, the new version became
> available and started taking requests, leaving the previous version of the
> schema unavailable for any request and causing data loss for our online
> system.
> Data loss was recovered by manually copying SSTables from the previous
> version directory of the schema to the new one, followed by `nodetool refresh`
> on the relevant table.
> The above repeated itself for several tables across various keyspaces.
> One other thing to mention is that a repair was in progress on the first node
> to be restarted. It was obviously stopped when the daemon was shut down, but
> at first glance this doesn't seem to have anything to do with the above.
> Seems somewhat related to:
> https://issues.apache.org/jira/browse/CASSANDRA-13559
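
For anyone checking whether a node shows the same symptom, here is a minimal sketch, not taken from the report. It assumes the usual on-disk layout in which each table lives in `<data_dir>/<keyspace>/<table>-<table id as 32 hex digits>`, so a table that exists under two schema versions shows up as two directories with the same name prefix. The function names `cfid_to_dir_suffix` and `find_duplicate_table_dirs` are hypothetical helpers, not part of Cassandra.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed layout: <data_dir>/<keyspace>/<table>-<32 hex digits of the table id>
TABLE_DIR = re.compile(r"^(?P<table>.+)-(?P<cfid>[0-9a-f]{32})$")


def cfid_to_dir_suffix(cfid: str) -> str:
    """Turn a cfId as printed in system.log into the directory-suffix form."""
    return cfid.replace("-", "").lower()


def find_duplicate_table_dirs(data_dir):
    """Return {(keyspace, table): [dirs]} for tables with more than one directory."""
    found = defaultdict(list)
    for ks_dir in Path(data_dir).iterdir():
        if not ks_dir.is_dir():
            continue
        for table_dir in ks_dir.iterdir():
            match = TABLE_DIR.match(table_dir.name)
            if table_dir.is_dir() and match:
                found[(ks_dir.name, match["table"])].append(table_dir)
    # Keep only tables that exist under more than one table id.
    return {key: sorted(dirs) for key, dirs in found.items() if len(dirs) > 1}
```

Pointing `find_duplicate_table_dirs` at a node's data directory (the path depends on the installation) lists candidate tables, and `cfid_to_dir_suffix("bd7200a0-1567-11e8-8974-855d74ee356f")` gives the suffix of the directory that belongs to the table ID named in the warning. The sketch only reports directories; it does not copy SSTables or run `nodetool refresh`.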