[ 
https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741791#comment-16741791
 ] 

Vincent White commented on CASSANDRA-14957:
-------------------------------------------

I believe this can happen due to a race condition when issuing create statements. 
If a CREATE TABLE statement for the same name is issued on two different 
nodes at the same time, the series of events will look like the following:

Node1 will create and propagate the table as normal, with column family ID 
cf_id1.

If Node2 gets past the check below before receiving the schema change from 
Node1, then Node2 will continue executing its CREATE TABLE statement as 
normal, except with its own column family ID, cf_id2.
{code:java|title=org/apache/cassandra/service/MigrationManager.java:378}
        // If we have a table or a view which has the same name, we can't add a new one
        else if (throwOnDuplicate && ksm.getTableOrViewNullable(cfm.cfName) != null)
            throw new AlreadyExistsException(cfm.ksName, cfm.cfName);

        logger.info("Create new table: {}", cfm);
        announce(SchemaKeyspace.makeCreateTableMutation(ksm, cfm, timestamp), announceLocally);
{code}
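As a rough illustration of why the check does not protect against concurrent creates, here is a minimal sketch. It is a hypothetical model, not Cassandra code: the `Node` class, its `liveSchema` map, and the method names are made up for illustration, and a random UUID stands in for whatever ID each coordinator generates. The point is only that each node consults its own local view, so both checks can pass before either schema change has propagated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical model (not Cassandra code): each node only sees its own schema view.
class CreateTableRaceSketch {
    static class Node {
        // table name -> column family ID, as known to this node only
        final Map<String, UUID> liveSchema = new HashMap<>();

        // Stand-in for the duplicate-name check: consults the local view only.
        boolean passesDuplicateCheck(String table) {
            return !liveSchema.containsKey(table);
        }

        // Each coordinator generates its own ID for the table it creates.
        UUID createLocally(String table) {
            UUID id = UUID.randomUUID(); // stand-in for the real ID generation
            liveSchema.put(table, id);
            return id;
        }
    }

    // Returns the IDs produced when both nodes run the check before
    // either has received the other's schema change.
    static UUID[] simulate() {
        Node node1 = new Node();
        Node node2 = new Node();
        String table = "tasks";

        // Both checks happen before any propagation, so both pass.
        boolean ok1 = node1.passesDuplicateCheck(table);
        boolean ok2 = node2.passesDuplicateCheck(table);

        UUID cfId1 = ok1 ? node1.createLocally(table) : null;
        UUID cfId2 = ok2 ? node2.createLocally(table) : null;
        return new UUID[] { cfId1, cfId2 };
    }

    public static void main(String[] args) {
        UUID[] ids = simulate();
        System.out.println("cf_id1 = " + ids[0]);
        System.out.println("cf_id2 = " + ids[1]);
        System.out.println("same name, different IDs: " + !ids[0].equals(ids[1]));
    }
}
```

In this model the same table name ends up with two different IDs, which is the starting point for the sequence described next.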
Node2 will send out its own set of schema mutations as normal via announce(). 
Every node that receives this change, including Node2 itself, will write 
the schema changes to disk 
(*org/apache/cassandra/schema/SchemaKeyspace.java:1390*) before attempting to 
merge them with its live schema. During that merge, 
*org.apache.cassandra.config.CFMetaData#validateCompatibility* will throw a 
configuration exception and prevent the new change from being merged. The changes 
already written to disk are not rolled back.
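The ordering problem can be sketched as follows. This is a simplified model under stated assumptions, not the actual implementation: the `onDisk` and `live` maps are hypothetical stand-ins for the persisted schema tables and the in-memory schema, and `validateCompatibility` here is a toy check that only compares IDs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model (not Cassandra code) of "persist first, merge second".
class PersistBeforeMergeSketch {
    final Map<String, String> onDisk = new HashMap<>(); // stand-in for the persisted schema
    final Map<String, String> live = new HashMap<>();   // stand-in for the live schema

    // Toy compatibility check: rejects an ID change for an existing table.
    static void validateCompatibility(String currentId, String incomingId) {
        if (currentId != null && !currentId.equals(incomingId))
            throw new IllegalStateException("column family ID mismatch");
    }

    // The disk write happens before validation, and nothing undoes it on failure.
    void applySchemaChange(String table, String cfId) {
        onDisk.put(table, cfId);
        validateCompatibility(live.get(table), cfId);
        live.put(table, cfId);
    }

    // Applies Node1's change, then Node2's conflicting change, on one node.
    static PersistBeforeMergeSketch simulate() {
        PersistBeforeMergeSketch node = new PersistBeforeMergeSketch();
        node.applySchemaChange("tasks", "cf_id1");
        try {
            node.applySchemaChange("tasks", "cf_id2");
        } catch (IllegalStateException e) {
            // Merge rejected; the earlier disk write stays in place.
        }
        return node;
    }

    public static void main(String[] args) {
        PersistBeforeMergeSketch node = simulate();
        System.out.println("live schema ID:    " + node.live.get("tasks"));
        System.out.println("on-disk schema ID: " + node.onDisk.get("tasks"));
    }
}
```

After the rejected merge, the model's live schema still holds cf_id1 while the persisted copy holds cf_id2, which is the divergence described above.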

All nodes will continue to use the table definition in their live schema 
(cf_id1), and everything will continue to work as if the second 
CREATE TABLE statement had been ignored. The issue is that all nodes now have the 
wrong column family ID (cf_id2) recorded in their *system_schema.tables* system table. 
When the nodes restart, they will read their schema from disk and start using the 
wrong column family ID, at which point they will create a new, empty directory on 
disk for it and you will start seeing the types of errors you've mentioned.
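To make the restart step concrete, here is a small sketch of why the existing data appears to vanish. It assumes table directories are named `<table>-<ID with dashes removed>`; the `dataDirs` map, the SSTable count, and the second ID are made up for illustration, while the first ID is the one from the log in the issue description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical model (not Cassandra code) of directory lookup after a restart.
class RestartDirectorySketch {
    // Assumed naming scheme: <table>-<ID with dashes removed>.
    static String directoryFor(String table, UUID cfId) {
        return table + "-" + cfId.toString().replace("-", "");
    }

    // Returns how many SSTables the node sees for the table after restarting
    // with the ID read from disk. A missing directory is created empty.
    static int sstablesVisibleAfterRestart(Map<String, Integer> dataDirs,
                                           String table, UUID idOnDisk) {
        return dataDirs.computeIfAbsent(directoryFor(table, idOnDisk), d -> 0);
    }

    public static void main(String[] args) {
        // ID taken from the log in the issue description.
        UUID liveId = UUID.fromString("bd7200a0-1567-11e8-8974-855d74ee356f");
        // Made-up stand-in for the ID that was written to disk (cf_id2).
        UUID idOnDisk = UUID.fromString("00000000-0000-0000-0000-000000000002");

        Map<String, Integer> dataDirs = new HashMap<>();
        dataDirs.put(directoryFor("tasks", liveId), 12); // 12 is an arbitrary SSTable count

        int visible = sstablesVisibleAfterRestart(dataDirs, "tasks", idOnDisk);
        System.out.println("SSTables visible after restart: " + visible);
        System.out.println("Directories on disk: " + dataDirs.keySet());
    }
}
```

In this model the original directory and its SSTables are still present; the node simply no longer looks there, which is consistent with the data being recoverable by copying the files across.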

This isn't limited to corrupting the column family ID; I believe it can 
apply to any part of the column family definition.

I believe this is solved in trunk with the changes introduced as part of 
CASSANDRA-10699.

> Rolling Restart Of Nodes Causes Dataloss Due To Schema Collision
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-14957
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14957
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Schema
>            Reporter: Avraham Kalvo
>            Priority: Major
>
> We were issuing a rolling restart on a mission-critical five node C* cluster.
> The first node which was restarted got the following messages in its 
> system.log:
> ```
> January 2nd 2019, 12:06:37.310 - INFO 12:06:35 Initializing tasks_scheduler_external.tasks
> ```
> ```
> WARN 12:06:39 UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId bd7200a0-1567-11e8-8974-855d74ee356f. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
> at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1336) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:660) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:635) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:349) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:286) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.MessageIn.read(MessageIn.java:98) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:201) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:178) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:92) ~[apache-cassandra-3.0.10.jar:3.0.10]
> ```
> The latter was then repeated several times across the cluster.
> It was then found that the table in question, 
> `tasks_scheduler_external.tasks`, had been created with a new schema version 
> at some point during the rolling restart of the cluster. Once schema 
> agreement settled, the new version became available and started taking 
> requests, leaving the previous version of the schema unavailable for any 
> request and causing data loss in our online system.
> The data was recovered by manually copying SSTables from the table's previous 
> directory to the new one, followed by `nodetool refresh` on the relevant 
> table.
> The above has repeated itself for several tables across various keyspaces.
> One other thing to mention is that a repair was running on the first node 
> to be restarted. It was obviously stopped when the daemon was shut down, but 
> at first glance this doesn't seem to have anything to do with the above.
> Seems somewhat related to:
> https://issues.apache.org/jira/browse/CASSANDRA-13559


