Tom van der Woerdt created CASSANDRA-12131:
----------------------------------------------

             Summary: system_schema corruption causing nodes to not restart
                 Key: CASSANDRA-12131
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12131
             Project: Cassandra
          Issue Type: Bug
            Reporter: Tom van der Woerdt
            Priority: Critical


Symptoms :
* Existing nodes fail to restart
* system_schema has broken data
* `nodetool describecluster` shows a full disagreement

This happened on two clusters I manage, and so far I have no idea why. I'll 
describe symptoms and info on what I did to (partially) resolve this. Hope the 
actual bug can get fixed.

All clusters run with the binary distribution from cassandra.apache.org. One 
cluster runs on CentOS 6, the other CentOS 7, but both with Java 8u77. The 
issue was seen on version 3.0.4 and during an upgrade from 3.0.6 to 3.0.7.

** Cluster 1 **
Version: 3.0.4
Hardware: 2 datacenters, 3 machines each
Network: 1Gbit, <1ms within the dc, <20ms cross-dc

This happened several months ago. I found out the hard way that every node had 
a different schema_version when I tried to restart a node and it didn't come 
back. Assuming it was just a single unhappy node, I ignored it and restarted a 
second node (in a different datacenter) which also did not come back.

I like my quorums so I didn't restart the other nodes. `nodetool 
describecluster` showed that every node had a different schema version. 
Querying system_schema showed a lot of records with their keys set to 
`\0\0\0\0(...)\0\0`. Cassandra logs indicated corrupted data, which was then 
fixed by running scrub.

Of course that didn't actually fix the data and using CQL I removed most of the 
rows in system_schema that looked wrong. After doing that `nodetool 
describecluster' agreed on a schema version again. I've attached the python 
script I used to remove the records from the 'columns' table (fix.py), similar 
scripts were used for other tables.

That didn't actually remove all the records, some proved impossible to delete :
{code}
# Partial output from the query "select * from system_schema.columns"
|       regular |       -1 |                    text
 system_distributed | 
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
 | 
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
 |             none |        exception_stacktrace 
|       regular |       -1 |                    text
 system_distributed | 
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
 |                                     
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 |             none |               
  finished_at 
|       regular |       -1 |               timestamp
 system_distributed | 
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
 |                             
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 |             none |       
        keyspace_name
{code}

... so I just left those there as it doesn't seem to impact the cluster other 
than spewing this error every minute :
{code}
ERROR [CompactionExecutor:20] 2016-07-04 14:19:59,798 CassandraDaemon.java:201 
- Exception in thread Thread[CompactionExecutor:20,1,main]
java.lang.AssertionError: Invalid clustering for the table: 
org.apache.cassandra.db.Clustering$2@661b79a
        at 
org.apache.cassandra.db.Clustering$Serializer.serialize(Clustering.java:136) 
~[apache-cassandra-3.0.7.jar:3.0.7]
        at 
org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:159)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
        at 
org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:108)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
        at 
org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:144) 
~[apache-cassandra-3.0.7.jar:3.0.7]
        at 
org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:112) 
~[apache-cassandra-3.0.7.jar:3.0.7]
        at 
org.apache.cassandra.db.ColumnIndex.writeAndBuildIndex(ColumnIndex.java:52) 
~[apache-cassandra-3.0.7.jar:3.0.7]
{code}

The cluster works fine now, minus the phantom rows and minutely error on every 
node. As for the two boxes that got killed, they were `removenode`d and added 
back, somewhere in this process.

** Cluster 2 **
Version: 3.0.6
Hardware: 3 datacenters, 13 machines total
Network: 1Gbit, <1ms within the dc, <50ms cross-dc

This is a cluster I use for tests, which involves doing a lot of keyspace 
changes. While doing a 3.0.6->3.0.7 upgrade this morning I noticed that the 
first box I wanted to upgrade immediately didn't come back. Instead of trying 
to fix it (or just rebuilding the cluster) I left it like that and am now 
filing this report.

Startup on this node fails with :
{code}
ERROR [main] 2016-07-04 09:58:44,306 CassandraDaemon.java:698 - Exception 
encountered during startup
java.lang.AssertionError: null
    at 
org.apache.cassandra.config.ColumnDefinition.<init>(ColumnDefinition.java:155) 
~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.createColumnFromRow(SchemaKeyspace.java:1015)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.lambda$fetchColumns$12(SchemaKeyspace.java:995)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
    at java.lang.Iterable.forEach(Iterable.java:75) ~[na:1.8.0_77]
    at 
org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:995)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:949) 
~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:928) 
~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:891)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:868)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:856)
 ~[apache-cassandra-3.0.7.jar:3.0.7]
    at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:136) 
~[apache-cassandra-3.0.7.jar:3.0.7]
    at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:126) 
~[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:235) 
[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:557) 
[apache-cassandra-3.0.7.jar:3.0.7]
    at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:685) 
[apache-cassandra-3.0.7.jar:3.0.7]
{code}

`nodetool status -r` :
{code}
Datacenter: One
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                      Load       Tokens       Owns (effective)  Host 
ID                               Rack         
DN  cassandra1.dc1.mydomain.com  10.38 MB   256          6.8%              
7470d016-9a45-4e00-819a-77d7e09a14a2  1r1
UN  cassandra2.dc1.mydomain.com  8.64 MB    256          7.3%              
cb93240d-b1c6-47f0-a1bb-59e4ae127a1f  1r2
UN  cassandra3.dc1.mydomain.com  11.32 MB   256          7.6%              
ff6b3342-8142-42ba-8dd0-da00cd4ae95f  1r3
UN  cassandra4.dc1.mydomain.com  12.46 MB   256          7.2%              
91fad227-b394-4e25-be65-0f34a9dbbf9b  1r4
UN  cassandra5.dc1.mydomain.com  12.03 MB   256          8.4%              
74d98f17-df0b-40f2-b23b-7c6e5f49c2d7  1r5
Datacenter: Two
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                      Load       Tokens       Owns (effective)  Host 
ID                               Rack         
UN  cassandra1.dc2.mydomain.com  10.39 MB   256          7.8%              
f49efc68-d530-4074-912a-b008f578c9d0  2r1
UN  cassandra2.dc2.mydomain.com  8.23 MB    256          8.5%              
b339a66e-4ef7-43c2-9507-9ac23dd7ad5c  2r2
UN  cassandra3.dc2.mydomain.com  10.34 MB   256          7.2%              
28d51ab8-5ee2-41a7-9e93-247fdf9f6d85  2r3
Datacenter: Three
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                      Load       Tokens       Owns (effective)  Host 
ID                               Rack         
UN  cassandra1.dc3.mydomain.com  9.47 MB    256          7.2%              
bbd06f32-6d40-49f4-b71c-30227aac20f1  3r1
UN  cassandra2.dc3.mydomain.com  9.88 MB    256          7.7%              
2789cffd-db20-47b9-962e-193326660345  3r2
UN  cassandra3.dc3.mydomain.com  11.36 MB   256          8.5%              
9a11ad49-112b-4b43-b937-f5e12176d725  3r3
UN  cassandra4.dc3.mydomain.com  11.77 MB   256          7.6%              
1009f985-2229-45c6-88c5-64ee508c4c3c  3r4
UN  cassandra5.dc3.mydomain.com  11.11 MB   256          7.9%              
4cbac3e8-c412-4375-ba2b-354a0bd81df8  3r5
{code}

`nodetool describecluster` :
{code}
Cluster Information:
    Name: my_cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        c478ac2f-c773-370c-aeca-d1a7169ad092: [10.xxx.xxx.xxx]
        35f98fc6-3ddc-3358-9c92-d5a251ebc844: [10.xxx.xxx.xxx]
        a1573012-90a1-303f-81af-2ddc387cfc98: [10.xxx.xxx.xxx]
        c4a86820-60ea-371e-a24c-31b2040d18f1: [10.xxx.xxx.xxx]
        1a734c68-c72f-3f0e-ac51-6fadc7854447: [10.xxx.xxx.xxx]
        5042d7d8-c1d2-334c-95ce-443260401940: [10.xxx.xxx.xxx]
        dfc67ce1-5422-30e8-a533-9c2f0c2f7ad9: [10.xxx.xxx.xxx]
        0f32b476-0e6f-3064-8795-5d8adc2b3704: [10.xxx.xxx.xxx]
        31b66ee1-9447-39ff-9953-bad4b01ba87b: [10.xxx.xxx.xxx]
        7bb3cee9-eef5-356a-b435-9500550fda00: [10.xxx.xxx.xxx]
        6adcfe50-2a16-3bc5-93d0-006481c6217e: [10.xxx.xxx.xxx]
        5bb7c619-3e64-3ae0-b50e-8a6b5af78b1a: [10.xxx.xxx.xxx]
        UNREACHABLE: [10.xxx.xxx.xxx]
{code}

Like the other cluster, this cluster has a corrupted system_schema. Partial 
output from "select * from system_schema.keyspaces" :
{code}
 keyspace_name                                | durable_writes | replication
----------------------------------------------+----------------+----------------------------------------------------------------------------------------------------------
                                  system_auth |           True | {'One': '5', 
'Two': '3', 'Three': '5', 'class': 
'org.apache.cassandra.locator.NetworkTopologyStrategy'}
                                system_schema |           True |                
                                  {'class': 
'org.apache.cassandra.locator.LocalStrategy'}
 \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 |          False |                
                                                                                
         {}
{code}

The cluster is still up and is able to take reads and writes. In cqlsh's `desc 
keyspaces` I see an additional keyspace that pretends to be an empty string :

{code}
cassandra@cqlsh> desc keyspaces;

system_distributed
system_schema                system               system_traces     
""                system_auth        
{code}

Very curious.

I left the second cluster in this state so that I can answer questions here on 
Jira, if needed.

This issue can potentially destroy a cluster, so I'm marking this as critical. 
The fix for broken nodes seems to be to run my fix.py against every node and 
against every table in system_schema, after running a scrub on those same nodes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to