[ https://issues.apache.org/jira/browse/CASSANDRA-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tom van der Woerdt updated CASSANDRA-12131: ------------------------------------------- Description: h2. Symptoms * Existing nodes fail to restart * system_schema has broken data * `nodetool describecluster` shows a full disagreement This happened on two clusters I manage, and so far I have no idea why. I'll describe symptoms and info on what I did to (partially) resolve this. Hope the actual bug can get fixed. All clusters run with the binary distribution from cassandra.apache.org. One cluster runs on CentOS 6, the other CentOS 7, but both with Java 8u77. The issue was seen on version 3.0.4 and during an upgrade from 3.0.6 to 3.0.7. h2. Cluster 1 Version: 3.0.4 Hardware: 2 datacenters, 3 machines each Network: 1Gbit, <1ms within the dc, <20ms cross-dc This happened several months ago. I found out the hard way that every node had a different schema_version when I tried to restart a node and it didn't come back. Assuming it was just a single unhappy node, I ignored it and restarted a second node (in a different datacenter) which also did not come back. I like my quorums so I didn't restart the other nodes. `nodetool describecluster` showed that every node had a different schema version. Querying system_schema showed a lot of records with their keys set to `\0\0\0\0(...)\0\0`. Cassandra logs indicated corrupted data, which was then fixed by running scrub. Of course that didn't actually fix the data and using CQL I removed most of the rows in system_schema that looked wrong. After doing that `nodetool describecluster' agreed on a schema version again. I've attached the python script I used to remove the records from the 'columns' table (fix.py), similar scripts were used for other tables. That didn't actually remove all the records, some proved impossible to delete : {code} # Partial output from the query "select * from system_schema.columns" | regular | -1 | text system_distributed | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | exception_stacktrace | regular | -1 | text system_distributed | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | finished_at | regular | -1 | timestamp system_distributed | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | keyspace_name {code} ... so I just left those there as it doesn't seem to impact the cluster other than spewing this error every minute : {code} ERROR [CompactionExecutor:20] 2016-07-04 14:19:59,798 CassandraDaemon.java:201 - Exception in thread Thread[CompactionExecutor:20,1,main] java.lang.AssertionError: Invalid clustering for the table: org.apache.cassandra.db.Clustering$2@661b79a at org.apache.cassandra.db.Clustering$Serializer.serialize(Clustering.java:136) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:159) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:108) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:144) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:112) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.ColumnIndex.writeAndBuildIndex(ColumnIndex.java:52) ~[apache-cassandra-3.0.7.jar:3.0.7] {code} The cluster works fine now, minus the phantom rows and minutely error on every node. As for the two boxes that got killed, they were `removenode`d and added back, somewhere in this process. h2. Cluster 2 Version: 3.0.6 Hardware: 3 datacenters, 13 machines total Network: 1Gbit, <1ms within the dc, <50ms cross-dc This is a cluster I use for tests, which involves doing a lot of keyspace changes. While doing a 3.0.6->3.0.7 upgrade this morning I noticed that the first box I wanted to upgrade immediately didn't come back. Startup on this node fails with : {code} ERROR [main] 2016-07-04 09:58:44,306 CassandraDaemon.java:698 - Exception encountered during startup java.lang.AssertionError: null at org.apache.cassandra.config.ColumnDefinition.<init>(ColumnDefinition.java:155) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.createColumnFromRow(SchemaKeyspace.java:1015) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.lambda$fetchColumns$12(SchemaKeyspace.java:995) ~[apache-cassandra-3.0.7.jar:3.0.7] at java.lang.Iterable.forEach(Iterable.java:75) ~[na:1.8.0_77] at org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:995) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:949) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:928) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:891) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:868) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:856) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:136) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:126) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:235) [apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:557) [apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:685) [apache-cassandra-3.0.7.jar:3.0.7] {code} `nodetool status -r` : {code} Datacenter: One ================ Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack DN cassandra1.dc1.mydomain.com 10.38 MB 256 6.8% 7470d016-9a45-4e00-819a-77d7e09a14a2 1r1 UN cassandra2.dc1.mydomain.com 8.64 MB 256 7.3% cb93240d-b1c6-47f0-a1bb-59e4ae127a1f 1r2 UN cassandra3.dc1.mydomain.com 11.32 MB 256 7.6% ff6b3342-8142-42ba-8dd0-da00cd4ae95f 1r3 UN cassandra4.dc1.mydomain.com 12.46 MB 256 7.2% 91fad227-b394-4e25-be65-0f34a9dbbf9b 1r4 UN cassandra5.dc1.mydomain.com 12.03 MB 256 8.4% 74d98f17-df0b-40f2-b23b-7c6e5f49c2d7 1r5 Datacenter: Two ================ Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN cassandra1.dc2.mydomain.com 10.39 MB 256 7.8% f49efc68-d530-4074-912a-b008f578c9d0 2r1 UN cassandra2.dc2.mydomain.com 8.23 MB 256 8.5% b339a66e-4ef7-43c2-9507-9ac23dd7ad5c 2r2 UN cassandra3.dc2.mydomain.com 10.34 MB 256 7.2% 28d51ab8-5ee2-41a7-9e93-247fdf9f6d85 2r3 Datacenter: Three ================ Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN cassandra1.dc3.mydomain.com 9.47 MB 256 7.2% bbd06f32-6d40-49f4-b71c-30227aac20f1 3r1 UN cassandra2.dc3.mydomain.com 9.88 MB 256 7.7% 2789cffd-db20-47b9-962e-193326660345 3r2 UN cassandra3.dc3.mydomain.com 11.36 MB 256 8.5% 9a11ad49-112b-4b43-b937-f5e12176d725 3r3 UN cassandra4.dc3.mydomain.com 11.77 MB 256 7.6% 1009f985-2229-45c6-88c5-64ee508c4c3c 3r4 UN cassandra5.dc3.mydomain.com 11.11 MB 256 7.9% 4cbac3e8-c412-4375-ba2b-354a0bd81df8 3r5 {code} `nodetool describecluster` : {code} Cluster Information: Name: my_cluster Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch Partitioner: org.apache.cassandra.dht.Murmur3Partitioner Schema versions: c478ac2f-c773-370c-aeca-d1a7169ad092: [10.xxx.xxx.xxx] 35f98fc6-3ddc-3358-9c92-d5a251ebc844: [10.xxx.xxx.xxx] a1573012-90a1-303f-81af-2ddc387cfc98: [10.xxx.xxx.xxx] c4a86820-60ea-371e-a24c-31b2040d18f1: [10.xxx.xxx.xxx] 1a734c68-c72f-3f0e-ac51-6fadc7854447: [10.xxx.xxx.xxx] 5042d7d8-c1d2-334c-95ce-443260401940: [10.xxx.xxx.xxx] dfc67ce1-5422-30e8-a533-9c2f0c2f7ad9: [10.xxx.xxx.xxx] 0f32b476-0e6f-3064-8795-5d8adc2b3704: [10.xxx.xxx.xxx] 31b66ee1-9447-39ff-9953-bad4b01ba87b: [10.xxx.xxx.xxx] 7bb3cee9-eef5-356a-b435-9500550fda00: [10.xxx.xxx.xxx] 6adcfe50-2a16-3bc5-93d0-006481c6217e: [10.xxx.xxx.xxx] 5bb7c619-3e64-3ae0-b50e-8a6b5af78b1a: [10.xxx.xxx.xxx] UNREACHABLE: [10.xxx.xxx.xxx] {code} Like the other cluster, this cluster has a corrupted system_schema. Partial output from "select * from system_schema.keyspaces" : {code} keyspace_name | durable_writes | replication ----------------------------------------------+----------------+---------------------------------------------------------------------------------------------------------- system_auth | True | {'One': '5', 'Two': '3', 'Three': '5', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'} system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'} \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | False | {} {code} The cluster is still up and is able to take reads and writes. In cqlsh's `desc keyspaces` I see an additional keyspace that pretends to be an empty string : {code} cassandra@cqlsh> desc keyspaces; system_distributed system_schema system system_traces "" system_auth {code} Very curious. This issue can potentially destroy a cluster, so I'm marking this as critical. The fix for broken nodes seems to be to run my fix.py against every node and against every table in system_schema, after running a scrub on those same nodes. was: Symptoms : * Existing nodes fail to restart * system_schema has broken data * `nodetool describecluster` shows a full disagreement This happened on two clusters I manage, and so far I have no idea why. I'll describe symptoms and info on what I did to (partially) resolve this. Hope the actual bug can get fixed. All clusters run with the binary distribution from cassandra.apache.org. One cluster runs on CentOS 6, the other CentOS 7, but both with Java 8u77. The issue was seen on version 3.0.4 and during an upgrade from 3.0.6 to 3.0.7. ** Cluster 1 ** Version: 3.0.4 Hardware: 2 datacenters, 3 machines each Network: 1Gbit, <1ms within the dc, <20ms cross-dc This happened several months ago. I found out the hard way that every node had a different schema_version when I tried to restart a node and it didn't come back. Assuming it was just a single unhappy node, I ignored it and restarted a second node (in a different datacenter) which also did not come back. I like my quorums so I didn't restart the other nodes. `nodetool describecluster` showed that every node had a different schema version. Querying system_schema showed a lot of records with their keys set to `\0\0\0\0(...)\0\0`. Cassandra logs indicated corrupted data, which was then fixed by running scrub. Of course that didn't actually fix the data and using CQL I removed most of the rows in system_schema that looked wrong. After doing that `nodetool describecluster' agreed on a schema version again. I've attached the python script I used to remove the records from the 'columns' table (fix.py), similar scripts were used for other tables. That didn't actually remove all the records, some proved impossible to delete : {code} # Partial output from the query "select * from system_schema.columns" | regular | -1 | text system_distributed | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | exception_stacktrace | regular | -1 | text system_distributed | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | finished_at | regular | -1 | timestamp system_distributed | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | keyspace_name {code} ... so I just left those there as it doesn't seem to impact the cluster other than spewing this error every minute : {code} ERROR [CompactionExecutor:20] 2016-07-04 14:19:59,798 CassandraDaemon.java:201 - Exception in thread Thread[CompactionExecutor:20,1,main] java.lang.AssertionError: Invalid clustering for the table: org.apache.cassandra.db.Clustering$2@661b79a at org.apache.cassandra.db.Clustering$Serializer.serialize(Clustering.java:136) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:159) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:108) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:144) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:112) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.db.ColumnIndex.writeAndBuildIndex(ColumnIndex.java:52) ~[apache-cassandra-3.0.7.jar:3.0.7] {code} The cluster works fine now, minus the phantom rows and minutely error on every node. As for the two boxes that got killed, they were `removenode`d and added back, somewhere in this process. ** Cluster 2 ** Version: 3.0.6 Hardware: 3 datacenters, 13 machines total Network: 1Gbit, <1ms within the dc, <50ms cross-dc This is a cluster I use for tests, which involves doing a lot of keyspace changes. While doing a 3.0.6->3.0.7 upgrade this morning I noticed that the first box I wanted to upgrade immediately didn't come back. Instead of trying to fix it (or just rebuilding the cluster) I left it like that and am now filing this report. Startup on this node fails with : {code} ERROR [main] 2016-07-04 09:58:44,306 CassandraDaemon.java:698 - Exception encountered during startup java.lang.AssertionError: null at org.apache.cassandra.config.ColumnDefinition.<init>(ColumnDefinition.java:155) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.createColumnFromRow(SchemaKeyspace.java:1015) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.lambda$fetchColumns$12(SchemaKeyspace.java:995) ~[apache-cassandra-3.0.7.jar:3.0.7] at java.lang.Iterable.forEach(Iterable.java:75) ~[na:1.8.0_77] at org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:995) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:949) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:928) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:891) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:868) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:856) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:136) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:126) ~[apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:235) [apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:557) [apache-cassandra-3.0.7.jar:3.0.7] at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:685) [apache-cassandra-3.0.7.jar:3.0.7] {code} `nodetool status -r` : {code} Datacenter: One ================ Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack DN cassandra1.dc1.mydomain.com 10.38 MB 256 6.8% 7470d016-9a45-4e00-819a-77d7e09a14a2 1r1 UN cassandra2.dc1.mydomain.com 8.64 MB 256 7.3% cb93240d-b1c6-47f0-a1bb-59e4ae127a1f 1r2 UN cassandra3.dc1.mydomain.com 11.32 MB 256 7.6% ff6b3342-8142-42ba-8dd0-da00cd4ae95f 1r3 UN cassandra4.dc1.mydomain.com 12.46 MB 256 7.2% 91fad227-b394-4e25-be65-0f34a9dbbf9b 1r4 UN cassandra5.dc1.mydomain.com 12.03 MB 256 8.4% 74d98f17-df0b-40f2-b23b-7c6e5f49c2d7 1r5 Datacenter: Two ================ Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN cassandra1.dc2.mydomain.com 10.39 MB 256 7.8% f49efc68-d530-4074-912a-b008f578c9d0 2r1 UN cassandra2.dc2.mydomain.com 8.23 MB 256 8.5% b339a66e-4ef7-43c2-9507-9ac23dd7ad5c 2r2 UN cassandra3.dc2.mydomain.com 10.34 MB 256 7.2% 28d51ab8-5ee2-41a7-9e93-247fdf9f6d85 2r3 Datacenter: Three ================ Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns (effective) Host ID Rack UN cassandra1.dc3.mydomain.com 9.47 MB 256 7.2% bbd06f32-6d40-49f4-b71c-30227aac20f1 3r1 UN cassandra2.dc3.mydomain.com 9.88 MB 256 7.7% 2789cffd-db20-47b9-962e-193326660345 3r2 UN cassandra3.dc3.mydomain.com 11.36 MB 256 8.5% 9a11ad49-112b-4b43-b937-f5e12176d725 3r3 UN cassandra4.dc3.mydomain.com 11.77 MB 256 7.6% 1009f985-2229-45c6-88c5-64ee508c4c3c 3r4 UN cassandra5.dc3.mydomain.com 11.11 MB 256 7.9% 4cbac3e8-c412-4375-ba2b-354a0bd81df8 3r5 {code} `nodetool describecluster` : {code} Cluster Information: Name: my_cluster Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch Partitioner: org.apache.cassandra.dht.Murmur3Partitioner Schema versions: c478ac2f-c773-370c-aeca-d1a7169ad092: [10.xxx.xxx.xxx] 35f98fc6-3ddc-3358-9c92-d5a251ebc844: [10.xxx.xxx.xxx] a1573012-90a1-303f-81af-2ddc387cfc98: [10.xxx.xxx.xxx] c4a86820-60ea-371e-a24c-31b2040d18f1: [10.xxx.xxx.xxx] 1a734c68-c72f-3f0e-ac51-6fadc7854447: [10.xxx.xxx.xxx] 5042d7d8-c1d2-334c-95ce-443260401940: [10.xxx.xxx.xxx] dfc67ce1-5422-30e8-a533-9c2f0c2f7ad9: [10.xxx.xxx.xxx] 0f32b476-0e6f-3064-8795-5d8adc2b3704: [10.xxx.xxx.xxx] 31b66ee1-9447-39ff-9953-bad4b01ba87b: [10.xxx.xxx.xxx] 7bb3cee9-eef5-356a-b435-9500550fda00: [10.xxx.xxx.xxx] 6adcfe50-2a16-3bc5-93d0-006481c6217e: [10.xxx.xxx.xxx] 5bb7c619-3e64-3ae0-b50e-8a6b5af78b1a: [10.xxx.xxx.xxx] UNREACHABLE: [10.xxx.xxx.xxx] {code} Like the other cluster, this cluster has a corrupted system_schema. Partial output from "select * from system_schema.keyspaces" : {code} keyspace_name | durable_writes | replication ----------------------------------------------+----------------+---------------------------------------------------------------------------------------------------------- system_auth | True | {'One': '5', 'Two': '3', 'Three': '5', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'} system_schema | True | {'class': 'org.apache.cassandra.locator.LocalStrategy'} \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | False | {} {code} The cluster is still up and is able to take reads and writes. In cqlsh's `desc keyspaces` I see an additional keyspace that pretends to be an empty string : {code} cassandra@cqlsh> desc keyspaces; system_distributed system_schema system system_traces "" system_auth {code} Very curious. I left the second cluster in this state so that I can answer questions here on Jira, if needed. This issue can potentially destroy a cluster, so I'm marking this as critical. The fix for broken nodes seems to be to run my fix.py against every node and against every table in system_schema, after running a scrub on those same nodes. > system_schema corruption causing nodes to not restart > ----------------------------------------------------- > > Key: CASSANDRA-12131 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12131 > Project: Cassandra > Issue Type: Bug > Reporter: Tom van der Woerdt > Priority: Critical > Fix For: 3.0.x > > > h2. Symptoms > * Existing nodes fail to restart > * system_schema has broken data > * `nodetool describecluster` shows a full disagreement > This happened on two clusters I manage, and so far I have no idea why. I'll > describe symptoms and info on what I did to (partially) resolve this. Hope > the actual bug can get fixed. > All clusters run with the binary distribution from cassandra.apache.org. One > cluster runs on CentOS 6, the other CentOS 7, but both with Java 8u77. The > issue was seen on version 3.0.4 and during an upgrade from 3.0.6 to 3.0.7. > h2. Cluster 1 > Version: 3.0.4 > Hardware: 2 datacenters, 3 machines each > Network: 1Gbit, <1ms within the dc, <20ms cross-dc > This happened several months ago. I found out the hard way that every node > had a different schema_version when I tried to restart a node and it didn't > come back. Assuming it was just a single unhappy node, I ignored it and > restarted a second node (in a different datacenter) which also did not come > back. > I like my quorums so I didn't restart the other nodes. `nodetool > describecluster` showed that every node had a different schema version. > Querying system_schema showed a lot of records with their keys set to > `\0\0\0\0(...)\0\0`. Cassandra logs indicated corrupted data, which was then > fixed by running scrub. > Of course that didn't actually fix the data and using CQL I removed most of > the rows in system_schema that looked wrong. After doing that `nodetool > describecluster' agreed on a schema version again. I've attached the python > script I used to remove the records from the 'columns' table (fix.py), > similar scripts were used for other tables. > That didn't actually remove all the records, some proved impossible to delete > : > {code} > # Partial output from the query "select * from system_schema.columns" > | regular | -1 | text > system_distributed | > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 > | > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 > | none | exception_stacktrace > | regular | -1 | text > system_distributed | > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 > | > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | > finished_at > | regular | -1 | timestamp > system_distributed | > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 > | > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | none | > keyspace_name > {code} > ... so I just left those there as it doesn't seem to impact the cluster other > than spewing this error every minute : > {code} > ERROR [CompactionExecutor:20] 2016-07-04 14:19:59,798 > CassandraDaemon.java:201 - Exception in thread > Thread[CompactionExecutor:20,1,main] > java.lang.AssertionError: Invalid clustering for the table: > org.apache.cassandra.db.Clustering$2@661b79a > at > org.apache.cassandra.db.Clustering$Serializer.serialize(Clustering.java:136) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:159) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:108) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.db.ColumnIndex$Builder.add(ColumnIndex.java:144) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:112) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.db.ColumnIndex.writeAndBuildIndex(ColumnIndex.java:52) > ~[apache-cassandra-3.0.7.jar:3.0.7] > {code} > The cluster works fine now, minus the phantom rows and minutely error on > every node. As for the two boxes that got killed, they were `removenode`d and > added back, somewhere in this process. > h2. Cluster 2 > Version: 3.0.6 > Hardware: 3 datacenters, 13 machines total > Network: 1Gbit, <1ms within the dc, <50ms cross-dc > This is a cluster I use for tests, which involves doing a lot of keyspace > changes. While doing a 3.0.6->3.0.7 upgrade this morning I noticed that the > first box I wanted to upgrade immediately didn't come back. > Startup on this node fails with : > {code} > ERROR [main] 2016-07-04 09:58:44,306 CassandraDaemon.java:698 - Exception > encountered during startup > java.lang.AssertionError: null > at > org.apache.cassandra.config.ColumnDefinition.<init>(ColumnDefinition.java:155) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.createColumnFromRow(SchemaKeyspace.java:1015) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.lambda$fetchColumns$12(SchemaKeyspace.java:995) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at java.lang.Iterable.forEach(Iterable.java:75) ~[na:1.8.0_77] > at > org.apache.cassandra.schema.SchemaKeyspace.fetchColumns(SchemaKeyspace.java:995) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.fetchTable(SchemaKeyspace.java:949) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.fetchTables(SchemaKeyspace.java:928) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspace(SchemaKeyspace.java:891) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.fetchKeyspacesWithout(SchemaKeyspace.java:868) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.schema.SchemaKeyspace.fetchNonSystemKeyspaces(SchemaKeyspace.java:856) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:136) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at org.apache.cassandra.config.Schema.loadFromDisk(Schema.java:126) > ~[apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:235) > [apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:557) > [apache-cassandra-3.0.7.jar:3.0.7] > at > org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:685) > [apache-cassandra-3.0.7.jar:3.0.7] > {code} > `nodetool status -r` : > {code} > Datacenter: One > ================ > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) > Host ID Rack > DN cassandra1.dc1.mydomain.com 10.38 MB 256 6.8% > 7470d016-9a45-4e00-819a-77d7e09a14a2 1r1 > UN cassandra2.dc1.mydomain.com 8.64 MB 256 7.3% > cb93240d-b1c6-47f0-a1bb-59e4ae127a1f 1r2 > UN cassandra3.dc1.mydomain.com 11.32 MB 256 7.6% > ff6b3342-8142-42ba-8dd0-da00cd4ae95f 1r3 > UN cassandra4.dc1.mydomain.com 12.46 MB 256 7.2% > 91fad227-b394-4e25-be65-0f34a9dbbf9b 1r4 > UN cassandra5.dc1.mydomain.com 12.03 MB 256 8.4% > 74d98f17-df0b-40f2-b23b-7c6e5f49c2d7 1r5 > Datacenter: Two > ================ > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) > Host ID Rack > UN cassandra1.dc2.mydomain.com 10.39 MB 256 7.8% > f49efc68-d530-4074-912a-b008f578c9d0 2r1 > UN cassandra2.dc2.mydomain.com 8.23 MB 256 8.5% > b339a66e-4ef7-43c2-9507-9ac23dd7ad5c 2r2 > UN cassandra3.dc2.mydomain.com 10.34 MB 256 7.2% > 28d51ab8-5ee2-41a7-9e93-247fdf9f6d85 2r3 > Datacenter: Three > ================ > Status=Up/Down > |/ State=Normal/Leaving/Joining/Moving > -- Address Load Tokens Owns (effective) > Host ID Rack > UN cassandra1.dc3.mydomain.com 9.47 MB 256 7.2% > bbd06f32-6d40-49f4-b71c-30227aac20f1 3r1 > UN cassandra2.dc3.mydomain.com 9.88 MB 256 7.7% > 2789cffd-db20-47b9-962e-193326660345 3r2 > UN cassandra3.dc3.mydomain.com 11.36 MB 256 8.5% > 9a11ad49-112b-4b43-b937-f5e12176d725 3r3 > UN cassandra4.dc3.mydomain.com 11.77 MB 256 7.6% > 1009f985-2229-45c6-88c5-64ee508c4c3c 3r4 > UN cassandra5.dc3.mydomain.com 11.11 MB 256 7.9% > 4cbac3e8-c412-4375-ba2b-354a0bd81df8 3r5 > {code} > `nodetool describecluster` : > {code} > Cluster Information: > Name: my_cluster > Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch > Partitioner: org.apache.cassandra.dht.Murmur3Partitioner > Schema versions: > c478ac2f-c773-370c-aeca-d1a7169ad092: [10.xxx.xxx.xxx] > 35f98fc6-3ddc-3358-9c92-d5a251ebc844: [10.xxx.xxx.xxx] > a1573012-90a1-303f-81af-2ddc387cfc98: [10.xxx.xxx.xxx] > c4a86820-60ea-371e-a24c-31b2040d18f1: [10.xxx.xxx.xxx] > 1a734c68-c72f-3f0e-ac51-6fadc7854447: [10.xxx.xxx.xxx] > 5042d7d8-c1d2-334c-95ce-443260401940: [10.xxx.xxx.xxx] > dfc67ce1-5422-30e8-a533-9c2f0c2f7ad9: [10.xxx.xxx.xxx] > 0f32b476-0e6f-3064-8795-5d8adc2b3704: [10.xxx.xxx.xxx] > 31b66ee1-9447-39ff-9953-bad4b01ba87b: [10.xxx.xxx.xxx] > 7bb3cee9-eef5-356a-b435-9500550fda00: [10.xxx.xxx.xxx] > 6adcfe50-2a16-3bc5-93d0-006481c6217e: [10.xxx.xxx.xxx] > 5bb7c619-3e64-3ae0-b50e-8a6b5af78b1a: [10.xxx.xxx.xxx] > UNREACHABLE: [10.xxx.xxx.xxx] > {code} > Like the other cluster, this cluster has a corrupted system_schema. Partial > output from "select * from system_schema.keyspaces" : > {code} > keyspace_name | durable_writes | replication > ----------------------------------------------+----------------+---------------------------------------------------------------------------------------------------------- > system_auth | True | {'One': '5', > 'Two': '3', 'Three': '5', 'class': > 'org.apache.cassandra.locator.NetworkTopologyStrategy'} > system_schema | True | > {'class': > 'org.apache.cassandra.locator.LocalStrategy'} > \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00 | False | > > {} > {code} > The cluster is still up and is able to take reads and writes. In cqlsh's > `desc keyspaces` I see an additional keyspace that pretends to be an empty > string : > {code} > cassandra@cqlsh> desc keyspaces; > system_distributed > system_schema system system_traces > "" system_auth > {code} > Very curious. > This issue can potentially destroy a cluster, so I'm marking this as > critical. The fix for broken nodes seems to be to run my fix.py against every > node and against every table in system_schema, after running a scrub on those > same nodes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)