I noticed this after I upgraded one node in a 0.7 cluster of 5 to the latest stable 0.7 build "2011-02-08_20-41-25" (upgraded  node was jb-cass1 below). This is a long email, you can jump to the end and help me out by checking something on your  07 cluster. 

This is the value from o.a.c.gms.FailureDetector.AllEndpointStates on jb-cass05 9114.67)

/192.168.114.63   X3:2011-02-08_20-41-25   SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455   LOAD:2.84182972E8   STATUS:NORMAL,0
/192.168.114.64   SCHEMA:2f555eb0-3332-11e0-9e8d-c4f8bbf76455   LOAD:2.84354156E8   STATUS:NORMAL,34028236692093846346337460743176821145 
/192.168.114.66   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:2.59171601E8   STATUS:NORMAL,102084710076281539039012382229530463435 
/192.168.114.65   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:2.70907168E8   STATUS:NORMAL,68056473384187692692674921486353642290 
jb08.wetafx.co.nz/192.168.114.67   SCHEMA:075cbd1f-3316-11e0-9e8d-c4f8bbf76455   LOAD:1.155260665E9   STATUS:NORMAL,136112946768375385385349842972707284580 

Notice the schema for nodes 63 and 64 starts with 2f55 and for 65, 66 and 67 it starts with 075.

This is the output from pycassa calling describe_versions when connected to both the 63 (jb-cass1) and 67 (jb-cass5) nodes

In [34]: sys.describe_schema_versions()
Out[34]: 
{'2f555eb0-3332-11e0-9e8d-c4f8bbf76455': ['192.168.114.63',
                                          '192.168.114.64',
                                          '192.168.114.65',
                                          '192.168.114.66',
                                          '192.168.114.67']}

It's reporting all nodes on the 2f55 schema. The SchemaCheckVerbHandler is getting the value from DatabaseDescriptor. FailureDetector MBean is getting them from Gossiper.endpointStateMap . Requests are working though, so the CFid's must be matching up. 

Commit https://github.com/apache/cassandra/commit/ecbd71f6b4bb004d26e585ca8a7e642436a5c1a4 added code to the 0.7 branch in the HintedHandOffManager to check the schema versions of nodes it has hints for. This is now failing on the new node as follows...

ERROR [HintedHandoff:1] 2011-02-09 16:11:23,559 AbstractCassandraDaemon.java (line org.apache.cassandra.service.AbstractCassandraDaemon$1.uncaughtException(AbstractCassandraDaemon.java:114)) Fatal exception in thread Thread[HintedHandoff:1,1,main]
java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /192.168.114.64 in 60000ms
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Could not reach schema agreement with /192.168114.64 in 60000ms
        at org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:267)
        at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:88)
        at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:391)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        ... 3 more

(the nodes can all see each other, checked with notetool during the 60 seconds)

If I restart one of the nodes with the 075 schema (without upgrading it) it reads the schema from the system tables and goes back to the 2f55 schema. e.g. the 64 node was also on the 075 schema, I restarted and it moved to the 2f55 and logged appropriately. While writing this email I checked again with the 65 node, and the schema if was reporting to other nodes changed after a restart from 075 to 2f55

INFO [main] 2011-02-09 17:17:11,457 DatabaseDescriptor.java (line org.apache.cassandra.config.DatabaseDescriptor) Loading schema version 2f555eb0-3332-11e0-9e8d-c4f8bbf76455

I've been reading the code for migrations and gossip don't have a theory as to what is going on. 


REQUEST FOR HELP: 

If you have a 0.7 cluster can you please check if this has happened so I can know this is a real problem or just an Aaron problem. You can check by...
- getting the values from the o.a.c.gms.FailureDetector.AllEndPointStates
- running describe_schema_versions via the API, here is how to do it via pycassa http://pycassa.github.com/pycassa/api/pycassa/system_manager.html?highlight=describe_schema_versions
- checking at the schema ids' from the failure detector match the result from describe_schema_versions()
- if they do not match can you also include some info on what sort of schema changes have happened on the box.

Thanks
Aaron

Reply via email to