Hello, We've had Cassandra running in a single production data center now for several months and have started detailed plans to add data center fault tolerance.
Our requirements do not appear to be solved out-of-the-box with Cassandra. I'd like to share a solution we're planning and find others considering similar problems. We require the following: 1. Two data centers One is primary, the other hot standby to be used when primary fails. Of course Cassandra has no such bias, but as will be seen below this becomes important when considering app latency. 2. No more than 3 copies of data total We are storing blob-like objects. Cost per unit of usable storage is closely scrutinized vs other solutions. Hence we want to keep replication factor low. Two copies will be held in the primary DC, 1 in the secondary DC - with the corresponding ratio of machines in each DC. 3. Immediate consistency 4. No waiting on remote data center The application front-end runs in the primary data center and expects that operations using a local coordinator node will not suffer a response time determined by the WAN. Hence we cannot require a response from the node in the secondary data center to achieve quorum. 5. Ability to operate with a single working node per key, if necessary We wish to temporarily operate with even a single working node per token in desperate situations involving data center failures or combinations of node and data center failure. Existing Cassandra solutions offer combinations of the above, but it is not at all clear how to achieve all the above without custom work. Normal quorum with N=3 can only work with a single down node regardless of topology. Furthermore if one node in the primary DC fails, quorum requires synchronous operations over the WAN. NetworkTopologyStrategy is nice, but requiring quorum in the primary DC with 2 nodes means no tolerance to a single node failure there. If we're overlooking something I'd love to know. Hence the following proposal for a new replication strategy we're calling SubQuorum. In short SubQuorum allows administratively marking some nodes as being exempt from participating in quorum. As all nodes agree as to exemption status, consistency is still guaranteed as quorum is still achieved amongst the remaining nodes. We gain tremendous flexibility to deal with node and DC failures. Exempt nodes, if up, still receive mutation messages as usual. For example : If a primary DC node fails we can mark its remote counterpart exempt from quorum, hence allowing continued operation without a synchronous call over the WAN. Or another example : If the primary DC fails we mark all primary DC nodes exempt and move the entire application to the secondary DC where it runs as usual but with just the one copy. The implementation is trivial and consists of two pieces: 1. Exempt node management. The list of exempt nodes is broadcast out of band. In our case we're leveraging puppet and a admin server. 2. We've written an implementation of AbstractReplicationStrategy that returns custom QuorumResponseHandler and IWriteResponseHandler. These simply wait for quorum amongst non-exempt nodes. This requires a small change to the AbstractReplicationStrategy interface to pass the endpoints to getQuorumResponseHandler and getWriteResponseHandler, but otherwise changes are contained in the plugin. There is more analysis I can share if anyone is interested. But at this point I'd like to get feedback. Thanks, Wayne Lewis