[ https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932917#action_12932917 ]
Sylvain Lebresne commented on CASSANDRA-1072:
---------------------------------------------

Thanks for the answers. I haven't had time to look at the new patch yet, sorry, but I will as soon as time permits. A few answers to your answers though.

{quote}
1. Is this the kind of IP address situation you are referring to? A cluster of nodes: A (127.0.0.1), B (127.0.0.2), and C (127.0.0.3) have been running and are not fully consistent. They're brought back up w/ shuffled ips, like so: A (127.0.0.2), B (127.0.0.3), and C (127.0.0.1). A has the most up-to-date view of writes to 127.0.0.1, however, C is now in-charge of writes to 127.0.0.1. i.e. any writes to A that C had not seen, previously, have now been lost.
{quote}

That's one scenario, but I think you would actually be very lucky if, in such a scenario, you only "lose a few non-replicated updates". There is much (much) worse. Suppose you have your 3-node cluster (and say RF=2 or 3). Node A accepts one or more counter updates and its part of the counter is, say, 10. This value 10 is replicated to B (as part of "repair on write" or read repair). On B, the memtable is flushed, so this value 10 sits in one of B's sstables. Now A accepts more updates, bringing the value to, say, 15. Again, this value 15 is replicated. At this point the cluster is coherent and the value of the counter is 15. But suppose the cluster is somehow shut down, there is some IP mixup, and B is restarted with the IP that A had before. Now any read (on B) will reconcile the two values 10 and 15, merge them (because B now believes these are updates it has accepted itself, and as such are deltas, while they are not) and yield 25. Very quickly, replication will pollute every other node in the cluster with this bogus value, and compaction will make it permanent.
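To make that failure mode concrete, here is a minimal sketch of the merge rule being discussed: values the local node believes it authored are treated as deltas and summed, while values from other nodes are reconciled by keeping the newest. This is hypothetical Java, not the actual Cassandra classes; CounterShard, merge and the clock field are made-up names used only for illustration.

{code:java}
import java.net.InetAddress;

// Illustrative only: a counter "shard" as described above, tagged with the
// node that authored it, a per-author clock and the running count.
final class CounterShard {
    final InetAddress owner;
    final long clock;
    final long count;

    CounterShard(InetAddress owner, long clock, long count) {
        this.owner = owner;
        this.clock = clock;
        this.count = count;
    }
}

final class CounterMergeSketch {
    static CounterShard merge(CounterShard a, CounterShard b, InetAddress localAddress) {
        if (a.owner.equals(localAddress)) {
            // Updates this node believes it authored itself: treated as deltas and summed.
            return new CounterShard(a.owner, Math.max(a.clock, b.clock), a.count + b.count);
        }
        // Updates authored by another node: keep the most recent one, do not sum.
        return a.clock >= b.clock ? a : b;
    }

    public static void main(String[] args) throws Exception {
        InetAddress formerAddressOfA = InetAddress.getByName("127.0.0.1");

        // Two values sitting in B's sstables, both replicated from A while A owned
        // 127.0.0.1: 10 (older) and 15 (newer, which already includes the 10).
        CounterShard v1 = new CounterShard(formerAddressOfA, 1, 10);
        CounterShard v2 = new CounterShard(formerAddressOfA, 2, 15);

        // B reads with its own address 127.0.0.2: the shards are foreign, newest wins -> 15.
        System.out.println(merge(v1, v2, InetAddress.getByName("127.0.0.2")).count);

        // B restarted with A's former address: the same shards look like local deltas -> 25.
        System.out.println(merge(v1, v2, formerAddressOfA).count);
    }
}
{code}

With B's own address the merge keeps 15; with A's former address both values look like local deltas and sum to 25, which is exactly the bogus value described above.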
Potentially, any change of a node's IP to an IP that has been used by another node at some point (even a decommissioned one) can be harmful (and dramatically so), unless you know that everything has been compacted nice and clean. So while I agree that such IP changes are not supposed to be the norm, they can happen, and so they will (even in test environments, where one may be less prudent and such scenarios are thus even more likely; it will piss people off real bad). I'm strongly opposed (and always will be) to any change to Cassandra that will destroy data because someone on the ops team has messed up and hit the enter key a bit too quickly. But that's just my humble opinion and it's open source, so anybody else, please chime in and give yours.

{quote}
A fix with UUIDs is possible but it's beyond the scope of this jira.
{quote}

Because of what's above, I disagree with this. Even more so because I'm not at all convinced that this could be easily fixed afterwards.

{quote}
2. Valid issue, but it does sound like something of an edge case. For a first version of 1072 it seems reasonable that instructions for ops would be sufficient for this problem. If the community then still feels it's a problem we can look at how to improve the code.
{quote}

Not sure that's an edge case. Right now, when a node is bootstrapped, repair is not run automatically at the end of the bootstrap, in part because failures tend to show up quickly. Thus good advice is to wait a bit and make sure the new node behaves correctly before running repair on the other nodes, so that you have a quick rollback if it doesn't. Bootstrap followed by decommission seems to me bound to happen from time to time (if someone feels like confirming/denying?). That repair has not been run when this happens doesn't seem a crazy scenario at all either. And anyway, as for 1, the risk is to corrupt data (for the same reason: a node will merge values that are not deltas). I don't consider "telling people to be careful" a fix. And because I don't think fixing that will be easy, I'm not comfortable with leaving it for later.

More generally, the counter design is based on some values being merged (summed) together (deltas) and others being reconciled as usual based on timestamps. This is a double-edged sword. It allows for quite nice performance properties, but it requires being very careful not to sum two values that should not be summed. I don't believe this is something that should be done later (especially when we're not sure it can be done later in a satisfactory way).

{quote}
3. To resolve this issue we have borrowed the implementation from CASSANDRA-1546 (with the added deadlock fix).
{quote}

Cool, and thanks for the deadlock fix.

> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072.patch, increment_test.py, Partitionedcountersdesigndoc.pdf
>
>
> Break out the increment counters out of CASSANDRA-580. Classes are shared between the two features but without the plain version vector code the changeset becomes smaller and more manageable.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.