On 21/10/2010 23:40, Peter Schuller wrote:
>> OK. Thanks for your answer. From an email exchange I had with Jonathan, all
>> this means that one should re-read one's writes with quorum to make sure they
>> have not been overridden by timestamp-tie conflicts. I suggested sending
>> feedback to the writing node (in the ACK) when such a timestamp-tie conflict
>> happens. This would avoid having to double-check all writes for timestamp-tie
>> conflicts.
>>
>> If multiple applications write to the same ColumnFamily/Tables, this
>> double-check is a must (unless a separate locking mechanism is implemented,
>> which would be heavier).
>
> I'm not sure I understand what you're trying to accomplish. Given that
> you have no locking/synchronization mechanism external to Cassandra,
> what is it that you are actually learning from re-reading the value? A
> completed write at level QUORUM means it was successfully written and
> that readers reading at QUORUM will see it unless the value has been
> updated subsequently.
Note: I am not trying to make this discussion longer than necessary or to play semantics. I am not into that at all, and I really appreciate the time you take to answer me.

Here is where I disagree with your conclusion when there is a timestamp tie. The write by node E will not be performed successfully (at quorum level), because of the tie resolution in favor of A somewhere in all the nodes between A and E.

Let's imagine that A initiates its column write at 334450 ms with 'AAA' and timestamp 334450 ms.
Let's imagine that E initiates its column write at 334451 ms with 'ZZZ' and timestamp 334450 ms.
(E is the latest write)

Let's imagine that A reaches C at 334455 ms and performs its write.
Let's imagine that E reaches C at 334456 ms and attempts to perform its write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').

Even if there is no further writing on that same column using timestamp 334450, a quorum read won't see that 'ZZZ' value (which is the latest attempt to write/update the column).

Node A will have completed a write at QUORUM level.
Node E will have completed a write at QUORUM level, but its value won't be registered and it won't be notified about it.
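
To make the scenario concrete, here is a minimal in-memory sketch of it in Python. It is not Cassandra code: Replica, reconcile, quorum_write, quorum_read and incoming_wins_tie are hypothetical names, and the tie rule is an assumption chosen only to reproduce the outcome described above ('AAA' beats 'ZZZ'). Whatever rule a real node applies, the point is the same: the losing write is still acknowledged.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int


def incoming_wins_tie(existing: str, incoming: str) -> bool:
    # ASSUMPTION: tie rule picked so that 'AAA' beats 'ZZZ', as in this email.
    # It is not a statement about the comparison a real Cassandra node uses.
    return incoming < existing


def reconcile(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Return the surviving cell: highest timestamp, tie rule on equal timestamps."""
    if a is None:
        return b
    if b is None:
        return a
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    return b if incoming_wins_tie(a.value, b.value) else a


class Replica:
    def __init__(self) -> None:
        self.cells: Dict[str, Cell] = {}

    def write(self, key: str, value: str, timestamp: int) -> bool:
        self.cells[key] = reconcile(self.cells.get(key), Cell(value, timestamp))
        return True  # the write is acknowledged even if its value was discarded

    def read(self, key: str) -> Optional[Cell]:
        return self.cells.get(key)


def quorum_write(replicas: List[Replica], key: str, value: str, timestamp: int) -> bool:
    acks = sum(1 for r in replicas if r.write(key, value, timestamp))
    return acks >= len(replicas) // 2 + 1


def quorum_read(replicas: List[Replica], key: str) -> Optional[Cell]:
    needed = len(replicas) // 2 + 1
    return reduce(reconcile, (r.read(key) for r in replicas[:needed]), None)


def run_scenario():
    replicas = [Replica() for _ in range(3)]
    ok_a = quorum_write(replicas, "col", "AAA", 334450)  # writer A
    ok_e = quorum_write(replicas, "col", "ZZZ", 334450)  # writer E, same timestamp
    return ok_a, ok_e, quorum_read(replicas, "col").value
```

Under these assumptions, both calls to quorum_write report completion, yet the quorum read returns 'AAA'. E receives no signal that 'ZZZ' was dropped.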

Hence, I disagree with your conclusion that a quorum write implies that it was successfully written. It is not the case for E. I know we could play semantics about the meaning of 'successful write' here, but that would lead us nowhere and that is not my point.

> But even if you re-read, that does not remove
> the fundamental potential for a race condition (i.e., you still don't
> know when you see the result of your read whether it wasn't just
> overwritten anyway just after you did your read).
>
> Perhaps I'm misunderstanding what you're trying to do?

I totally agree there is a risk of race condition.

Here is what I am trying to do and why:

If there is no timestamp-tie between A and E, then I have no issue.

If there is a timestamp-tie, then the context becomes uncertain for E, out of the blue. If application E can't be sure about what has been saved in Cassandra, it cannot rely on what it has in memory. It is a vicious circle. It can't anticipate the potential actions of A on the column either. This is unusual for any application, but maybe this is the price to pay for using Cassandra. Fair enough.

If E is not informed of the timestamp tie, then it is left alone in the dark. Hence, this is why I say Cassandra is not deterministic to E. The result of a write is potentially non-deterministic in what it actually performs.

If E was aware that it lost a timestamp-tie, it would know that there is a possible gap between its internal memory representation and what it tried to save into Cassandra. That is, EVEN if there is no further write on that same column (or, in other words, regardless of any potential subsequent races).

If E was informed it lost a timestamp-tie, it could re-read the column (and let's assume that there is no further write in between, but this does not change anything to the argument). It could spot that its write for timestamp value 334450 ms failed, and also the reason why ('AAA' greater than 'ZZZ'). It could issue a new write, which could in turn result in another timestamp-tie, but at least it would be informed about it too... It would have a safety net.

The case I am trying to cover is the case where the context for application E becomes invalid because of a successful write call to Cassandra without registration of 'ZZZ'. How can Cassandra call it a successful write, when in fact, it isn't for application E? I believe Cassandra should notify application E one way or another. This is why I mentioned an extra timestamp-tie flag in the write ACK sent by nodes back to node E.
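
As an illustration of the flag I have in mind, here is a small Python sketch. Everything in it is hypothetical: WriteAck, its lost_tie field and write_with_tie_flag do not exist in Cassandra. They only show what the extra boolean in the ACK could carry. The tie rule is again an assumption chosen so that 'AAA' beats 'ZZZ', as in the example above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WriteAck:
    applied: bool   # the replica now stores this write's value
    lost_tie: bool  # HYPOTHETICAL flag: value discarded because of a timestamp tie


def incoming_wins_tie(existing: str, incoming: str) -> bool:
    # ASSUMPTION: rule chosen so that 'AAA' beats 'ZZZ', as in this email.
    return incoming < existing


class Replica:
    def __init__(self) -> None:
        self.cells: Dict[str, Tuple[int, str]] = {}  # key -> (timestamp, value)

    def write(self, key: str, value: str, timestamp: int) -> WriteAck:
        current = self.cells.get(key)
        if current is None or timestamp > current[0]:
            self.cells[key] = (timestamp, value)
            return WriteAck(applied=True, lost_tie=False)
        if timestamp < current[0]:
            # Older write: discarded, but not because of a tie.
            return WriteAck(applied=False, lost_tie=False)
        if value == current[1]:
            # Same timestamp and same value: nothing to resolve.
            return WriteAck(applied=True, lost_tie=False)
        if incoming_wins_tie(current[1], value):
            self.cells[key] = (timestamp, value)
            return WriteAck(applied=True, lost_tie=False)
        return WriteAck(applied=False, lost_tie=True)


def write_with_tie_flag(replicas: List[Replica], key: str, value: str,
                        timestamp: int) -> Tuple[bool, bool]:
    """Return (completed_at_quorum, lost_tie_on_some_replica)."""
    acks = [r.write(key, value, timestamp) for r in replicas]
    completed = len(acks) >= len(replicas) // 2 + 1
    return completed, any(a.lost_tie for a in acks)


def run_scenario():
    replicas = [Replica() for _ in range(3)]
    result_a = write_with_tie_flag(replicas, "col", "AAA", 334450)
    result_e = write_with_tie_flag(replicas, "col", "ZZZ", 334450)
    return result_a, result_e
```

In this sketch A gets (True, False) and E gets (True, True). The write call still completes for E, but E now knows that its in-memory state may differ from what is stored, and it can re-read the column or issue a new write.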

The subsequent question I have is:

If 'value breaks timestamp-tie', how does Cassandra behave in case of updates? If there is a column with value 'AAA' at 334450 ms and an application explicitly wants to update this value to 'ZZZ' for 334450 ms, it seems like the timestamp-tie will prevent that. Hence, the update/mutation would be non-deterministic to E. It seems like one should first delete the existing record and write a new one (and that could lead to race conditions and timestamp-ties too).

My conclusion so far is that a timestamp-tie boolean would help resolve potentially non-deterministic situations which can appear randomly at any time. Implementing locks would completely prevent these situations, but then, locks should be implemented for all writes on all tables if two application instances have access to them. It is a light/inexpensive versus heavy/costly safety net situation.

I think this should be documented, because engineers will hit that 'local' non-deterministic issue for sure if two instances of their applications perform 'completed writes' in the same column family. Completed does not mean successful, even with quorum (or ALL). They ought to know it.

Jérôme
