On 21/10/2010 23:40, Peter Schuller wrote:
OK. Thanks for your answer. From an email exchange I had with Jonathan, all
this means that one should re-read one's writes at quorum to make sure they
have not been overridden by timestamp-tie conflicts. I suggested sending
feedback to the writing node (in the ACK) when such a timestamp-tie conflict
happens. This would avoid having to double-check all writes for timestamp-tie
conflicts.
If multiple applications write to the same ColumnFamily/table, this
double-check is a must (unless a separate locking mechanism is implemented,
which would be heavier).
I'm not sure I understand what you're trying to accomplish. Given that
you have no locking/synchronization mechanism external to Cassandra,
what is it that you are actually learning from re-reading the value? A
completed write at level QUORUM means it was successfully written and
that readers reading at QUORUM will see it unless the value has been
updated subsequently.
REM: I am not trying to make this discussion longer than necessary or to
play semantics. I am not into that at all and I appreciate the time you
take to answer me, really.
Here is where I disagree with your conclusion when there is a timestamp
tie. The write by node E will not be performed successfully (at quorum
level), because the tie is resolved in favor of A on one or more of the
nodes between A and E.
Let's imagine that A initiates its column write at: 334450 ms with 'AAA'
and timestamp 334450 ms
Let's imagine that E initiates its column write at: 334451 ms with
'ZZZ' and timestamp 334450 ms
(E is the latest write)
Let's imagine that A reaches C at 334455 ms and performs its write.
Let's imagine that E reaches C at 334456 ms and attempts to perform its
write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').
Even if there is no further writing on that same column using timestamp
334450, a quorum read won't see that 'ZZZ' value (which is the latest
attempt to write/update the column).
Node A will have completed a write at QUORUM level.
Node E will have completed a write at QUORUM level, but its value won't
be registered and it won't be notified about it.
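
To make the scenario above concrete, here is a minimal Python sketch of a
single replica (node C) applying both writes. It is not Cassandra code:
the Replica class, the "ACK" string and example_tie_winner are hypothetical
names, and the tie rule is a placeholder chosen only so that 'AAA' wins, as
in the example. Which value really wins a tie depends on Cassandra's own
comparison rule, which this sketch does not try to reproduce.

```python
from typing import Callable, Dict, Tuple


class Replica:
    """Hypothetical in-memory replica storing (timestamp, value) per column."""

    def __init__(self, tie_winner: Callable[[str, str], str]) -> None:
        self.tie_winner = tie_winner
        self.cells: Dict[str, Tuple[int, str]] = {}

    def write(self, column: str, timestamp: int, value: str) -> str:
        current = self.cells.get(column)
        if current is None or timestamp > current[0]:
            # Newer timestamp (or first write): the incoming value is stored.
            self.cells[column] = (timestamp, value)
        elif timestamp == current[0]:
            # Timestamp tie: the value decides which write survives.
            self.cells[column] = (timestamp, self.tie_winner(current[1], value))
        # An older timestamp is silently ignored.
        # The acknowledgement says nothing about which branch was taken.
        return "ACK"

    def read(self, column: str) -> str:
        return self.cells[column][1]


def example_tie_winner(a: str, b: str) -> str:
    # Placeholder rule (assumption): picked so that 'AAA' beats 'ZZZ'.
    return min(a, b)


node_c = Replica(example_tie_winner)
ack_a = node_c.write("col", 334450, "AAA")  # A reaches C first
ack_e = node_c.write("col", 334450, "ZZZ")  # E reaches C second, same timestamp
stored = node_c.read("col")
print(ack_a, ack_e, stored)  # ACK ACK AAA
```

Both writers receive the same acknowledgement, yet only 'AAA' can be read
back afterwards.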
Hence, I disagree with your conclusion that a quorum write implies that
it was successfully written. It is not the case for E. I know we could
play semantics about the meaning of 'successful write' here, but that
would not lead us anywhere and that is not my point.
But even if you re-read, that does not remove
the fundamental potential for a race condition (i.e., you still don't
know when you see the result of your read whether it wasn't just
overwritten anyway just after you did your read).
Perhaps I'm misunderstanding what you're trying to do?
I totally agree there is a risk of race condition.
Here is what I am trying to do and why:
If there is no timestamp-tie between A and E, then I have no issue.
If there is a timestamp-tie, then the context becomes uncertain for E,
out of the blue.
If application E can't be sure about what has been saved in Cassandra,
it cannot rely on what it has in memory. It is a vicious circle. It
can't anticipate the potential actions of A on the column either.
This is unusual for any application, but maybe this is the price to pay
for using Cassandra. Fair enough.
If E is not informed of the timestamp tie, then it is left alone in the
dark. Hence, this is why I say Cassandra is not deterministic to E. The
result of a write is potentially non-deterministic in what it actually
performs.
If E was aware that it lost a timestamp-tie, it would know that there is
a possible gap between its internal memory representation and what it
tried to save into Cassandra. That is, EVEN if there is no further write
on that same column (or, in other words, regardless of any potential
subsequent races).
If E was informed it lost a timestamp-tie, it could re-read the column
(and let's assume that there is no further write in between, but this
does not change the argument). It could spot that its write
for timestamp value 334450 ms failed, and also the reason why ('AAA'
greater than 'ZZZ'). It could issue a new write, which could in turn
result in another timestamp-tie, but at least it would be informed about
it too... It would have a safety net.
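
As an illustration of that safety net, the sketch below shows what the
client side could look like if a write reported a lost tie. Everything here
is hypothetical: FlaggingStore stands in for a store that returns such a
flag (Cassandra does not), write_with_retry is an invented helper, and the
tie rule is a placeholder that lets 'AAA' win as in the example. Retrying
with a higher timestamp is only one possible reaction. The application
could just as well decide to keep the value that won.

```python
from typing import Dict, Tuple

Cell = Tuple[int, str]  # (timestamp, value)


class FlaggingStore:
    """Hypothetical store whose write() reports whether a timestamp tie was lost."""

    def __init__(self) -> None:
        self.cells: Dict[str, Cell] = {}

    def write(self, column: str, timestamp: int, value: str) -> bool:
        """Apply the write and return True if it lost a timestamp tie."""
        current = self.cells.get(column)
        if current is None or timestamp > current[0]:
            self.cells[column] = (timestamp, value)
            return False
        if timestamp == current[0]:
            winner = min(current[1], value)  # placeholder tie rule (assumption)
            self.cells[column] = (timestamp, winner)
            return winner != value
        return False  # older timestamp: not a tie, outside this sketch's scope

    def read(self, column: str) -> Cell:
        return self.cells[column]


def write_with_retry(store: FlaggingStore, column: str, value: str,
                     timestamp: int, max_attempts: int = 3) -> int:
    """Write `value`; on a lost tie, re-read and retry with a later timestamp.

    Returns the timestamp under which the value was finally registered.
    """
    for _ in range(max_attempts):
        if not store.write(column, timestamp, value):
            return timestamp
        stored_ts, _stored_value = store.read(column)  # see what actually won
        timestamp = stored_ts + 1
    raise RuntimeError("value still not registered after retries")


store = FlaggingStore()
store.write("col", 334450, "AAA")                         # A's write
final_ts = write_with_retry(store, "col", "ZZZ", 334450)  # E's write
print(final_ts, store.read("col"))  # 334451 (334451, 'ZZZ')
```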
The case I am trying to cover is the case where the context for
application E becomes invalid because of a successful write call to
Cassandra without registration of 'ZZZ'. How can Cassandra call it a
successful write, when in fact, it isn't for application E? I believe
Cassandra should notify application E one way or another. This is why I
mentioned an extra timestamp-tie flag in the write ACK sent by nodes
back to node E.
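
A rough sketch of what such a flag could look like follows. It is purely
illustrative: WriteAck, its tie_lost field, the Replica class, the replica
names B, C and D and the quorum_write helper are all hypothetical, and the
tie rule is again a placeholder that lets 'AAA' win. Nothing here reflects
Cassandra's actual wire protocol.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WriteAck:
    """Hypothetical ACK; tie_lost is the proposed extra field."""
    replica: str
    tie_lost: bool


class Replica:
    """Hypothetical in-memory replica storing (timestamp, value) per column."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cells: Dict[str, Tuple[int, str]] = {}

    def write(self, column: str, timestamp: int, value: str) -> WriteAck:
        current = self.cells.get(column)
        tie_lost = False
        if current is None or timestamp > current[0]:
            self.cells[column] = (timestamp, value)
        elif timestamp == current[0]:
            winner = min(current[1], value)  # placeholder tie rule (assumption)
            self.cells[column] = (timestamp, winner)
            tie_lost = winner != value
        return WriteAck(self.name, tie_lost)


def quorum_write(replicas: List[Replica], column: str,
                 timestamp: int, value: str) -> Tuple[bool, bool]:
    """Return (completed, tie_lost) for a write sent to all replicas."""
    acks = [replica.write(column, timestamp, value) for replica in replicas]
    quorum = len(replicas) // 2 + 1
    completed = len(acks) >= quorum  # every replica answers in this sketch
    tie_lost = any(ack.tie_lost for ack in acks)
    return completed, tie_lost


replicas = [Replica(name) for name in ("B", "C", "D")]
result_a = quorum_write(replicas, "col", 334450, "AAA")
result_e = quorum_write(replicas, "col", 334450, "ZZZ")
print(result_a)  # (True, False)
print(result_e)  # (True, True)
```

With the flag, E can tell a completed write from a registered one without
having to re-read every column it writes.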
The subsequent question I have is:
If 'value breaks timestamp-tie', how does Cassandra behave in case of
updates? If there is a column with value 'AAA' at 334450 ms and an
application explicitly wants to update this value to 'ZZZ' for 334450
ms, it seems like the timestamp-tie will prevent that. Hence, the
update/mutation would be non-deterministic to E. It seems like one should
first delete the existing record and write a new one (and that could
lead to race conditions and timestamp-ties too).
My conclusion so far is that a timestamp-tie boolean would help
resolve potentially non-deterministic situations which can appear
randomly at any time. Implementing locks would completely prevent these
situations, but then, locks would have to be implemented for all writes on
all tables that two application instances have access to. It is a
light/inexpensive versus heavy/costly safety net situation.
I think this should be documented, because engineers will hit that
'local' non-deterministic issue for sure if two instances of their
applications perform 'completed writes' in the same column family.
Completed does not mean successful, even with quorum (or ALL). They
ought to know it.
Jérôme