> I may have been unclear about the meaning of timestamp in Cassandra. I was
> under the impression that any given data with the same key value and two
> different timestamps would result in two 'rows'. From what you say, it does
> not seem to be the case. Do you confirm? (In other words, whoever has the
> greatest timestamp destroys the previous records with lower timestamps).

Yes (other than the use of the word "row"). An "insert" of a column (a
column being essentially a key/value pair) causes the key to be
associated with that value. If there was already a column with the same
key, it is replaced. If not, a column is added.

If you have a situation where conflicting writes cannot be allowed,
you'll either have to have some strong co-ordination of writers outside
of Cassandra, or else "serialize" the problem by writing intended
changes to some kind of queue/data structure that one particular
guaranteed-to-be-alone Cassandra client processes in batch mode
independently (thereby avoiding the need for co-ordination).

> I know I am boxing a corner case, but I have not seen in the documentation
> that latest timestamp erases/overwrites previous data. Now, I may have
> missed something here. Maybe I did not rub my eyes enough or the coffee was
> not operating yet.

I'm not sure where it's most clearly stated, and I don't remember how I
figured these things out originally. I think the closest thing on the
wiki would be:

   http://wiki.apache.org/cassandra/DataModel

It does mention that timestamps are used for conflict resolution but
does not really dwell on the issue, and the remainder elides
timestamps. So perhaps it's easy to miss. I also notice that the
phrasing is such that it is not entirely unreasonable to interpret it
the way you seem to have.

At the same time, that page is somewhat of a mix between internal models
and the model exposed to clients, so I'm not sure how best to improve
the phrasing.

Riptano's recently added documentation may be worth reading:

   http://www.riptano.com/docs/0.6.5/index

Though upon cursory examination I'm not sure whether it is any clearer
on this particular point.

> i) That most recent timestamp overwrites previous entries with lower
> timestamp.

This can definitely be clarified.

> ii) In case of timestamp ties, value breaks ties.

This too could be clarified, if it is indeed intended to be a guarantee
and not an artifact of the current implementation (anyone want to
comment? jbellis?).

> iii) What about ColumnFamilies and SuperColumnFamilies? Do we have the
> guarantee that, in case of timestamp ties, the whole record of the winner
> is registered (I would assume yes, of course)

Individual columns may be inserted into a SuperColumn, so a SuperColumn
is not inserted as one compound value. If writers A and B both do
concurrent insertions to a SuperColumn, where A writes column C1 and B
writes columns C1 and C2, B's write of C2 will always stick, but C1
will be subject to individual column conflict resolution.

Keep in mind however that timestamps are typically not allocated/chosen
on a per-column basis by a client. It does occur to me that at this
point you may actually have issues with the timestamp tie and
value-based conflict resolution if you are expecting a set of column
updates to either apply or not apply as a group (with respect to some
other group of updates). That's a bit subtle.

Also on the topic of granularity, entire super columns and entire rows
may be deleted without individually referring to all columns. In those
cases, deletes span entire rows or super columns rather than individual
columns.

-- 
/ Peter Schuller
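
A minimal Python sketch of the per-column conflict resolution discussed
above may help make the SuperColumn case concrete. It is a toy model,
not Cassandra code: the names Column, resolve and apply_writes are
hypothetical, and the tie-break direction (the greater value wins on
equal timestamps) is an assumption, since the thread leaves open whether
value-based tie-breaking is a guarantee at all.

```python
from typing import Dict, Iterable, NamedTuple


class Column(NamedTuple):
    """Toy stand-in for a column's payload: a value plus a client timestamp."""
    value: bytes
    timestamp: int


def resolve(a: Column, b: Column) -> Column:
    """Pick the surviving version of one column."""
    if a.timestamp != b.timestamp:
        # The higher timestamp replaces the lower one.
        return a if a.timestamp > b.timestamp else b
    # ASSUMPTION: on a timestamp tie, the greater value wins.
    return a if a.value >= b.value else b


def apply_writes(writes: Iterable[Dict[str, Column]]) -> Dict[str, Column]:
    """Merge a sequence of writes into one super column, column by column."""
    stored: Dict[str, Column] = {}
    for write in writes:
        for name, col in write.items():
            stored[name] = resolve(stored[name], col) if name in stored else col
    return stored


# Writer A sets C1; writer B sets C1 and C2 with the same timestamp.
write_a = {"C1": Column(b"zzz", 100)}
write_b = {"C1": Column(b"aaa", 100), "C2": Column(b"bbb", 100)}

result = apply_writes([write_a, write_b])
for name in sorted(result):
    print(name, result[name])
```

Under these assumptions C1 ends up holding A's value while C2 holds
B's, so B's two-column update is only partially visible. The outcome is
the same whichever write is applied first, but the update is not
applied or rejected as a group.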