On 19 Oct 2010, at 11:24, Thomas Müller wrote:
> changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId
I have spent some time doing experiments on this, here are some observations based on those experiments (could be totally irrelevant to this discussion, sorry for the noise if it is).

A sequential number helps, since you know the order and can maintain lookahead and lookback buffers to ensure you replay in sequence. Performing a master election and then measuring the clock offset between each node and the master, averaging over the request/response round-trip time, gives the offset of all nodes relative to one another and allows real-time clock sync with diffs. You do need to exchange more than one message during the master election to get a good measurement of the time offset (ideally measure until the mean is stable).

Once you have that, I would recommend BigInt storage and reserving 4 digits for the cluster ID. I would also not use nanoseconds, but make the ID generator in each JVM synchronised and run a rapid counter, reserving 10K ticks per ms. Server offsets are never going to be ns-accurate, and who knows which conflicting change is right in the ns range. So the structure of the change set ID is:

    ((((serverOffset + msSince1970) * 1000) + perMSCounter) * 1000) + clusterNodeId

(rough sketches of the generator and the offset measurement are at the end of this mail). This generates a cluster-wide sequence where each counter runs independently on each server, blocked only by a very small sync block. The sequence has holes in it. Interestingly, if you can accept that the sequence is reliable (i.e. after a certain time delay all events have arrived and none were lost), it can be used as the basis for the sequence in the current (2.x) ClusterNode implementation, as the replay of a Journal does not require a hole-free sequence (AFAICT). The sequence can be checked for lost events if each subsequent event contains the previous ID.

The clusterNodeId is taken from a position in a list of clusterNodeNames that is shared on master election; the only thing the master does is coordinate that list and determine clock offsets. That also removes any management of the ID range, unlike Cassandra, which shards the ID range over the cluster from the start.

The other thing I have found necessary is to propagate version information with changes, so that when a remote node receives an event and does not have the current version locally, it knows which node within the cluster does, giving just-in-time-consistent rather than eventually-consistent behaviour.

I am sorry if this is old information, I haven't been following the JR3 discussions too closely, but I hope it helps. One other thing you might want to have a quick look at is the Quorum implementation in Cassandra, which might be relevant to making certain areas more ACID than BASE.

What you do with the data once you have the sequence and event distribution is another matter; I guess you are thinking of distributing the content over the entire cluster rather than making it local to each node.

HTH
Ian
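P.S. To make the ID structure concrete, here is a rough, untested Java sketch of the generator; the names and field widths are mine and would obviously be adjusted to whatever widths you settle on (e.g. 10K ticks per ms and 4 digits for the node ID):

    import java.math.BigInteger;

    /**
     * Sketch of a per-JVM change set ID generator. One instance per node;
     * the only contention is the small synchronized block in next().
     */
    public class ChangeSetIdGenerator {

        private final long clusterNodeId;   // position in the shared list of clusterNodeNames
        private final long serverOffsetMs;  // clock offset relative to the master, from election

        private long lastMs = -1;
        private long perMsCounter = 0;

        public ChangeSetIdGenerator(long clusterNodeId, long serverOffsetMs) {
            this.clusterNodeId = clusterNodeId;
            this.serverOffsetMs = serverOffsetMs;
        }

        public synchronized BigInteger next() {
            long ms = System.currentTimeMillis() + serverOffsetMs;
            if (ms <= lastMs) {
                // same millisecond (or the clock stepped back): keep counting in the last slot
                ms = lastMs;
                perMsCounter++;
            } else {
                lastMs = ms;
                perMsCounter = 0;
            }
            // ((((serverOffset + msSince1970) * 1000) + perMsCounter) * 1000) + clusterNodeId
            long id = ((ms * 1000) + perMsCounter) * 1000 + clusterNodeId;
            return BigInteger.valueOf(id); // stored as BigInt so the fields can be widened later
        }
    }

As long as the per-ms counter and the node ID stay within their reserved ranges, IDs from different nodes never collide and still sort roughly by time.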
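The serverOffsetMs that feeds into the generator comes out of the master election; it is essentially the standard NTP-style offset calculation, repeated until the mean settles. Again just an untested sketch with names of my own:

    /**
     * Sketch of the clock offset measurement against the master.
     * The slave records its send/receive times (t0, t3) and the master
     * timestamps the request on receipt and reply (t1, t2), all in ms.
     */
    public class ClockOffsetEstimator {

        private long sum = 0;
        private int samples = 0;

        public void addSample(long t0, long t1, long t2, long t3) {
            // master clock minus slave clock, with the round trip averaged out
            long offset = ((t1 - t0) + (t2 - t3)) / 2;
            sum += offset;
            samples++;
        }

        public long meanOffsetMs() {
            return samples == 0 ? 0 : sum / samples;
        }
    }

In practice you would probably also discard samples with unusually long round trips before averaging.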