On 19 Oct 2010, at 11:24, Thomas Müller wrote:
> changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId
I have spent some time doing experiments on this, here are some observations based on those experiments (could be totally irrelevant to this discussion, sorry for the noise if it is).

A sequential number helps, since you know the order and can maintain lookahead and lookback buffers to ensure you replay in sequence. Performing a master election and then measuring the clock offset between each node and the master, averaging over the request/response round-trip time, gives the offset of all nodes relative to one another and allows real-time clock sync with diffs. You do need to exchange more than one message during the master election to get a good measurement of the time offset (ideally measure until the mean is stable).

Once you have that, I would recommend BigInt storage and reserving 4 digits for the cluster ID. I would also not use nanoseconds, but make the ID generator in each JVM synchronised and run a rapid counter, reserving 10K ticks per ms. Server offsets are never going to be ns-accurate, and who knows which conflicting change is right in the ns range. So the structure of the change set ID is:

    ((((serverOffset + msSince1970) * 1000) + perMSCounter) * 1000) + clusterNodeId

(rough sketches of the generator and the offset measurement are at the end of this mail). This generates a cluster-wide sequence where each counter runs independently on each server, blocked only by a very small sync block. The sequence has holes in it. Interestingly, if you can accept that the sequence is reliable (i.e. after a certain time delay all events have arrived and none were lost), it can be used as the basis for the sequence in the current (2.x) ClusterNode implementation, as the replay of a Journal does not require a hole-free sequence (AFAICT). The sequence can be checked for lost events if each subsequent event contains the previous ID.

The clusterNodeId is taken from a position in a list of clusterNodeNames that is shared on master election; the only thing the master does is coordinate that list and determine clock offsets. That also removes any management of the ID range, unlike Cassandra, which shards the ID range over the cluster from the start.

The other thing I have found necessary is to propagate version information with changes, so that when a remote node receives an event and does not have the current version locally, it knows which node within the cluster does, giving just-in-time-consistent rather than eventually-consistent behaviour.

I am sorry if this is old information, I haven't been following the JR3 discussions too closely, but I hope it helps. One other thing you might want to have a quick look at is the Quorum implementation in Cassandra, which might be relevant to making certain areas more ACID than BASE.

What you do with the data once you have the sequence and event distribution is another matter; I guess you are thinking of distributing the content over the entire cluster rather than making it local to each node.

HTH
Ian
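P.S. To make the ID structure concrete, here is a rough, untested Java sketch of the generator; the names and field widths are mine and would obviously be adjusted to whatever widths you settle on (e.g. 10K ticks per ms and 4 digits for the node ID):

    import java.math.BigInteger;

    /**
     * Sketch of a per-JVM change set ID generator. One instance per node;
     * the only contention is the small synchronized block in next().
     */
    public class ChangeSetIdGenerator {

        private final long clusterNodeId;   // position in the shared list of clusterNodeNames
        private final long serverOffsetMs;  // clock offset relative to the master, from election

        private long lastMs = -1;
        private long perMsCounter = 0;

        public ChangeSetIdGenerator(long clusterNodeId, long serverOffsetMs) {
            this.clusterNodeId = clusterNodeId;
            this.serverOffsetMs = serverOffsetMs;
        }

        public synchronized BigInteger next() {
            long ms = System.currentTimeMillis() + serverOffsetMs;
            if (ms <= lastMs) {
                // same millisecond (or the clock stepped back): keep counting in the last slot
                ms = lastMs;
                perMsCounter++;
            } else {
                lastMs = ms;
                perMsCounter = 0;
            }
            // ((((serverOffset + msSince1970) * 1000) + perMsCounter) * 1000) + clusterNodeId
            long id = ((ms * 1000) + perMsCounter) * 1000 + clusterNodeId;
            return BigInteger.valueOf(id); // stored as BigInt so the fields can be widened later
        }
    }

As long as the per-ms counter and the node ID stay within their reserved ranges, IDs from different nodes never collide and still sort roughly by time.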
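The serverOffsetMs that feeds into the generator comes out of the master election; it is essentially the standard NTP-style offset calculation, repeated until the mean settles. Again just an untested sketch with names of my own:

    /**
     * Sketch of the clock offset measurement against the master.
     * The slave records its send/receive times (t0, t3) and the master
     * timestamps the request on receipt and reply (t1, t2), all in ms.
     */
    public class ClockOffsetEstimator {

        private long sum = 0;
        private int samples = 0;

        public void addSample(long t0, long t1, long t2, long t3) {
            // master clock minus slave clock, with the round trip averaged out
            long offset = ((t1 - t0) + (t2 - t3)) / 2;
            sum += offset;
            samples++;
        }

        public long meanOffsetMs() {
            return samples == 0 ? 0 : sum / samples;
        }
    }

In practice you would probably also discard samples with unusually long round trips before averaging.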