Thomas, I am not sure whether you are proposing to add a more "asynchronous"
PersistenceManager or to completely change the behaviour of the current one.

While I would love a system that can scale well for reads and writes, and
while I understand that there is a class of applications that are well
served by "eventual consistency", I think a system like Jackrabbit, mainly
used for content publishing and delivery over the web, will suffer from such
an implementation (but I may be wrong).

Here are a few cases where I think eventual consistency doesn't work (and I
am assuming there is a web application like Sling or CMIS that exposes the
JCR APIs via REST, and that each webserver is connected to a different JCR
instance):

1) A pure REST system where a client uploads a file via POST/PUT (write),
receives a new URL in return and fetches the new node information via GET
(read).

2) A blog system where the publisher uploads an image and the content page
refreshes to show the image.

3) A real-time collaboration system (like a chat room) where a user uploads
a file and broadcasts the file URL and all other participants try to
view/download the file.

#1 and #2 could potentially be non-problems assuming the cluster is behind a
load balancer configured for session affinity (so two requests from the same
client will go to the same server) and/or by making sure client and server
support keep-alive (again, the same client is hitting the same server), but
actually I would say these are more "hidden" problems than non-problems. One
day a request will go through a proxy that doesn't support keep-alive, or
come from a misconfigured browser, or the load balancer will decide to better
"balance" requests, and these will become real problems.

#3 I would think is always a problem, because multiple clients will always
hit different webservers, and so different JCR instances.

I agree that keeping the interval between "merges" very short will mitigate
the problem, but again, there will always be a chance that one server slows
down and doesn't synchronize in time; the problem will then appear, and it
will be very difficult to diagnose.

Some of the NoSQL implementations address this problem by making the client
deal with inconsistencies, which most of the time means having the client
read from multiple servers and then compare the results. This works, but
again, if I want a system that is always consistent, isn't it easier to keep
consistency at the write level?
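
Just to illustrate what pushing this to the client looks like, here is a
minimal sketch of "read from multiple servers and keep the newest"
(ReplicaClient and VersionedValue are hypothetical types, not JCR or
Jackrabbit APIs):

import java.util.List;

// A value tagged with the version (or change set id) that last modified it.
class VersionedValue {
    final long version;
    final byte[] data;
    VersionedValue(long version, byte[] data) {
        this.version = version;
        this.data = data;
    }
}

// One handle per JCR/REST instance behind the load balancer.
interface ReplicaClient {
    VersionedValue get(String path);
}

class ConsistentReader {
    private final List<ReplicaClient> replicas;

    ConsistentReader(List<ReplicaClient> replicas) {
        this.replicas = replicas;
    }

    /** Reads from every replica and keeps the value with the newest version. */
    VersionedValue read(String path) {
        VersionedValue newest = null;
        for (ReplicaClient replica : replicas) {
            VersionedValue candidate = replica.get(path);
            if (candidate != null
                    && (newest == null || candidate.version > newest.version)) {
                newest = candidate;
            }
        }
        return newest;
    }
}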

Also, I think that both the current system and this new system work well
with a reasonably small cluster (e.g. <= 10 machines) but may degrade
quickly if the cluster gets much larger (and this is just a gut feeling, I
really don't have any data to corroborate it :)

If the goal is to enable much larger clusters, the system should probably be
designed to support some kind of sharding/partitioning (i.e. if I have a
cluster with 100 machines, do I really want each machine to replicate the
full repository, or should nodes be replicated only on a small number of
machines?). Of course this opens a new can of worms, because there is a set
of operations that don't work well on a partitioned system. Search is
probably fine, because it should be easy to run the same search on the
different partitions and merge the results (see the sketch below).
Transactions may be tricky because you may be dealing with nodes in
different partitions and you'll need to synchronize operations between them.
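
A rough sketch of that scatter/gather search (PartitionSearcher and Hit are
hypothetical types, not the JCR query API): each partition is queried
independently, and the results are merged and sorted by score.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// A single search result with its relevance score.
class Hit {
    final String path;
    final double score;
    Hit(String path, double score) {
        this.path = path;
        this.score = score;
    }
}

// Searches one partition of the repository.
interface PartitionSearcher {
    List<Hit> search(String query, int limit);
}

class FederatedSearch {
    private final List<PartitionSearcher> partitions;

    FederatedSearch(List<PartitionSearcher> partitions) {
        this.partitions = partitions;
    }

    /** Runs the query on every partition and merges the results by score. */
    List<Hit> search(String query, int limit) {
        List<Hit> merged = new ArrayList<Hit>();
        for (PartitionSearcher partition : partitions) {
            merged.addAll(partition.search(query, limit));
        }
        Collections.sort(merged, new Comparator<Hit>() {
            public int compare(Hit a, Hit b) {
                return Double.compare(b.score, a.score);   // highest score first
            }
        });
        return merged.size() > limit ? merged.subList(0, limit) : merged;
    }
}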

Note that pure NoSQL systems solve this problem by saying "well, you can't do
that" or "do it in the client", and they expose a very limited set of
operations, i.e. only put or get, so that they can scale. But I suspect that
trying to apply some of the NoSQL scalability tricks to a system with a rich
set of operations like JCR is going to be hard.

But again, as long as there is the option to configure the system to work
"synchronously" or "asynchronously" depending on the use case (and as long
as the "synchronous" version is not penalized by the new design), it's worth
pursuing.

-- Raffaele


On Tue, Oct 19, 2010 at 3:24 AM, Thomas Müller <thomas.muel...@day.com> wrote:

> The current Jackrabbit clustering doesn't scale well for writes
> because all cluster nodes use the same persistent storage. Even if the
> persistent storage is clustered, the cluster journal relies on
> changes being immediately visible in all nodes. That means Jackrabbit
> clustering can scale well for reads, but it can't scale well for
> writes. This is a property Jackrabbit clustering shares with most
> clustering solutions for relational databases. Still, it would make
> sense to solve this problem for Jackrabbit 3.
>
> == Current Jackrabbit Clustering ==
>
> [Cluster Node 1]  <--> | Shared
> [Cluster Node 2]  <--> | Storage
>
> I propose a different architecture in Jackrabbit 3:
>
> == Jackrabbit 3 Clustering ==
>
> [Cluster Node 1]  <-->  [ Local Storage ]
> [Cluster Node 2]  <-->  [ Local Storage ]
>
> Please note that shared node storage is still supported for things
> like the data store, but no longer required or supported for the
> persistent storage (currently called PersistenceManager).
>
> Instead, the cluster nodes should merge each other's changes
> asynchronously (except for operations like JCR locking, plus potentially
> other operations that are not that common; maybe even node move). By
> "asynchronously" I mean usually within a second or so, but potentially
> minutes later depending on configuration, latency between cluster
> nodes, and possibly load. This is similar to NoSQL systems.
>
> == Unique Change Set Ids ==
>
> For my idea to work, we need globally unique change set ids. Each
> change set is stored in the event journal, and can be retrieved later
> and sent to other cluster nodes. I suggest that events are grouped
> into change sets so that all events within the same session.save()
> operation have the same change set id. We could also call it a
> transaction id (I don't mind). Change set ids need to be unique across
> all cluster nodes. That means the change set id could be:
>
> changeSetId = nanosecondsSince1970 * totalClusterNodes + clusterNodeId
>
> Let's say you currently have 2 cluster nodes and expect to add a
> few more later (up to 10); then you could use the formula:
>
> changeSetId = nanosecondsSince1970 * 10 + clusterNodeId
>
> To support more than 10 cluster nodes the formula would need to be
> changed (that could be done at runtime). It doesn't necessarily need
> to be this formula, but the change set id should represent the time
> when the change occurred, and it should be unique.
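>
> As a sketch (the names are made up, not existing Jackrabbit code), such a
> generator could look like the following in Java. Milliseconds are used
> rather than nanoseconds here only so that the id still fits into a signed
> 64-bit long after multiplying by the node count:
>
> class ChangeSetIdGenerator {
>
>     private final int clusterNodeId;    // 0 .. maxClusterNodes - 1
>     private final int maxClusterNodes;  // e.g. 10
>     private long lastTime;
>
>     ChangeSetIdGenerator(int clusterNodeId, int maxClusterNodes) {
>         this.clusterNodeId = clusterNodeId;
>         this.maxClusterNodes = maxClusterNodes;
>     }
>
>     /** Returns an id that is unique across the cluster and increases with time. */
>     synchronized long nextChangeSetId() {
>         long now = System.currentTimeMillis();
>         // guard against the clock stepping backwards and against
>         // multiple calls within the same millisecond
>         if (now <= lastTime) {
>             now = lastTime + 1;
>         }
>         lastTime = now;
>         return now * maxClusterNodes + clusterNodeId;
>     }
> }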
>
> == How to Merge Changes ==
>
> Changes need to be merged so that all cluster nodes end up with the
> same data (you could call this "eventually consistent").
>
> New changes are not problematic and can be applied directly. This includes
> local changes of course, because the change set id of a local change is
> always newer than that of the last applied change.
>
> Changes with change set ids in the future are delayed. Cluster nodes
> should have reasonably synchronized clocks (they don't need to be
> completely exact, but they should be reasonably accurate, so that such
> delayed events are not that common).
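>
> A rough sketch of how an incoming change set could be dispatched (the
> ChangeSet type and the apply/merge hooks are made-up names, not existing
> Jackrabbit code); delayed change sets would be re-checked by a periodic
> task, which is not shown:
>
> import java.util.Comparator;
> import java.util.PriorityQueue;
>
> // A group of events saved in one session.save() call, identified by its id.
> class ChangeSet {
>     final long id;
>     ChangeSet(long id) {
>         this.id = id;
>     }
> }
>
> abstract class ChangeSetDispatcher {
>
>     private final PriorityQueue<ChangeSet> delayed =
>             new PriorityQueue<ChangeSet>(16, new Comparator<ChangeSet>() {
>                 public int compare(ChangeSet a, ChangeSet b) {
>                     return a.id < b.id ? -1 : (a.id > b.id ? 1 : 0);
>                 }
>             });
>     private final int maxClusterNodes;
>     private long lastAppliedId;
>
>     ChangeSetDispatcher(int maxClusterNodes) {
>         this.maxClusterNodes = maxClusterNodes;
>     }
>
>     void onRemoteChangeSet(ChangeSet change) {
>         // every id generated up to the current millisecond (on any node) is below this
>         long horizon = (System.currentTimeMillis() + 1) * maxClusterNodes;
>         if (change.id >= horizon) {
>             delayed.add(change);    // id lies in the future: wait for the local clock
>         } else if (change.id >= lastAppliedId) {
>             apply(change);          // newer than everything applied so far: apply directly
>             lastAppliedId = change.id;
>         } else {
>             merge(change);          // older than an applied change: merge (Solution A or B below)
>         }
>     }
>
>     abstract void apply(ChangeSet change);
>     abstract void merge(ChangeSet change);
> }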
>
> So the only tricky case is a change that happened in the past on
> another cluster node, when the same data was changed on this cluster
> node (or another cluster node) afterwards ("afterwards" meaning with a
> higher change set id). To find out that a change happened in the past,
> each node needs to know at least the change set id of its last change.
> There are multiple solutions:
>
> == Solution A: Node Granularity, Ignore Old Changes ==
>
> Here, each node only needs to know when it was changed the last time.
> If the change set id is older than that, changes to its properties and
> child node list are ignored. That means that if two cluster nodes
> concurrently change data in a node, the newer change wins and the
> older change is lost. This is a bit problematic, for example when
> concurrently adding child nodes: only the child node added by the
> newer change survives, which is probably unexpected.
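>
> A minimal sketch of this "newer change wins" rule (the NodeState class
> here is illustrative, not Jackrabbit's internal NodeState):
>
> import java.util.HashMap;
> import java.util.Map;
>
> class NodeState {
>     long lastChangeSetId;
>     final Map<String, String> properties = new HashMap<String, String>();
> }
>
> class LastWriteWinsMerger {
>
>     /**
>      * Applies a property change only if it is newer than the node's last
>      * change; an older concurrent change is simply dropped.
>      */
>     boolean applyPropertyChange(NodeState node, long changeSetId,
>                                 String name, String value) {
>         if (changeSetId < node.lastChangeSetId) {
>             return false;           // the older change loses and is ignored
>         }
>         node.properties.put(name, value);
>         node.lastChangeSetId = changeSetId;
>         return true;
>     }
> }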
>
> == Solution B: Merge Old Changes ==
>
> Here, we need an efficient way to load the list of changes (events) to
> a node since a certain time. Now, when merging a change, the old
> versions of the node need to be loaded or reconstructed, and then the
> old change needs to be applied as if it had already happened before the
> newer change. Let's say we know about these two versions:
>
> v1: node a; child nodes b, c, d; properties x=1, y=2
> event t9: add child node e, set property x=2, remove property y
> v9: node a; child nodes b, c, d, e; properties x=2
>
> The change to merge happened in the past:
>
> event t3: add child node f, remove child node b, set property y=3,
> remove property x, set property z=1
>
> Now the result would be:
>
> v9(new): node a; child nodes c, d, e, f; properties x=2, z=1
>
> There are other ways to merge the changes, of course (for example, only
> merging added / removed child nodes). I think there are some
> tricky problems; however, I think it's relatively easy to verify that the
> algorithm is correct using a few randomized test cases. No matter what
> the merge rules are, they would need to be constructed so that, at the
> end of the day, each cluster node ends up with the exact same
> data, for all possible combinations and orders of changes.
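>
> A sketch of this replay idea (the event and node model is made up, not
> the Jackrabbit event journal API): all known events for a node are
> sorted by change set id and applied in that order, so a change that
> arrives late is applied as if it had happened first. Replaying t3 and
> then t9 on top of v1 reproduces exactly the v9(new) state above.
>
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
>
> class MergeByReplay {
>
>     // A single operation within a change set.
>     static class Op {
>         final String kind;    // "addChild", "removeChild", "setProperty", "removeProperty"
>         final String name;
>         final String value;   // only used by "setProperty"
>         Op(String kind, String name, String value) {
>             this.kind = kind;
>             this.name = name;
>             this.value = value;
>         }
>     }
>
>     // All operations that share one change set id.
>     static class Event {
>         final long changeSetId;
>         final List<Op> ops;
>         Event(long changeSetId, List<Op> ops) {
>             this.changeSetId = changeSetId;
>             this.ops = ops;
>         }
>     }
>
>     static class NodeState {
>         final Set<String> childNodes = new HashSet<String>();
>         final Map<String, String> properties = new HashMap<String, String>();
>     }
>
>     /** Replays all known events on the given state, ordered by change set id. */
>     static NodeState replay(NodeState state, List<Event> events) {
>         List<Event> ordered = new ArrayList<Event>(events);
>         Collections.sort(ordered, new Comparator<Event>() {
>             public int compare(Event a, Event b) {
>                 return a.changeSetId < b.changeSetId ? -1
>                         : (a.changeSetId > b.changeSetId ? 1 : 0);
>             }
>         });
>         for (Event event : ordered) {
>             for (Op op : event.ops) {
>                 if ("addChild".equals(op.kind)) {
>                     state.childNodes.add(op.name);
>                 } else if ("removeChild".equals(op.kind)) {
>                     state.childNodes.remove(op.name);
>                 } else if ("setProperty".equals(op.kind)) {
>                     state.properties.put(op.name, op.value);
>                 } else if ("removeProperty".equals(op.kind)) {
>                     state.properties.remove(op.name);
>                 }
>             }
>         }
>         return state;
>     }
> }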
>
> Regards,
> Thomas
>
