Re: Clustering persistent storage consistency issues

2012-01-27 Thread Berry van Halderen
On Fri, Jan 27, 2012 at 11:15 AM, Jukka Zitting  wrote:
> On Fri, Jan 27, 2012 at 9:50 AM, Berry van Halderen
> That shouldn't be possible since the first cluster node should be
> holding the cluster lock during the entire "update-persist" operation.
> Thus another cluster node shouldn't be able to make any concurrent
> changes.

Let me get back to you on that, because I'm looking into a possible fault
in my unit test.

\Berry


Re: Clustering persistent storage consistency issues

2012-01-27 Thread Jukka Zitting
Hi,

On Fri, Jan 27, 2012 at 9:50 AM, Berry van Halderen
 wrote:
> Before a JCR save action, the journal table in the database is
> consulted.  If there are logged actions performed by other Jackrabbit
> instances in the cluster, then these actions are imported/replayed first
> (SharedItemStateManager.doExternalUpdate).  Then the actual changes
> are written to the database.  Where this fails is that, in between the
> check for external updates and the actual writes to the database,
> another Jackrabbit instance may perform an update of its own.

That shouldn't be possible since the first cluster node should be
holding the cluster lock during the entire "update-persist" operation.
Thus another cluster node shouldn't be able to make any concurrent
changes.
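
Roughly, the intended sequence is the following (just a sketch; the
names below are illustrative, not the actual Jackrabbit classes):

    // Illustrative sketch only, not the actual Jackrabbit classes or method names.
    interface ClusterJournal {
        void lock();                  // cluster-wide lock (e.g. via the journal)
        void unlock();
        void append(Object changes);  // record our changes for the other nodes
    }

    class UpdatePersist {
        private final ClusterJournal journal;
        UpdatePersist(ClusterJournal journal) { this.journal = journal; }

        void save(Object localChanges) {
            journal.lock();                    // held for the entire "update-persist"
            try {
                replayExternalUpdates();       // ~ SharedItemStateManager.doExternalUpdate()
                persistLocally(localChanges);  // write the bundles to the database
                journal.append(localChanges);  // log our changes for the other nodes
            } finally {
                journal.unlock();              // only now can another node start its update
            }
        }

        private void replayExternalUpdates() { /* apply other nodes' journal records */ }
        private void persistLocally(Object changes) { /* write to the persistence manager */ }
    }

As long as the lock spans both the replay and the persist, the window
you describe should not exist.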

BR,

Jukka Zitting


Clustering persistent storage consistency issues

2012-01-27 Thread Berry van Halderen
  Dear devs,

Recently I dove into some issues that plagued our production systems,
where consistency checks regularly showed that some nodes are not quite as
they should be.  Mainly orphaned nodes (nodes no longer being referenced
from their supposed parent), but also missing child nodes (nodes being
referenced from a parent, but non-existent in the database) and wrongly
located child nodes (nodes referenced as a child from a parent, but the
actual node indicates a different parent).  There were some others as
well, but these were most prominent.

Now there is a lot of additional logic on top of Jackrabbit, so it took
some time to narrow down the actual issue, but I've been able to confirm
that the issue is due to clustering.  We intensively use clustered
set-ups of Jackrabbit: multiple Jackrabbit instances with a single MySQL
database, which also holds the binary storage (so no NFS is needed).  We
use neither JCR locking nor transactions, and it may be that the issue
occurs less often when those are used.  But I believe that even with JCR
locking and/or transactions the above consistency issues are still
possible.

To show where I think it goes wrong, let me walk through the clustering
logic as I understand it.
Before a JCR save action, the journal table in the database is
consulted.  If there are logged actions performed by other Jackrabbit
instances in the cluster, then these actions are imported/replayed first
(SharedItemStateManager.doExternalUpdate).  Then the actual changes
are written to the database.  Where this fails is that, in between the
check for external updates and the actual writes to the database,
another Jackrabbit instance may perform an update of its own.  And if
those updates happen to touch the same node, we simply overwrite the
changes of the other node.
This is not just a failed action, or even lost data: actual
inconsistencies will occur.
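
To make the interleaving concrete, here is a minimal stand-alone
illustration of the lost update (everything in this snippet is made up
to show the effect; the "bundle" is just a list of child names, it is
not Jackrabbit code):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Two "cluster nodes" update the same parent bundle without coordination.
    public class LostUpdateDemo {
        static volatile List<String> parentBundle =
                new ArrayList<>(Arrays.asList("a"));

        public static void main(String[] args) throws InterruptedException {
            // Node 1: external-update check sees nothing new, adds child "b",
            // but persists only after a short delay.
            Thread node1 = new Thread(() -> {
                List<String> copy = new ArrayList<>(parentBundle);
                copy.add("b");
                sleep(50);               // window between check and write
                parentBundle = copy;     // overwrites whatever happened meanwhile
            });
            // Node 2: does the same concurrently, adding child "c".
            Thread node2 = new Thread(() -> {
                List<String> copy = new ArrayList<>(parentBundle);
                copy.add("c");
                parentBundle = copy;
            });
            node1.start(); node2.start();
            node1.join(); node2.join();
            // One child is missing from the parent even though the child node
            // itself would have been persisted -> an orphaned node.
            System.out.println("parent children: " + parentBundle);
        }

        static void sleep(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) { }
        }
    }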

I've been able to confirm this using a unit test bashing two cluster
nodes over jcr-rmi.
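
The test depends on our setup, but in essence it does something like the
sketch below (the RMI URLs, credentials and the "shared" parent node are
made up; the consistency checker is run afterwards):

    import javax.jcr.*;
    import org.apache.jackrabbit.rmi.client.ClientRepositoryFactory;

    // Rough sketch of the kind of test used; not the actual test code.
    public class ClusterBashTest {
        public static void main(String[] args) throws Exception {
            ClientRepositoryFactory factory = new ClientRepositoryFactory();
            Repository repo1 = factory.getRepository("rmi://host1:1099/jackrabbit");
            Repository repo2 = factory.getRepository("rmi://host2:1099/jackrabbit");

            Session s1 = repo1.login(new SimpleCredentials("admin", "admin".toCharArray()));
            Session s2 = repo2.login(new SimpleCredentials("admin", "admin".toCharArray()));

            // Both cluster nodes hammer children of the same parent node.
            Thread t1 = new Thread(() -> addChildren(s1, "fromNode1-"));
            Thread t2 = new Thread(() -> addChildren(s2, "fromNode2-"));
            t1.start(); t2.start();
            t1.join(); t2.join();
            s1.logout(); s2.logout();
            // Afterwards run the consistency check on the shared persistence
            // manager; after enough iterations orphaned/missing children appear.
        }

        static void addChildren(Session session, String prefix) {
            try {
                Node parent = session.getRootNode().getNode("shared");
                for (int i = 0; i < 100; i++) {
                    parent.addNode(prefix + i);
                    session.save();
                }
            } catch (RepositoryException e) {
                e.printStackTrace();
            }
        }
    }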

There are solutions to this, but they come at a hefty price; I've got
one of them working.  One of my first attempts was to introduce a
modification count column in the bundle data table.  This field is
actually a duplicate of the modcount in the bundle itself, but when
updating a bundle you then update using "UPDATE .. WHERE id = ?
AND modcount = 'expected-old-modcount'".  You end up with some
additional plumbing to remember the previously persisted modcount.
This does work in part: no changes are overwritten.
However, because Jackrabbit writes a changelog using individual updates
without a transaction, the other updates will still go through, while
it should be all-or-nothing.  The approach that does work is to use
transactions at this low level, but that is quite expensive.
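
In code, the idea is roughly the following (a sketch only; the real
change sits in the bundle persistence manager, and the table/column
names and the BundleUpdate holder are made up):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // One pending bundle write; names here are illustrative, not the real schema.
    class BundleUpdate {
        String id;
        byte[] data;
        int expectedModCount;   // the modcount remembered when the bundle was loaded
        int newModCount;        // expectedModCount + 1
    }

    class OptimisticChangeLogWriter {
        // Writes the whole changelog in one transaction; a stale bundle aborts it all.
        void write(Connection con, Iterable<BundleUpdate> changeLog) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement stmt = con.prepareStatement(
                    "UPDATE bundle SET data = ?, modcount = ? WHERE id = ? AND modcount = ?")) {
                for (BundleUpdate u : changeLog) {
                    stmt.setBytes(1, u.data);
                    stmt.setInt(2, u.newModCount);
                    stmt.setString(3, u.id);
                    stmt.setInt(4, u.expectedModCount);
                    if (stmt.executeUpdate() != 1) {
                        // Another cluster node changed this bundle after our
                        // external-update check: do not overwrite its change.
                        throw new SQLException("stale bundle " + u.id);
                    }
                }
                con.commit();
            } catch (SQLException e) {
                con.rollback();   // all-or-nothing: no partial changelog in the database
                throw e;
            }
        }
    }

The modcount check in the WHERE clause is what prevents the silent
overwrite; the surrounding transaction is what makes the changelog
all-or-nothing again, and that transaction is the expensive part.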

I've tested this, and all consistency issues then go away.

There are other options, like locking tables or using a journal, but
all the options I can think of are quite intrusive as well, while a
simple fix at code level without creating cluster-wide barriers seems
infeasible.  But again, my current fix isn't that nice either.

But perhaps you have other ideas concerning this problem?

\Berry