On 4.2.12 20:07, Stefan Guggisberg wrote:
On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<[email protected]>  wrote:

Hi,

While working on a new transient space implementation [1] I discovered that
recreating a list of changes from transient modifications to the hierarchy
is like opening a can of worms. Rethinking things it turned out, that it is
much simpler to directly consolidate the change log. This approach is much
more flexible and versatile since knowledge of the persisted hierarchy is
not necessary at all. Thus this approach should work equally well with
change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.

The algorithm is capable of creating minimal change logs from any valid
change log (that is from any change log that can be applied to some
hierarchy).

I extensively tested the algorithm on randomly created change logs [3]. This
resulted in some interesting numbers: consolidation results in an decrease
of about 45% of these randomly generated change logs. Furthermore the
resulting size of the H2 database files *decreased up to 200%*!

The implementation [2] is currently part of the jackrabbit-microkernel
module in the sandbox. However it does not have any crucial dependencies so
it should be easy to adapt for other usages.

The gory details of the algorithm are based on some algebraic properties of
move operations on paths as detailed in the following paragraphs.

The change log consolidation algorithm implemented in the reduce method [2]
is based on algebraic properties of move operations on paths. The other
operations (add node, remove node and set property) are generalized to move
operations. Consolidation relies on reduction and commutation rules of move
operations.

A move operation resembles a map on a hierarchy (of nodes and properties). A
change log consisting of k move operations m_1 to m_k is thus the
composition of the individual moves: m_1 *, ..., * m_k (using * for function
composition: f(g(x)) = (f * g)(x)).

Some definitions, notation and propositions:

* Let NIL denote a path which never occurs in any hierarchy.

* Order on paths: let p, q be paths.
  - p<  q iff p != NIL and q != NIL and p is an ancestor of q.
  - p<= q iff p<  q or p == q != NIL

* Conflict of paths: let p, q be paths.
  - p ~ q (p conflicts with q) iff either p<= q or q<= p

* Substitution in paths: let p, q, r be paths.
  - [p ->  q]r = r if p is not an ancestor of r and
  - [p ->  q]r = s where s is the path resulting from replacing
    the ancestor p in r with q otherwise.

* Let p, q be paths. Then p:q denotes a move operation where the
  node at p is moved to a new node at q.

* Valid moves: leq p, q be paths.
  - p:q is valid iff p !~ q or p = q
  - if p:q is not valid, it is invalid

Invalid moves are exactly those which would result in a node being moved to
an ancestor/descendant of itself.

* Identity on moves: let p, q be paths.
  - p:q = ID iff p = q.

* Conflict on moves: let p, q, r, s be paths.
  - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
    or q ~ r or q ~ s.

* Strict commutativity of moves: let m, n be moves.
  - m * n = n * m iff m !~ n

* Substitutions in moves: let p, q, r, s be paths.
  - [p ->  q]r:s = [p ->  q]r:[p ->  q]s

* Let p be a path and let +p denote an add node operation and let -p
  denote a remove node operation for a node at path p.
  - +p = NIL:p That is, adding a node is represented by a move from a
    unknown source.
  - p = p:NIL. That is, removing a node is represented by a move to an
    unknown sink.


Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID and
n != ID. Then the following reduction and commutation rules apply:

1.  p!~ r:  m * n = n * m
2.  p<  r:  illegal (since this implies q<= r which implies p ~ q and
            thus m invalid)
3.  p = r:  illegal (since this implies q<= r which implies p ~ q and
            this m invalid)
4.  p>  r:  does not commute if q<  s. Otherwise m * n = n * [r ->  s]m
5.  p!~ s:  m * n = n * m
6.  p<  s:  illegal (since this implies p ~ q and thus m invalid)
7.  p = s:  does not commute
8.  p>  s:  illegal (since p>  s implies there is an s already which
            will conflict with r:s)
9.  q!~ r:  m * n = n * m
10. q<  r:  m * n = [q ->  p]n * m
11. q = r:  m * n = p:s (transitivity of moves)
12. q>  r:  m * n = n * [r ->  s]m
13. q!~ s:  m * n = n * m
14. q<  s:  does not commute if p>  r. Otherwise m * n = [q ->  p]n * m
15. q = s:  illegal (since s conflicts with r:s)
16. q>  s:  illegal (since s conflicts with r:s)

Allowing add node and remove node operations the following additional
conditions apply:

Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
reduction and commutations rules 1. to 16. apply with extra conditions on
4., 10., 12. and 14.:

4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
     m, n do not commute.
10'. illegal if p = NIL
12'. if s = NIL then m * n = -r * -p
14'. illegal if p = NIL

sounds very complicated ;) i admit that i don't
fully understand it (ok, i didn't try very hard
and gave up somewhere in the middle).

It is actually much simpler than it looks. Have a look at the code and the inline comments for additional hints and my explanations below.

could you please provide a simple example?

What I basically did is identify all cases of how moves, adds and removes interact. This turned out to be much simpler when conceptually handling adds as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and then specialising interaction rules for adds and deletes from there.

One kind of interaction is reductions like

>/a:/b, >/b:/c = >/a:/c
or
-/a, -/a/b = -/a
...

The other kind of interaction is commutation like

>/a:/b, >/c:/d = >/c:/d, >/a:/b
or
> /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x
...

With a complete list of such interactions (see 1. - 16. in my initial post) it is possible to reduce a change log to its minimal equivalent by subsequent application of reductions and commutations: Given a reduced change list and a new operation inserted at index k, try to reduce it with the operation at k - 1. If this fails commute it with the operation at k - 1 and repeat at k - 1 until either a reduction has been found or commutation fails. In the latter case do the same going forward (to k + 1). If still no reduction could be found, we are done and the change list is minimized. Otherwise, recursively apply the reduction algorithm treating the result of the reduction as the new operation inserted into the rest of the change list. This is exactly what is implemented in reduce(List<Operation>, int index). See http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup

Example:
>/a:/x +/x/y:{} >/x:/b =  (commutation of 1 and 2, rule 14.)
+/a/y:{} >/a:/x >/x:/b =  (reduction of 2 and 3, rule 11.)
+/a/y:{} >/a:/b

Some key observations which make reduce() work:
- The interaction rules do not depend on the persisted hierarchy: If a change log can be applied to a hierarchy, its minimized equivalent can be applied to the same hierarchy. More over the observable effect on that hierarchy will be the same.

- The reduce algorithm is minimizing since whether a reduction can occur or not does not depend on 'where it occurs'. Above example could as well reduce as follows:

>/a:/x +/x/y:{} >/x:/b =  (commutation of 2 and 3, rule 12.)
>/a:/x >/x:/b +/b/y =     (reduction of 1 and 2, rule 11.)
>/a:/b +/b/y:{}

what is the benefit of this approach compared
to the imlementation in jackrabbit-core
(which does consolidate transient changes)?

- Versatility: it is not bound to any implementation, nor does it need the actual hierarchy (i.e. it is stateless). It is basically a realisation of a map from change logs to change logs.

- Self contained: everything in one single quite small class. No dependencies, no clutter.

- Much simpler.

- Provably sound and complete.

- Provably minimizing on the change logs.

- Portable since it doesn't depend on a specific underlying implementation.

- Reduction of up to 45% of the change log communicated from the upper layers to the Microkernel.

- In the case of H2, reduction of up to 200% of database size.

Apart from the benefits, it is a plain necessity for transient space implementations on top of the Microkernel: without proper consolidation users could end up not being able to save their (valid) transient modifications. Consider a first user doing

>/a:/t/x +/t/x/y:{} >/t/x:/b

if in the meanwhile another user removes x the save operation for the first user would fail.

After consolidation, above change log would look like

+/a/y:{} >/a:/b

and can be saved.

Michael


cheers
stefan


Michael


[1] http://markmail.org/message/qkkcvtmtapas2cx4
[2]
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
[3]
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Reply via email to