[ 
https://issues.apache.org/jira/browse/OAK-7932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Axel Hanikel updated OAK-7932:
------------------------------
    Description: 
h1. Outline

This issue documents some proof-of-concept work for adapting the segment tar 
nodestore to a
 distributed environment. The main idea is to adopt an actor-like model, 
meaning:

*   Communication between actors (services) is done exclusively via messages.
*   An actor (which could also be a thread) processes one message at a time, 
avoiding sharing
     state with other actors as far as possible.
*   Nodestates are kept in RAM and are written to external storage lazily only 
for disaster recovery.
*   A nodestate is identified by a uuid, which in turn is a hash of its 
serialised string representation.
*   As RAM is a very limited resource, different actors own their share of the 
total uuid space.
*   An actor might also cache a few nodestates which it does not own but which 
it uses often (such as
     the one containing the root node).

h1. Implementation

*The first idea* was to use the segment node store with ZeroMQ for 
communication, because ZeroMQ seems to be a high-quality and easy-to-use 
implementation. A major drawback is that the library is written in C, and the 
Java binding which does the JNI work seems hard to set up and did not work for 
me. There is a native Java implementation of the ZeroMQ protocol, aptly called 
jeromq, which seems to work well so far. This approach will probably not be 
pursued further, though, because of the way data is stored in segments it is 
hard to cache: it seems like a large part of the repository would eventually 
end up in the cache.

*A second implementation* was a simple nodestore which is kind of a dual to 
the segment store, in the sense that it sits at the other end of the 
compactness spectrum. The segment store is very dense and avoids duplication 
wherever possible. The nodestore in this implementation, however, is quite 
redundant: every nodestate gets its own UUID (a hash of its serialised form) 
and is saved together with its properties, similar to the document node store.

Here is what a serialised nodestate looks like:
{noformat}
begin ZeroMQNodeState
begin children
allow   856d1356-7054-3993-894b-a04426956a78
end children
begin properties
jcr:primaryType <NAME> = rep:ACL
:childOrder <NAMES> = [allow]
end properties
end ZeroMQNodeState
{noformat}

This verbose format is good for debugging but expensive to generate and parse, 
so it may be replaced with a binary format at some point. But it shows how 
child nodestates are referenced and how properties are represented. Binary 
properties (not shown here) are represented by a reference to the blob store.
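For illustration, a minimal parser for one section of this format could look 
like the sketch below; the class and method names are hypothetical, and the 
real (de)serialisation code is more involved:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of parsing one "begin <section> ... end <section>" block of
// the verbose format above into a name -> value map. For "children" the value
// is the child's uuid; for "properties" it is the "<TYPE> = value" remainder.
public class VerboseFormatSketch {

    static Map<String, String> parseSection(String serialised, String section) {
        Map<String, String> result = new LinkedHashMap<>();
        boolean inSection = false;
        for (String line : serialised.split("\n")) {
            if (line.equals("begin " + section)) { inSection = true; continue; }
            if (line.equals("end " + section)) { break; }
            if (inSection) {
                // Split on the first run of whitespace: name, then the rest.
                String[] parts = line.split("\\s+", 2);
                result.put(parts[0], parts[1]);
            }
        }
        return result;
    }
}
{code}

Run against the example above, {{parseSection(s, "children")}} maps {{allow}} 
to its child uuid, and {{parseSection(s, "properties")}} maps each property 
name to its typed value.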

The redundancy (compared with the segment store and its fine-grained record 
structure) wastes space, but on the other hand garbage collection (not yet 
implemented) is easier: there is no segment that needs to be rewritten to get 
rid of data that is no longer referenced; unreferenced nodes can simply be 
deleted. This implementation still has bugs, but being much simpler than the 
segment store, it can eventually be used to experiment with different 
configurations and examine their performance.
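The garbage collection idea reduces to a plain reachability walk over the uuid 
graph. A hypothetical mark-and-sweep sketch (names and the in-memory store are 
assumptions for illustration):

{code:java}
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: because every nodestate is a standalone record keyed by uuid,
// collection is a reachability walk from the journal roots followed by
// deleting whatever was not visited. No segment rewriting is needed.
public class SimpleGc {

    // store: uuid -> uuids of the child nodestates it references
    static Set<String> reachable(Map<String, List<String>> store,
                                 Collection<String> roots) {
        Set<String> marked = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            String uuid = stack.pop();
            if (marked.add(uuid)) {                       // first visit only
                stack.addAll(store.getOrDefault(uuid, List.of()));
            }
        }
        return marked;
    }

    // Sweep: drop every record the mark phase did not reach.
    static void sweep(Map<String, List<String>> store, Collection<String> roots) {
        store.keySet().retainAll(reachable(store, roots));
    }
}
{code}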

*A third implementation*, at 
[https://github.com/ahanikel/jackrabbit-oak/tree/zeromq-nodestore], is based on 
the second one but uses a more compact serialisation format, which records the 
differences between a new nodestate and its base state, similar to what 
{{.compareAgainstBaseState()}} produces in code. It looks like this:

{noformat}
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n: 823f2252-db37-b0ca-3f7e-09cd073b530a 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n+ root 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n+ checkpoints 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n+ blobs 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n!
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 journal golden 823f2252-db37-b0ca-3f7e-09cd073b530a
cc97d19b-efa9-4f64-a5e9-a94d9418f271-151 checkpoints golden 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n: 026cb997-d3f7-2c9f-a5ad-d7396b86c267 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 p+ :clusterId <STRING> = 0cb833ee-2678-4383-93dd-fe993cbfd56a
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n!
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n: 3ceb887c-0875-1171-94d2-df9bc8a57edd 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n+ :clusterConfig 026cb997-d3f7-2c9f-a5ad-d7396b86c267
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n!
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n: ce8ace33-9d80-d6e1-c4ba-f0882f75e9c9 823f2252-db37-b0ca-3f7e-09cd073b530a
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n^ root 3ceb887c-0875-1171-94d2-df9bc8a57edd 00000000-0000-0000-0000-000000000000
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 n!
cc97d19b-efa9-4f64-a5e9-a94d9418f271-26 journal golden ce8ace33-9d80-d6e1-c4ba-f0882f75e9c9
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64+ 516f4f64-4e5c-432e-b9bf-0267c1e96541
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d ClVVRU5DT0RFKDEpICAgICAgICAgICAgICAgQlNEIEdlbmVyYWwgQ29tbWFuZHMgTWFudWFsICAg
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d ICAgICAgICAgICBVVUVOQ09ERSgxKQoKTghOQQhBTQhNRQhFCiAgICAgdQh1dQh1ZAhkZQhlYwhj
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d bwhvZAhkZQhlLCB1CHV1CHVlCGVuCG5jCGNvCG9kCGRlCGUgLS0gZW5jb2RlL2RlY29kZSBhIGJp
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d bmFyeSBmaWxlCgpTCFNZCFlOCE5PCE9QCFBTCFNJCElTCFMKICAgICB1CHV1CHVlCGVuCG5jCGNv
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d CG9kCGRlCGUgWy0ILW0IbV0gWy0ILW8IbyBfCG9fCHVfCHRfCHBfCHVfCHRfCF9fCGZfCGlfCGxf
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d CGVdIFtfCGZfCGlfCGxfCGVdIF8Ibl8IYV8IbV8IZQogICAgIHUIdXUIdWQIZGUIZWMIY28Ib2QI
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d ZGUIZSBbLQgtYwhjaQhpcAhwcwhzXSBbXwhmXwhpXwhsXwhlIF8ILl8ILl8ILl0KICAgICB1CHV1
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d CHVkCGRlCGVjCGNvCG9kCGRlCGUgWy0ILWkIaV0gLQgtbwhvIF8Ib18IdV8IdF8IcF8IdV8IdF8I
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64d X18IZl8IaV8IbF8IZSBbXwhmXwhpXwhsXwhlXQo=
cc97d19b-efa9-4f64-a5e9-a94d9418f271-38 b64!
{{n:}} starts a nodestate, giving its uuid and the uuid of its base state, and 
{{n!}} ends it. {{n+}} and {{n^}} both reference a nodestate which has been 
defined before: {{n+}} says that a child node has been added, giving its name 
and uuid, while {{n^}} means an existing child node has been changed, giving 
its name, its new uuid and the uuid of its base state. This is slightly 
different from the format described in OAK-7849, which does not record the 
uuids. {{b64+}} adds a new base64-encoded blob with the given uuid, followed by 
the encoded data in {{b64d}} lines and terminated by {{b64!}}.
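A writer producing these records from a diff against the base state could be 
sketched as follows; the class, method and parameter names are hypothetical, 
and a real implementation would be driven by a 
{{compareAgainstBaseState()}}-style callback rather than prebuilt maps:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: turn the added and changed children of a nodestate diff into the
// "n:" / "n+" / "n^" / "n!" records shown above (without the instance/offset
// prefix). Illustrative only.
public class DiffWriterSketch {

    // changedChildren maps a child name to { newUuid, baseUuid }.
    static String record(String uuid, String baseUuid,
                         Map<String, String> addedChildren,
                         Map<String, String[]> changedChildren) {
        StringBuilder out = new StringBuilder();
        out.append("n: ").append(uuid).append(' ').append(baseUuid).append('\n');
        addedChildren.forEach((name, childUuid) ->
            out.append("n+ ").append(name).append(' ')
               .append(childUuid).append('\n'));
        changedChildren.forEach((name, uuids) ->
            out.append("n^ ").append(name).append(' ').append(uuids[0])
               .append(' ').append(uuids[1]).append('\n'));
        out.append("n!\n");
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> added = new LinkedHashMap<>();
        added.put("root", "00000000-0000-0000-0000-000000000000");
        System.out.print(record("823f2252-db37-b0ca-3f7e-09cd073b530a",
                "00000000-0000-0000-0000-000000000000",
                added, new LinkedHashMap<>()));
    }
}
{code}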


> A distributed node store for the cloud
> --------------------------------------
>
>                 Key: OAK-7932
>                 URL: https://issues.apache.org/jira/browse/OAK-7932
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: segment-tar
>            Reporter: Axel Hanikel
>            Assignee: Axel Hanikel
>            Priority: Minor
>


