[ 
https://issues.apache.org/jira/browse/OAK-6921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomek Rękawek updated OAK-6921:
-------------------------------
    Description: 
h3. Rationale

segment-tar, as the name suggests, stores segments in a set of tar archives 
inside the {{segmentstore}} directory on the local file system. In some cases, 
especially in cloud deployments, it may be useful to store the segments 
outside the local FS: remote storage such as Amazon S3, Azure Blob Storage or 
HDFS may be cheaper than a mounted disk, more scalable, easier to provision, 
etc.

h3. Current state

There are 3 classes responsible for handling tar files in segment-tar: 
TarFiles, TarWriter and TarReader. TarFiles manages the {{segmentstore}} 
directory, scans it for .tar files and creates a TarReader for each one. It 
also creates a single TarWriter object, used to write (and also read) the most 
recent tar file.
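
The directory scan described above can be illustrated with a minimal sketch. 
This is not the TarFiles code from Oak: the class name {{TarDirectoryScan}}, 
the method {{listTarFiles}} and the file names in {{main}} are made up for the 
example, and it only shows how the .tar files could be found with the standard 
java.nio API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; this is not the TarFiles code from Oak.
final class TarDirectoryScan {

    private TarDirectoryScan() {
    }

    // Returns the regular files matching *.tar directly inside the directory,
    // sorted by path so that the result is deterministic.
    static List<Path> listTarFiles(Path segmentstore) {
        List<Path> tars = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(segmentstore, "*.tar")) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    tars.add(entry);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(tars);
        return tars;
    }

    public static void main(String[] args) throws IOException {
        // Example file names only; they are assumptions made for this demo.
        Path dir = Files.createTempDirectory("segmentstore");
        Files.createFile(dir.resolve("data00001a.tar"));
        Files.createFile(dir.resolve("data00000a.tar"));
        Files.createFile(dir.resolve("journal.log"));
        for (Path tar : listTarFiles(dir)) {
            // In the design described above, each of these would get its own reader.
            System.out.println(tar.getFileName());
        }
    }
}
```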

The TarWriter appends segments to the latest tar file and also serializes the 
auxiliary indexes: the segment index, the binary references index and the 
segment graph. It also takes care of synchronization, as we're dealing with a 
mutable data structure: a tar file opened in append mode.

The TarReader not only reads the segments from the tar file, but is also 
responsible for the revision GC (mark & sweep methods) and for recovering data 
from files which haven't been closed cleanly (e.g. have no index).

h3. New abstraction layer

In order to store segments somewhere other than tar files, it would be 
possible to create custom implementations of TarFiles, TarWriter and 
TarReader. However, such implementations would duplicate a lot of code not 
strictly related to persistence: mark(), sweep(), synchronization, etc. 
Instead, the attached patch takes a different approach: a new layer of 
abstraction is injected into TarFiles, TarWriter and TarReader. It only takes 
care of segment persistence and knows nothing about synchronization, GC, etc., 
leaving those to the upper layer.
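
The split of responsibilities described above can be shown with a small 
sketch. Everything in it is hypothetical and is not taken from the patch: 
{{SegmentPersistence}} stands in for the injected lower layer, which only 
stores bytes, and {{LockingSegmentStore}} stands in for the upper layer, which 
owns the lock. String segment ids are a simplification made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lower layer: persistence only, no locking and no GC knowledge.
interface SegmentPersistence {
    void write(String segmentId, byte[] data);

    byte[] read(String segmentId); // returns null for an unknown id

    int count();
}

// Deliberately not thread safe; it relies on the caller for synchronization.
final class InMemoryPersistence implements SegmentPersistence {
    private final Map<String, byte[]> segments = new HashMap<>();

    @Override
    public void write(String segmentId, byte[] data) {
        segments.put(segmentId, data.clone());
    }

    @Override
    public byte[] read(String segmentId) {
        byte[] data = segments.get(segmentId);
        return data == null ? null : data.clone();
    }

    @Override
    public int count() {
        return segments.size();
    }
}

// Hypothetical upper layer: owns the lock and delegates storage.
final class LockingSegmentStore {
    private final Object lock = new Object();
    private final SegmentPersistence persistence;

    LockingSegmentStore(SegmentPersistence persistence) {
        this.persistence = persistence;
    }

    void append(String segmentId, byte[] data) {
        synchronized (lock) {
            persistence.write(segmentId, data);
        }
    }

    byte[] read(String segmentId) {
        synchronized (lock) {
            return persistence.read(segmentId);
        }
    }

    int count() {
        synchronized (lock) {
            return persistence.count();
        }
    }
}

final class LayeringDemo {
    public static void main(String[] args) throws InterruptedException {
        LockingSegmentStore store = new LockingSegmentStore(new InMemoryPersistence());
        Thread[] writers = new Thread[4];
        for (int t = 0; t < writers.length; t++) {
            final int writerId = t;
            writers[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) {
                    store.append("segment-" + writerId + "-" + i, new byte[] {(byte) i});
                }
            });
            writers[t].start();
        }
        for (Thread writer : writers) {
            writer.join();
        }
        // All 4 x 1000 distinct ids are stored, because every write held the lock.
        System.out.println(store.count());
    }
}
```

Swapping {{InMemoryPersistence}} for another implementation would not touch 
the locking code, which is the point of keeping the lower layer unaware of 
synchronization.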

The new abstraction layer is modelled using 3 new classes: 
SegmentArchiveManager, SegmentArchiveReader and SegmentArchiveWriter. They are 
closely related to the existing Tar* classes and are used by them.

SegmentArchiveManager provides a set of file system-style methods, like 
open(), create(), delete(), exists(), etc. The open() and create() methods 
return instances of SegmentArchiveReader and SegmentArchiveWriter respectively 
(abbreviated below as SAReader and SAWriter).

SegmentArchiveReader, apart from reading segments, can also load and parse the 
index, graph and binary references. The logic responsible for parsing these 
structures has already been extracted, so it doesn't need to be duplicated in 
the SAReader implementations. Also, the SAReader needs to be aware of the 
index, since it contains the segment offsets.

The SAWriter class allows writing and reading segments and also storing the 
indexes. It isn't thread safe: it assumes that synchronization is already 
done in the higher layers.

In the patch, I've moved the tar implementation to the new classes: 
SegmentTarManager, SegmentTarReader and SegmentTarWriter.
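
To make the three roles more concrete, here is a rough in-memory sketch. Only 
the type names SegmentArchiveManager, SegmentArchiveReader and 
SegmentArchiveWriter and the method names open(), create(), delete() and 
exists() come from the description above. All signatures, the use of 
java.util.UUID as a segment id and the {{InMemoryArchiveManager}} class are 
assumptions made for the example and may differ from the attached patch. The 
index, graph and binary references are left out for brevity.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Type names follow the issue text; every signature below is an assumption.
interface SegmentArchiveReader {
    byte[] readSegment(UUID id); // returns null for an unknown id

    boolean containsSegment(UUID id);
}

interface SegmentArchiveWriter {
    void writeSegment(UUID id, byte[] data);

    byte[] readSegment(UUID id); // the writer can also read what it has written
}

interface SegmentArchiveManager {
    SegmentArchiveReader open(String archiveName); // returns null if the archive is missing

    SegmentArchiveWriter create(String archiveName);

    boolean exists(String archiveName);

    boolean delete(String archiveName);
}

// Hypothetical implementation that keeps archives in a map instead of tar files.
// Like the writer described above, it does no synchronization of its own.
final class InMemoryArchiveManager implements SegmentArchiveManager {
    private final Map<String, Map<UUID, byte[]>> archives = new HashMap<>();

    @Override
    public SegmentArchiveWriter create(String archiveName) {
        Map<UUID, byte[]> entries = new LinkedHashMap<>();
        archives.put(archiveName, entries);
        return new SegmentArchiveWriter() {
            @Override
            public void writeSegment(UUID id, byte[] data) {
                entries.put(id, data.clone());
            }

            @Override
            public byte[] readSegment(UUID id) {
                byte[] data = entries.get(id);
                return data == null ? null : data.clone();
            }
        };
    }

    @Override
    public SegmentArchiveReader open(String archiveName) {
        Map<UUID, byte[]> entries = archives.get(archiveName);
        if (entries == null) {
            return null;
        }
        return new SegmentArchiveReader() {
            @Override
            public byte[] readSegment(UUID id) {
                byte[] data = entries.get(id);
                return data == null ? null : data.clone();
            }

            @Override
            public boolean containsSegment(UUID id) {
                return entries.containsKey(id);
            }
        };
    }

    @Override
    public boolean exists(String archiveName) {
        return archives.containsKey(archiveName);
    }

    @Override
    public boolean delete(String archiveName) {
        return archives.remove(archiveName) != null;
    }
}
```

A storage backend for S3 or Azure would, under the same assumptions, provide 
its own manager, reader and writer, while the Tar* classes keep handling GC 
and locking.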

h3. TODO

* The names and package locations of all the affected classes are subject to 
change. After applying the patch, TarFiles doesn't deal with .tar files 
anymore; similarly, TarReader and TarWriter delegate the low-level file access 
duties to SegmentArchiveReader and SegmentArchiveWriter. I didn't want to 
change the names yet, to make the patch easier to understand and to rebase 
onto trunk changes.
* SegmentNodeStoreService should allow obtaining the SegmentArchiveManager 
service from OSGi (so that implementations can be added in other bundles).

  was:
h3. Rationale

segment-tar, as the name suggests, stores segments in a set of tar archives 
inside the {{segmentstore}} directory on the local file system. In some cases, 
especially in cloud deployments, it may be useful to store the segments 
outside the local FS: remote storage such as Amazon S3, Azure Blob Storage or 
HDFS may be cheaper than a mounted disk, more scalable, easier to provision, 
etc.

h3. Current state

There are 3 classes responsible for handling tar files in segment-tar: 
TarFiles, TarWriter and TarReader. TarFiles manages the {{segmentstore}} 
directory, scans it for .tar files and creates a TarReader for each one. It 
also creates a single TarWriter object, used to write (and also read) the most 
recent tar file.

The TarWriter appends segments to the latest tar file and also serializes the 
auxiliary indexes: the segment index, the binary references index and the 
segment graph. It also takes care of synchronization, as we're dealing with a 
mutable data structure: a tar file opened in append mode.

The TarReader not only reads the segments from the tar file, but is also 
responsible for the revision GC (mark & sweep methods) and for recovering data 
from files which haven't been closed cleanly (e.g. have no index).

h3. New abstraction layer

In order to store segments somewhere other than tar files, it would be 
possible to create custom implementations of TarFiles, TarWriter and 
TarReader. However, such implementations would duplicate a lot of code not 
strictly related to persistence: mark(), sweep(), synchronization, etc. 
Instead, the attached patch takes a different approach: a new layer of 
abstraction is injected into TarFiles, TarWriter and TarReader. It only takes 
care of segment persistence and knows nothing about synchronization, GC, etc., 
leaving those to the upper layer.

The new abstraction layer is modelled using 3 new classes: 
SegmentArchiveManager, SegmentArchiveReader and SegmentArchiveWriter. They are 
closely related to the existing Tar* classes and are used by them.

SegmentArchiveManager provides a set of file system-style methods, like 
open(), create(), delete(), exists(), etc. The open() and create() methods 
return instances of SegmentArchiveReader and SegmentArchiveWriter respectively 
(abbreviated below as SAReader and SAWriter).

SegmentArchiveReader, apart from reading segments, can also load and parse the 
index, graph and binary references. The logic responsible for parsing these 
structures has already been extracted, so it doesn't need to be duplicated in 
the SAReader implementations. Also, the SAReader needs to be aware of the 
index, since it contains the segment offsets.

The SAWriter class allows writing and reading segments and also storing the 
indexes. It isn't thread safe: it assumes that synchronization is already 
done in the higher layers.

In the patch, I've moved the tar implementation to the new classes: 
SegmentTarManager, SegmentTarReader and SegmentTarWriter.

h3. TODO

* The names and package locations of all the affected classes are subject to 
change. After applying the patch, TarFiles doesn't deal with .tar files 
anymore; similarly, TarReader and TarWriter delegate the low-level file access 
duties to SegmentArchiveReader and SegmentArchiveWriter. I didn't want to 
change the names yet, to make the patch easier to understand and to rebase 
onto trunk changes.
* Add JUnit documentation to the new interfaces.
* SegmentNodeStoreService should allow obtaining the SegmentArchiveManager 
service from OSGi (so that implementations can be added in other bundles).


> Support pluggable segment storage
> ---------------------------------
>
>                 Key: OAK-6921
>                 URL: https://issues.apache.org/jira/browse/OAK-6921
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: segment-tar
>            Reporter: Tomek Rękawek
>             Fix For: 1.9.0
>
>         Attachments: OAK-6921.patch
>
>



