On 7.11.12 9:48, Thomas Mueller wrote:
Hi,

Didn't we talk once about defining a format for blob id references, so
that a value of the format "bin:{blobId}" (or similar) is a reference?
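
As a sketch, such a convention could look like this on the MicroKernel
side; the "bin:" prefix and the BlobReference helper are made up for
illustration, not an agreed format:

    // Hypothetical helper for a "bin:{blobId}" value convention.
    // The "bin:" prefix is an assumption, nothing has been agreed on.
    public final class BlobReference {

        private static final String PREFIX = "bin:";

        /** Returns true if the given property value encodes a blob reference. */
        public static boolean isReference(String value) {
            return value != null && value.startsWith(PREFIX);
        }

        /** Extracts the blob id, e.g. "bin:abc123" yields "abc123". */
        public static String blobId(String value) {
            if (!isReference(value)) {
                throw new IllegalArgumentException("Not a blob reference: " + value);
            }
            return value.substring(PREFIX.length());
        }
    }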

This is exactly the problem I wanted to pinpoint. There is a conceptual leak here: for the Microkernel implementation to know that something is a reference to a binary, it has to know how the upper layers interpret the items in the repository.

Michael


Regards,
Thomas



On 11/7/12 10:17 AM, "Michael Dürig" <mdue...@apache.org> wrote:


On a related note: how does the garbage collector even find out whether
a binary is "referenced"? That is, on the Microkernel level, what does
it actually mean for a binary to be referenced?

Michael

On 6.11.12 18:45, Michael Marth wrote:
this might be a weird question from the leftfield, but are we actually
sure that the existing data store concept is worth the trouble? afaiu it
saves us from storing the same binary twice, but leads into the DSGC
topic. would it be possible to make it optional to store/address
binaries by hash (and thus not need DSGC for these configurations)?
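
To make the store-by-hash idea concrete, here is a minimal sketch of a
content-addressed store, assuming SHA-256 as the hash; this is an
illustration, not the existing data store implementation:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch: the blob id is simply the SHA-256 hash of the content, so
    // identical binaries are stored only once (the deduplication
    // mentioned above). Illustration only, not the Oak data store.
    public class HashAddressedStore {

        private final Path root; // assumed to be an existing directory

        public HashAddressedStore(Path root) {
            this.root = root;
        }

        /** Stores the binary and returns its hash-based id. */
        public String put(byte[] content)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            String id = toHex(md.digest(content));
            Path file = root.resolve(id);
            if (!Files.exists(file)) { // duplicate content is written once
                Files.write(file, content);
            }
            return id;
        }

        public byte[] get(String id) throws IOException {
            return Files.readAllBytes(root.resolve(id));
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }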

In any case we should definitely avoid requiring repository traversal
for DSGC. That would operationally limit the repository sizes Oak can
support.


--
Michael Marth | Engineering Manager
+41 61 226 55 22 | mma...@adobe.com
Barfüsserplatz 6, CH-4001 Basel, Switzerland

On Nov 6, 2012, at 9:24 AM, Thomas Mueller wrote:

Hi,

1- What's considered an "old" node or commit? Technically, anything
other than the head revision is old, but can we remove them right away,
or do we need to retain a number of revisions? If the latter, then how
far back do we need to retain?

We discussed this a while back; there was no good solution back then [1]

Yes. Somebody has to decide which revisions are no longer needed.
Luckily it doesn't need to be us :-) We might set a default value
(10 minutes or so) and then give the user the ability to change that,
depending on whether they care more about disk space or about the
ability to read old data / roll back to an old state.
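
As a sketch, such a policy could boil down to a single configurable
cut-off; the class name and default below are made up for illustration:

    // Hypothetical retention policy: a revision may be collected once
    // it is older than a user-configurable maximum age (default 10 min).
    public class RevisionRetention {

        private volatile long maxAgeMillis = 10 * 60 * 1000L;

        public void setMaxAgeMillis(long maxAgeMillis) {
            this.maxAgeMillis = maxAgeMillis;
        }

        /** True if the revision committed at the given time may go. */
        public boolean isCollectable(long commitTimeMillis, long nowMillis) {
            return nowMillis - commitTimeMillis > maxAgeMillis;
        }
    }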

To free up disk space, BlobStore garbage collection is actually more
important, because usually 90% of the disk space is used by the
BlobStore. So it would be nice if items (files) in the BlobStore were
deleted as soon as possible after deleting old revisions. In Jackrabbit
2.x we have seen that node and data store garbage collection that has
to traverse the whole repository is problematic if the repository is
large. So garbage collection can be a scalability issue: if we have to
traverse all revisions of all nodes in order to delete unused data, we
basically tie garbage collection speed to repository size, unless we
find a way to run it in parallel. But running mark & sweep garbage
collection completely in parallel is not easy (is it even possible? if
so, I would have guessed modern JVMs would have had it for a long
time). So I think that if we don't need to traverse the repository to
delete old nodes, but can just traverse the journal, this would be much
less of a problem.
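
A rough sketch of what journal-based mark & sweep might look like; the
Journal and BlobStore interfaces here are assumptions for illustration,
not actual Oak APIs:

    import java.util.HashSet;
    import java.util.Set;

    // Mark & sweep over the journal instead of the full repository:
    // mark collects the blob ids referenced by the retained revisions,
    // sweep deletes every stored blob that was not marked. The cost
    // scales with the journal and the blob store, not with the number
    // of nodes in the repository.
    public class JournalBlobGc {

        interface Journal {
            Iterable<String> referencedBlobIds(); // ids in retained revisions
        }

        interface BlobStore {
            Iterable<String> allBlobIds();
            void delete(String blobId);
        }

        public static void collect(Journal journal, BlobStore store) {
            // Mark phase: remember every blob id that is still reachable.
            Set<String> marked = new HashSet<String>();
            for (String id : journal.referencedBlobIds()) {
                marked.add(id);
            }
            // Sweep phase: delete every blob that was not marked.
            for (String id : store.allBlobIds()) {
                if (!marked.contains(id)) {
                    store.delete(id);
                }
            }
        }
    }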

Regards,
Thomas



