Hi,

Playing with the TarMK compaction lately, I realized that the process may
create additional files even if globally there is no real need to do so
(not enough garbage to justify running compaction).

The way it works now: you manually trigger the compaction process, which
starts copying content (via a diff) to new files so that the old tar files
can be GC'ed. Once that is done, the cleanup process starts. Cleanup looks
at each tar file and, if the file has > 25% garbage, cleans it up (a new
generation of the file is created containing only the relevant content, no
garbage).
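
To make the per-file rule concrete, here is a minimal Java sketch. It is not
Oak code: the class, the TarStats type and the file names are made up for
illustration, and it assumes the garbage size of each file is already known.
Only the 25% threshold comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupSketch {
    // Threshold from this mail: a file is rewritten only above 25% garbage.
    static final double GARBAGE_THRESHOLD = 0.25;

    /** Hypothetical per-file statistics; not an Oak type. */
    static final class TarStats {
        final String name;
        final long totalBytes;
        final long garbageBytes;

        TarStats(String name, long totalBytes, long garbageBytes) {
            this.name = name;
            this.totalBytes = totalBytes;
            this.garbageBytes = garbageBytes;
        }

        double garbageRatio() {
            return totalBytes == 0 ? 0.0 : (double) garbageBytes / totalBytes;
        }
    }

    /** Returns the names of the files that cleanup would rewrite. */
    static List<String> filesToClean(List<TarStats> files) {
        List<String> result = new ArrayList<>();
        for (TarStats f : files) {
            // Strictly greater than the threshold, matching "> 25%".
            if (f.garbageRatio() > GARBAGE_THRESHOLD) {
                result.add(f.name);
            }
        }
        return result;
    }
}
```

A file sitting at exactly 25% garbage is left alone in this sketch, since
the rule is stated as "> 25%".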

The disconnect between compaction and cleanup can cause even a clean repo
to grow. Each new file has a fixed size of 256MB, so if compaction adds
256MB but the cleanup doesn't find anything to reclaim, your repo will go
up by 256MB for no real reason. Over time this will stabilize, but the
first-time increase can be a bit unexpected. And the bigger the repository,
the bigger the increase.

I'm proposing a solution to alleviate this problem. I'd like to first check
whether there is enough garbage in the repo to justify running compaction:
check each tar file, and only if at least one of them needs cleanup (> 25%
garbage) allow the compaction & cleanup to go through. This should
stabilize the size of a repo that didn't change much since the last
compaction run.
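
As a rough illustration of the proposed check, a minimal Java sketch follows.
The names (CompactionGate, FileStats, shouldCompact) are hypothetical and do
not exist in Oak, and the sketch assumes the amount of garbage per tar file
can be estimated before compaction runs, which is the part that would need
real work.

```java
import java.util.List;

final class CompactionGate {
    // Same 25% threshold that cleanup applies per tar file.
    static final double GARBAGE_THRESHOLD = 0.25;

    /** Hypothetical per-file statistics; not an Oak type. */
    static final class FileStats {
        final long totalBytes;
        final long garbageBytes;

        FileStats(long totalBytes, long garbageBytes) {
            this.totalBytes = totalBytes;
            this.garbageBytes = garbageBytes;
        }

        boolean needsCleanup() {
            return totalBytes > 0
                    && (double) garbageBytes / totalBytes > GARBAGE_THRESHOLD;
        }
    }

    /** Allow compaction and cleanup only if at least one file needs cleanup. */
    static boolean shouldCompact(List<FileStats> files) {
        for (FileStats f : files) {
            if (f.needsCleanup()) {
                return true;
            }
        }
        return false;
    }
}
```

With a gate like this, a repo where no file exceeds the threshold would skip
the run entirely and would not pick up the extra 256MB file.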

I've created OAK-2019 to track this.

Opinions are highly welcome!

alex
