I was chatting with Ilya on Slack and I'll re-post here:

* Like Jie, I was hoping for a toggle (maybe it should default to off
until we have production experience? It sounds like Ilya already has
experience running it in test clusters.)

* I was asking whether this would be considered a flaw in LevelDB's
compaction algorithm. Ilya didn't see any changes in recent LevelDB
releases that would affect this. So, we should probably file an issue to
see if they think it's a flaw and whether our workaround makes sense to
them. We can reference this in the code for posterity.

On Fri, Jul 6, 2018 at 2:24 PM, Jie Yu <j...@mesosphere.io> wrote:

> Sounds good to me.
>
> My only ask is to have a way to turn this feature off (flag, env var, etc)
>
> - Jie
>
> On Fri, Jul 6, 2018 at 1:39 PM, Vinod Kone <vinodk...@apache.org> wrote:
>
>> I don't know about the replicated log, but the proposal seems fine to me.
>>
>> Jie/BenM, do you guys have an opinion?
>>
>> On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham <
>> sshanmug...@twitter.com.invalid> wrote:
>>
>>> +1. Aurora will hugely benefit from this change.
>>>
>>> On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <ipro...@twopensource.com>
>>> wrote:
>>>
>>> > Hi everyone,
>>> >
>>> > I'd like to propose adding "manual" LevelDB compaction to the
>>> > replicated log truncation process.
>>> >
>>> > Motivation
>>> >
>>> > Mesos Master and Aurora Scheduler use the replicated log to persist
>>> > information about the cluster. This log is periodically truncated to
>>> > prune outdated log entries. However, the replicated log storage is
>>> > not compacted and grows without bound. This can lead to problems
>>> > such as all master/scheduler replicas failing over simultaneously
>>> > because all of them have run out of disk space.
>>> >
>>> > The only time log storage compaction happens is during recovery.
>>> > Because of that, periodic failovers are required to control
>>> > replicated log storage growth, but this workaround is suboptimal.
>>> > Failovers are not instant: e.g. Aurora Scheduler needs to recover its
>>> > storage, which, depending on the cluster, can take several minutes.
>>> > During that downtime tasks cannot be (re-)scheduled and users cannot
>>> > interact with the service.
>>> >
>>> > Proposal
>>> >
>>> > In MESOS-184 John Sirois pointed out that our usage pattern doesn't
>>> > work well with LevelDB's background compaction algorithm. Fortunately,
>>> > LevelDB provides a way to force compaction: the DB::CompactRange()
>>> > method. The replicated log storage can trigger it after persisting a
>>> > learned TRUNCATE action and deleting the truncated log positions. The
>>> > compacted range will span from the previous first position of the log
>>> > to the new first position (the one the log was truncated up to).
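>>> >
>>> > For illustration, here's a minimal sketch of what that could look
>>> > like against the LevelDB C++ API. This isn't the actual patch: the
>>> > encode() helper and the fixed-width key encoding are assumptions for
>>> > this sketch (the real storage has its own position encoding).
>>> >
>>> >   #include <cstdint>
>>> >   #include <cstdio>
>>> >   #include <string>
>>> >
>>> >   #include <leveldb/db.h>
>>> >   #include <leveldb/slice.h>
>>> >
>>> >   // Hypothetical helper: encode a log position as a fixed-width
>>> >   // decimal string so lexicographic key order matches numeric order.
>>> >   static std::string encode(uint64_t position) {
>>> >     char buf[32];
>>> >     std::snprintf(buf, sizeof(buf), "%020llu",
>>> >                   (unsigned long long) position);
>>> >     return std::string(buf);
>>> >   }
>>> >
>>> >   // After a learned TRUNCATE action is persisted and positions in
>>> >   // [oldFirst, newFirst) are deleted, force compaction of exactly
>>> >   // that range so the deleted keys are actually dropped from disk.
>>> >   void compactTruncatedRange(
>>> >       leveldb::DB* db, uint64_t oldFirst, uint64_t newFirst) {
>>> >     const std::string begin = encode(oldFirst);
>>> >     const std::string end = encode(newFirst);
>>> >     const leveldb::Slice beginSlice(begin);
>>> >     const leveldb::Slice endSlice(end);
>>> >     db->CompactRange(&beginSlice, &endSlice);
>>> >   }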
>>> >
>>> > Performance impact
>>> >
>>> > Mesos Master and Aurora Scheduler have two different replicated log
>>> > usage profiles. For the Mesos Master, every registry update (agent
>>> > (re-)registration/marking, maintenance schedule update, etc.) induces
>>> > writing a complete snapshot, which depending on the cluster size can
>>> > get pretty big (in a scale test fake cluster with 55k agents it is
>>> > ~15MB). Every snapshot is followed by a truncation of all previous
>>> > entries, which doesn't block the registrar and happens effectively in
>>> > the background. In the scale test cluster with 55k agents compactions
>>> > after such truncations take ~680ms.
>>> >
>>> > To reduce the performance impact on the Master, compaction can be
>>> > triggered only after more than a configurable number of keys have
>>> > been deleted.
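>>> >
>>> > A rough sketch of that throttling idea (the names and the threshold
>>> > plumbing are hypothetical):
>>> >
>>> >   #include <cstddef>
>>> >
>>> >   // Count deletions and only signal a compaction once at least
>>> >   // `threshold` keys were deleted since the last one. The threshold
>>> >   // would come from a flag, which also gives us a way to tune or
>>> >   // effectively disable the feature.
>>> >   class CompactionThrottle {
>>> >    public:
>>> >     explicit CompactionThrottle(size_t threshold)
>>> >       : threshold(threshold) {}
>>> >
>>> >     // Record `n` deleted keys; returns true when a compaction
>>> >     // should be triggered.
>>> >     bool onDeleted(size_t n) {
>>> >       deleted += n;
>>> >       if (deleted >= threshold) {
>>> >         deleted = 0;
>>> >         return true;
>>> >       }
>>> >       return false;
>>> >     }
>>> >
>>> >    private:
>>> >     const size_t threshold;
>>> >     size_t deleted = 0;
>>> >   };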
>>> >
>>> > Aurora Scheduler writes incremental changes of its storage to the
>>> > replicated log. Every hour a storage snapshot is created and persisted
>>> > to the log, followed by a truncation of all entries preceding the
>>> > snapshot. Therefore, storage compactions will be infrequent but will
>>> > deal with a potentially large number of keys. In the scale test
>>> > cluster such compactions took ~425ms each.
>>> >
>>> > Please let me know what you think about it.
>>> >
>>> > Thanks!
>>> >
>>> > --
>>> > Ilya Pronin
>>> >
>>>
>>
>
