As discussed, I've filed a LevelDB issue: https://github.com/google/leveldb/issues/603. So far it seems that the LevelDB behavior we're seeing is unexpected.
I'll post a patch with the temporary workaround that I described in the first email in https://issues.apache.org/jira/browse/MESOS-184. It will be disabled by default.

On Wed, Jul 11, 2018 at 2:23 PM, Judith Malnick <jmaln...@mesosphere.io> wrote:

> Hey Ilya,
>
> If you'd like to generate some real-time conversation about your proposal,
> this might be a good thing to talk about during tomorrow's developer sync
> at 10:00 am Pacific time. If you're interested, please feel free to put it
> on the agenda
> <https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing>!
>
> All the best!
> Judith
>
> On Fri, Jul 6, 2018 at 2:35 PM Benjamin Mahler <bmah...@apache.org> wrote:
>
>> I was chatting with Ilya on Slack and I'll re-post here:
>>
>> * Like Jie, I was hoping for a toggle. (Maybe it should start default-off
>> until we have production experience? It sounds like Ilya already has
>> experience running it in test clusters.)
>>
>> * I was asking whether this would be considered a flaw in LevelDB's
>> compaction algorithm. Ilya didn't see any changes in recent LevelDB
>> releases that would affect this, so we should probably file an issue to
>> see whether they think it's a flaw and whether our workaround makes sense
>> to them. We can reference this in the code for posterity.
>>
>> On Fri, Jul 6, 2018 at 2:24 PM, Jie Yu <j...@mesosphere.io> wrote:
>>
>> > Sounds good to me.
>> >
>> > My only ask is to have a way to turn this feature off (flag, env var,
>> > etc.).
>> >
>> > - Jie
>> >
>> > On Fri, Jul 6, 2018 at 1:39 PM, Vinod Kone <vinodk...@apache.org> wrote:
>> >
>> >> I don't know about the replicated log, but the proposal seems fine to
>> >> me.
>> >>
>> >> Jie/BenM, do you guys have an opinion?
>> >>
>> >> On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham <
>> >> sshanmug...@twitter.com.invalid> wrote:
>> >>
>> >>> +1. Aurora will hugely benefit from this change.
>> >>>
>> >>> On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <ipro...@twopensource.com>
>> >>> wrote:
>> >>>
>> >>> > Hi everyone,
>> >>> >
>> >>> > I'd like to propose adding "manual" LevelDB compaction to the
>> >>> > replicated log truncation process.
>> >>> >
>> >>> > Motivation
>> >>> >
>> >>> > Mesos Master and Aurora Scheduler use the replicated log to persist
>> >>> > information about the cluster. This log is periodically truncated to
>> >>> > prune outdated entries. However, the replicated log storage is not
>> >>> > compacted and grows without bound. This leads to problems like a
>> >>> > synchronous failover of all master/scheduler replicas because all of
>> >>> > them ran out of disk space.
>> >>> >
>> >>> > The only time log storage compaction happens is during recovery.
>> >>> > Because of that, periodic failovers are required to control the
>> >>> > growth of the replicated log storage. But this solution is
>> >>> > suboptimal: failovers are not instant. For example, Aurora Scheduler
>> >>> > needs to recover its storage, which depending on the cluster can
>> >>> > take several minutes. During the downtime tasks cannot be
>> >>> > (re-)scheduled and users cannot interact with the service.
>> >>> >
>> >>> > Proposal
>> >>> >
>> >>> > In MESOS-184 John Sirois pointed out that our usage pattern doesn't
>> >>> > work well with LevelDB's background compaction algorithm.
>> >>> > Fortunately, LevelDB provides a way to force compaction with the
>> >>> > DB::CompactRange() method.
>> >>> > Replicated log storage can trigger it after persisting the learned
>> >>> > TRUNCATE action and deleting the truncated log positions. The
>> >>> > compacted range will be from the previous first position of the log
>> >>> > to the new first position (the one the log was truncated up to).
>> >>> >
>> >>> > Performance impact
>> >>> >
>> >>> > Mesos Master and Aurora Scheduler have two different replicated log
>> >>> > usage profiles. For Mesos Master, every registry update (agent
>> >>> > (re-)registration/marking, maintenance schedule update, etc.)
>> >>> > induces writing a complete snapshot, which depending on the cluster
>> >>> > size can get pretty big (in a scale-test fake cluster with 55k
>> >>> > agents it is ~15MB). Every snapshot is followed by a truncation of
>> >>> > all previous entries, which doesn't block the registrar and happens
>> >>> > more or less in the background. In the scale-test cluster with 55k
>> >>> > agents, compactions after such truncations take ~680ms.
>> >>> >
>> >>> > To reduce the performance impact for the Master, compaction can be
>> >>> > triggered only after more than some configurable number of keys have
>> >>> > been deleted.
>> >>> >
>> >>> > Aurora Scheduler writes incremental changes of its storage to the
>> >>> > replicated log. Every hour a storage snapshot is created and
>> >>> > persisted to the log, followed by a truncation of all entries
>> >>> > preceding the snapshot. Therefore, storage compactions will be
>> >>> > infrequent but will deal with a potentially large number of keys. In
>> >>> > the scale-test cluster such compactions took ~425ms each.
>> >>> >
>> >>> > Please let me know what you think about it.
>> >>> >
>> >>> > Thanks!
>> >>> >
>> >>> > --
>> >>> > Ilya Pronin
>
> --
> Judith Malnick
> Community Manager
> 310-709-1517
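For posterity, here is a minimal sketch of the truncate-then-compact step the proposal describes, including the threshold-based gating mentioned for the Master. Only leveldb::DB::Write(), leveldb::WriteBatch, and leveldb::DB::CompactRange() are real LevelDB API calls; the key encoding, the encode() helper, truncateAndCompact(), and compactionThreshold are illustrative assumptions, not taken from the Mesos code:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    #include <leveldb/db.h>
    #include <leveldb/write_batch.h>

    // Hypothetical key encoding: fixed-width decimal, so that the
    // lexicographic order of keys matches the numeric order of log
    // positions. (The encoding used by the actual replicated log storage
    // may differ.)
    static std::string encode(uint64_t position) {
      char key[21];
      std::snprintf(key, sizeof(key), "%020llu",
                    static_cast<unsigned long long>(position));
      return std::string(key);
    }

    // Deletes log positions [first, upTo) after a learned TRUNCATE action,
    // then compacts exactly the deleted range: from the previous first
    // position of the log to the new first position. `compactionThreshold`
    // is the configurable knob discussed above: skip the relatively
    // expensive CompactRange() call unless enough keys were deleted.
    leveldb::Status truncateAndCompact(
        leveldb::DB* db,
        uint64_t first,    // Previous first position of the log.
        uint64_t upTo,     // New first position (truncated up to).
        uint64_t compactionThreshold) {
      leveldb::WriteBatch batch;
      for (uint64_t position = first; position < upTo; ++position) {
        batch.Delete(encode(position));
      }

      leveldb::Status status = db->Write(leveldb::WriteOptions(), &batch);
      if (!status.ok()) {
        return status;
      }

      if (upTo - first > compactionThreshold) {
        // Keep the key strings alive for the duration of the call:
        // a leveldb::Slice does not own its data.
        const std::string beginKey = encode(first);
        const std::string endKey = encode(upTo);
        const leveldb::Slice begin(beginKey);
        const leveldb::Slice end(endKey);
        db->CompactRange(&begin, &end);
      }

      return status;
    }

Note that CompactRange(nullptr, nullptr) would compact the entire database; restricting the range to just the truncated positions is what keeps each pause bounded (the ~680ms and ~425ms figures from the scale tests above).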