Hey Ilya,

If you'd like to generate some real-time conversation about your proposal, this might be a good topic for tomorrow's developer sync at 10:00 AM Pacific time. If you're interested, please feel free to put it on the agenda <https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing>!

All the best!
Judith
On Fri, Jul 6, 2018 at 2:35 PM Benjamin Mahler <[email protected]> wrote:

> I was chatting with Ilya on Slack and I'll re-post here:
>
> * Like Jie, I was hoping for a toggle (maybe it should start default off until we have production experience? It sounds like Ilya already has experience with it running in test clusters so far.)
>
> * I was asking whether this would be considered a flaw in leveldb's compaction algorithm. Ilya didn't see any changes in recent leveldb releases that would affect this, so we should probably file an issue to see if they think it's a flaw and whether our workaround makes sense to them. We can reference this in the code for posterity.
>
> On Fri, Jul 6, 2018 at 2:24 PM, Jie Yu <[email protected]> wrote:
>
> > Sounds good to me.
> >
> > My only ask is to have a way to turn this feature off (flag, env var, etc.)
> >
> > - Jie
> >
> > On Fri, Jul 6, 2018 at 1:39 PM, Vinod Kone <[email protected]> wrote:
> >
> > > I don't know about the replicated log, but the proposal seems fine to me.
> > >
> > > Jie/BenM, do you guys have an opinion?
> > >
> > > On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham <[email protected]> wrote:
> > >
> > > > +1. Aurora will hugely benefit from this change.
> > > >
> > > > On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <[email protected]> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I'd like to propose adding "manual" LevelDB compaction to the replicated log truncation process.
> > > > >
> > > > > Motivation
> > > > >
> > > > > Mesos Master and Aurora Scheduler use the replicated log to persist information about the cluster. This log is periodically truncated to prune outdated log entries. However, the replicated log storage is not compacted and grows without bound. This leads to problems like a synchronous failover of all master/scheduler replicas because all of them ran out of disk space.
> > > > >
> > > > > The only time log storage compaction happens is during recovery. Because of that, periodic failovers are required to control replicated log storage growth. But this solution is suboptimal: failovers are not instant. For example, Aurora Scheduler needs to recover its storage, which, depending on the cluster, can take several minutes. During that downtime tasks cannot be (re-)scheduled and users cannot interact with the service.
> > > > >
> > > > > Proposal
> > > > >
> > > > > In MESOS-184 John Sirois pointed out that our usage pattern doesn't work well with LevelDB's background compaction algorithm. Fortunately, LevelDB provides a way to force compaction with the DB::CompactRange() method. The replicated log storage can trigger it after persisting a learned TRUNCATE action and deleting the truncated log positions. The compacted range will be from the previous first position of the log to the new first position (the one the log was truncated up to).
> > > > >
> > > > > Performance impact
> > > > >
> > > > > Mesos Master and Aurora Scheduler have two different replicated log usage profiles. For Mesos Master, every registry update (agent (re-)registration/marking, maintenance schedule update, etc.) induces writing a complete snapshot, which depending on the cluster size can get pretty big (in a scale test fake cluster with 55k agents it is ~15MB). Every snapshot is followed by a truncation of all previous entries, which doesn't block the registrar and happens in the background. In the scale test cluster with 55k agents, compactions after such truncations take ~680ms.
> > > > >
> > > > > To reduce the performance impact for the Master, compaction can be triggered only after more than a configurable number of keys has been deleted.
> > > > >
> > > > > Aurora Scheduler writes incremental changes of its storage to the replicated log. Every hour a storage snapshot is created and persisted to the log, followed by a truncation of all entries preceding the snapshot. Therefore, storage compactions will be infrequent but will deal with a potentially large number of keys. In the scale test cluster such compactions took ~425ms each.
> > > > >
> > > > > Please let me know what you think about it.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Ilya Pronin

--
Judith Malnick
Community Manager
310-709-1517
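
For context on the mechanism proposed above, here is a minimal sketch of what triggering compaction from the truncation path could look like against the LevelDB C++ API. The key encoding, the helper names, the on/off toggle, and the deleted-keys threshold are illustrative assumptions for this sketch, not the actual Mesos change.

    #include <cstdint>
    #include <cstdio>
    #include <string>

    #include <leveldb/db.h>
    #include <leveldb/write_batch.h>

    // Assumed key encoding for the sketch: fixed-width decimal so that
    // keys sort in position order.
    static std::string encode(uint64_t position)
    {
      char key[32];
      std::snprintf(key, sizeof(key), "%.20llu",
                    static_cast<unsigned long long>(position));
      return std::string(key);
    }

    // Deletes log positions in [first, to) and then compacts exactly that
    // key range, provided compaction is enabled and enough keys were
    // deleted to be worth the extra work (both knobs are hypothetical).
    leveldb::Status truncateAndCompact(
        leveldb::DB* db,
        uint64_t first,            // Previous first position of the log.
        uint64_t to,               // New first position (truncated up to).
        bool compactionEnabled,    // The requested toggle (flag or env var).
        uint64_t minKeysToCompact) // Threshold for the Master profile.
    {
      leveldb::WriteBatch batch;
      for (uint64_t position = first; position < to; ++position) {
        batch.Delete(encode(position));
      }

      leveldb::Status status = db->Write(leveldb::WriteOptions(), &batch);
      if (!status.ok()) {
        return status;
      }

      if (compactionEnabled && (to - first) >= minKeysToCompact) {
        const std::string begin = encode(first);
        const std::string end = encode(to);
        const leveldb::Slice beginSlice(begin);
        const leveldb::Slice endSlice(end);

        // Forces compaction of the deleted range so that the tombstones
        // (and the obsolete values beneath them) are dropped from disk
        // right away instead of waiting for background compaction.
        db->CompactRange(&beginSlice, &endSlice);
      }

      return leveldb::Status::OK();
    }

In the actual change this logic would presumably live in the replicated log's LevelDB storage layer next to the existing truncation path, with the toggle wired to whatever flag or environment variable the thread above settles on.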
