As discussed, I've filed a LevelDB issue: https://github.com/google/leveldb/issues/603. So far it seems that the LevelDB behavior we're seeing is unexpected.
I'll post a patch with the temporary workaround that I described in the first email in https://issues.apache.org/jira/browse/MESOS-184. It will be disabled by default.

On Wed, Jul 11, 2018 at 2:23 PM, Judith Malnick <jmaln...@mesosphere.io> wrote:

> Hey Ilya,
>
> If you'd like to generate some real-time conversation about your proposal,
> this might be a good thing to talk about during tomorrow's developer sync
> at 10:00 am Pacific time. If you're interested, please feel free to put it
> on the agenda
> <https://docs.google.com/document/d/153CUCj5LOJCFAVpdDZC7COJDwKh9RDjxaTA0S7lzwDA/edit?usp=sharing>!
>
> All the best!
> Judith
>
> On Fri, Jul 6, 2018 at 2:35 PM Benjamin Mahler <bmah...@apache.org> wrote:
>
>> I was chatting with Ilya on Slack and I'll re-post here:
>>
>> * Like Jie, I was hoping for a toggle. (Maybe it should start default-off
>> until we have production experience? It sounds like Ilya already has
>> experience running it in test clusters.)
>>
>> * I was asking whether this would be considered a flaw in LevelDB's
>> compaction algorithm. Ilya didn't see any changes in recent LevelDB
>> releases that would affect this, so we should probably file an issue to
>> see whether they think it's a flaw and whether our workaround makes sense
>> to them. We can reference this in the code for posterity.
>>
>> On Fri, Jul 6, 2018 at 2:24 PM, Jie Yu <j...@mesosphere.io> wrote:
>>
>> > Sounds good to me.
>> >
>> > My only ask is to have a way to turn this feature off (flag, env var,
>> > etc.).
>> >
>> > - Jie
>> >
>> > On Fri, Jul 6, 2018 at 1:39 PM, Vinod Kone <vinodk...@apache.org> wrote:
>> >
>> >> I don't know about the replicated log, but the proposal seems fine to
>> >> me.
>> >>
>> >> Jie/BenM, do you guys have an opinion?
>> >>
>> >> On Mon, Jul 2, 2018 at 10:57 PM Santhosh Kumar Shanmugham <
>> >> sshanmug...@twitter.com.invalid> wrote:
>> >>
>> >>> +1. Aurora will hugely benefit from this change.
>> >>>
>> >>> On Mon, Jul 2, 2018 at 4:49 PM Ilya Pronin <ipro...@twopensource.com>
>> >>> wrote:
>> >>>
>> >>> > Hi everyone,
>> >>> >
>> >>> > I'd like to propose adding "manual" LevelDB compaction to the
>> >>> > replicated log truncation process.
>> >>> >
>> >>> > Motivation
>> >>> >
>> >>> > Mesos Master and Aurora Scheduler use the replicated log to persist
>> >>> > information about the cluster. This log is periodically truncated to
>> >>> > prune outdated entries. However, the replicated log storage is not
>> >>> > compacted and grows without bound. This leads to problems like a
>> >>> > synchronous failover of all master/scheduler replicas because all of
>> >>> > them ran out of disk space.
>> >>> >
>> >>> > The only time log storage compaction happens is during recovery.
>> >>> > Because of that, periodic failovers are required to control the
>> >>> > growth of the replicated log storage. But this solution is
>> >>> > suboptimal: failovers are not instant. For example, Aurora Scheduler
>> >>> > needs to recover its storage, which depending on the cluster can
>> >>> > take several minutes. During the downtime tasks cannot be
>> >>> > (re-)scheduled and users cannot interact with the service.
>> >>> >
>> >>> > Proposal
>> >>> >
>> >>> > In MESOS-184 John Sirois pointed out that our usage pattern doesn't
>> >>> > work well with LevelDB's background compaction algorithm.
>> >>> > Fortunately, LevelDB provides a way to force compaction with the
>> >>> > DB::CompactRange() method.
>> >>> > Replicated log storage can trigger it after persisting the learned
>> >>> > TRUNCATE action and deleting the truncated log positions. The
>> >>> > compacted range will be from the previous first position of the log
>> >>> > to the new first position (the one the log was truncated up to).
>> >>> >
>> >>> > Performance impact
>> >>> >
>> >>> > Mesos Master and Aurora Scheduler have two different replicated log
>> >>> > usage profiles. For Mesos Master, every registry update (agent
>> >>> > (re-)registration/marking, maintenance schedule update, etc.)
>> >>> > induces writing a complete snapshot, which depending on the cluster
>> >>> > size can get pretty big (in a scale-test fake cluster with 55k
>> >>> > agents it is ~15MB). Every snapshot is followed by a truncation of
>> >>> > all previous entries, which doesn't block the registrar and happens
>> >>> > more or less in the background. In the scale-test cluster with 55k
>> >>> > agents, compactions after such truncations take ~680ms.
>> >>> >
>> >>> > To reduce the performance impact for the Master, compaction can be
>> >>> > triggered only after more than some configurable number of keys have
>> >>> > been deleted.
>> >>> >
>> >>> > Aurora Scheduler writes incremental changes of its storage to the
>> >>> > replicated log. Every hour a storage snapshot is created and
>> >>> > persisted to the log, followed by a truncation of all entries
>> >>> > preceding the snapshot. Therefore, storage compactions will be
>> >>> > infrequent but will deal with a potentially large number of keys. In
>> >>> > the scale-test cluster such compactions took ~425ms each.
>> >>> >
>> >>> > Please let me know what you think about it.
>> >>> >
>> >>> > Thanks!
>> >>> >
>> >>> > --
>> >>> > Ilya Pronin
>
> --
> Judith Malnick
> Community Manager
> 310-709-1517
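For posterity, here is a minimal sketch of the truncate-then-compact step the proposal describes, including the threshold-based gating mentioned for the Master. Only leveldb::DB::Write(), leveldb::WriteBatch, and leveldb::DB::CompactRange() are real LevelDB API calls; the key encoding, the encode() helper, truncateAndCompact(), and compactionThreshold are illustrative assumptions, not taken from the Mesos code:

    #include <cstdint>
    #include <cstdio>
    #include <string>

    #include <leveldb/db.h>
    #include <leveldb/write_batch.h>

    // Hypothetical key encoding: fixed-width decimal, so that the
    // lexicographic order of keys matches the numeric order of log
    // positions. (The encoding used by the actual replicated log storage
    // may differ.)
    static std::string encode(uint64_t position) {
      char key[21];
      std::snprintf(key, sizeof(key), "%020llu",
                    static_cast<unsigned long long>(position));
      return std::string(key);
    }

    // Deletes log positions [first, upTo) after a learned TRUNCATE action,
    // then compacts exactly the deleted range: from the previous first
    // position of the log to the new first position. `compactionThreshold`
    // is the configurable knob discussed above: skip the relatively
    // expensive CompactRange() call unless enough keys were deleted.
    leveldb::Status truncateAndCompact(
        leveldb::DB* db,
        uint64_t first,    // Previous first position of the log.
        uint64_t upTo,     // New first position (truncated up to).
        uint64_t compactionThreshold) {
      leveldb::WriteBatch batch;
      for (uint64_t position = first; position < upTo; ++position) {
        batch.Delete(encode(position));
      }

      leveldb::Status status = db->Write(leveldb::WriteOptions(), &batch);
      if (!status.ok()) {
        return status;
      }

      if (upTo - first > compactionThreshold) {
        // Keep the key strings alive for the duration of the call:
        // a leveldb::Slice does not own its data.
        const std::string beginKey = encode(first);
        const std::string endKey = encode(upTo);
        const leveldb::Slice begin(beginKey);
        const leveldb::Slice end(endKey);
        db->CompactRange(&begin, &end);
      }

      return status;
    }

Note that CompactRange(nullptr, nullptr) would compact the entire database; restricting the range to just the truncated positions is what keeps each pause bounded (the ~680ms and ~425ms figures from the scale tests above).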