Hi everyone, I'd like to propose adding "manual" LevelDB compaction to the replicated log truncation process.
Motivation Mesos Master and Aurora Scheduler use the replicated log to persist information about the cluster. This log is periodically truncated to prune outdated log entries. However the replicated log storage is not compacted and grows without bounds. This leads to problems like synchronous failover of all master/scheduler replicas happening because all of them ran out of disk space. The only time when log storage compaction happens is during recovery. Because of that periodic failovers are required to control the replicated log storage growth. But this solution is suboptimal. Failovers are not instant: e.g. Aurora Scheduler needs to recover the storage which depending on the cluster can take several minutes. During the downtime tasks cannot be (re-)scheduled and users cannot interact with the service. Proposal In MESOS-184 John Sirois pointed out that our usage pattern doesn’t work well with LevelDB background compaction algorithm. Fortunately, LevelDB provides a way to force compaction with DB::CompactRange() method. Replicated log storage can trigger it after persisting learned TRUNCATE action and deleting truncated log positions. The compacted range will be from previous first position of the log to the new first position (the one the log was truncated up to). Performance impact Mesos Master and Aurora Scheduler have 2 different replicated log usage profiles. For Mesos Master every registry update (agent (re-)registration/marking, maintenance schedule update, etc.) induces writing a complete snapshot which depending on the cluster size can get pretty big (in a scale test fake cluster with 55k agents it is ~15MB). Every snapshot is followed by a truncation of all previous entries, which doesn't block the registrar and happens kind of in the background. In the scale test cluster with 55k agents compactions after such truncations take ~680ms. To reduce the performance impact for the Master compaction can be triggered only after more than some configurable number of keys were deleted. Aurora Scheduler writes incremental changes of its storage to the replicated log. Every hour a storage snapshot is created and persisted to the log, followed by a truncation of all entries preceding the snapshot. Therefore, storage compactions will be infrequent but will deal with potentially large number of keys. In the scale test cluster such compactions took ~425ms each. Please let me know what you think about it. Thanks! -- Ilya Pronin