(Ian, CC-ing you because you seemed curious about what was up last week)

Hey Vinod, thanks for the quick response! I had been mistakenly thinking that --disk_watch_interval was related to isolation instead of GC.
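(In case it helps anyone else following along, here's my mental model of the expedited GC as a rough Python sketch. The real logic is C++ inside the Mesos slave; the function name and the flooring-at-zero behavior below are my assumptions, with the constants and the 16.8-hour example taken from Vinod's explanation quoted below.)

```python
GC_DISK_HEADROOM = 0.1  # per Vinod, hardcoded in the slave

def expedited_gc_delay(gc_delay_secs, disk_usage):
    """Rough model: scale the configured --gc_delay down as disk fills.

    At `disk_usage` utilization, the effective delay becomes
    (1 - GC_DISK_HEADROOM - disk_usage) * gc_delay, floored at zero
    (the flooring is my assumption for usage beyond the headroom line).
    """
    fraction = max(0.0, 1.0 - GC_DISK_HEADROOM - disk_usage)
    return fraction * gc_delay_secs

WEEK = 7 * 24 * 3600

# Vinod's example: at 80% usage, a 1-week delay shrinks to 16.8 hours.
print(expedited_gc_delay(WEEK, 0.80) / 3600)  # ~16.8
```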
We have been running a bunch of long-running services and a few cron jobs on our 6 node (c1.xlarge) cluster. We added a large number of additional cron jobs on the 20th, which maxed out our available resources. I added 3 more slaves and things seemed to be happy. Since the original 6 slaves were mostly allocated to the long-running services, the 3 new slaves ended up handling most of the cron tasks. Not sure what your definition of a very high rate is, but the new jobs were starting 20 new tasks/min max.

The disk space errors happened on all three new slaves within a few hours of each other early the next morning. Grepping the slave logs as you suggested showed that during the last 24 hours, each slave's disk usage steadily increased 1% every ~10 minutes until it hit ~76% disk usage. The disk space error occurs when starting a new task right after (10-20 sec) the last "Current usage" report.

The growing disk usage makes sense because most of the cron tasks had a large slug, and we were being lazy about cleaning up (under the assumption that the GC would do it for us), but all three slaves erroring out at 76% disk usage (on a 1.7TB mount) seems a little suspect.

Have you (or anyone else on the list) seen anything like this before? Any advice on what to do to diagnose this further?

Thanks!
Tom

On Thu, Dec 26, 2013 at 2:26 PM, Vinod Kone <vinodk...@gmail.com> wrote:

> Hi Thomas,
>
> The GC in mesos slave works as follows:
>
> --> Whenever an executor terminates, its sandbox directory is scheduled
> for gc for "--gc_delay" seconds into the future by the slave.
>
> --> However the slave also periodically ("--disk_watch_interval") monitors
> the disk utilization and expedites the gc based on the usage.
>
> For example, if gc_delay is 1 week and the current disk utilization is 80%,
> then instead of waiting for a week to gc a terminated executor's sandbox,
> the slave gc'es it after 16.8 hours (= (1 - GC_DISK_HEADROOM - 0.8) *
> 7 days).
> GC_DISK_HEADROOM is currently set to 0.1.
>
> However it might happen that executors are getting launched (and sandboxes
> created) at a very high rate. In this case the slave might not be able to
> react quickly enough to gc sandboxes.
>
> You could grep for "Current usage" in the slave log to see how the disk
> utilization varies over time.
>
> HTH,
>
>
> On Thu, Dec 26, 2013 at 10:56 AM, Thomas Petr <tp...@hubspot.com> wrote:
>
>> Hi,
>>
>> We're running Mesos 0.14.0-rc4 on CentOS from the mesosphere repository.
>> Last week we had an issue where the mesos-slave process died due to
>> running out of disk space. [1]
>>
>> The mesos-slave usage docs mention the "[GC] delay may be shorter
>> depending on the available disk usage." Does anyone have any insight into
>> how the GC logic works? Is there a configurable threshold percentage or
>> amount that will force it to clean up more often?
>>
>> If the mesos-slave process is going to die due to lack of disk space,
>> would it make sense for it to attempt one last GC run before giving up?
>>
>> Thanks,
>> Tom
>>
>>
>> [1]
>> Could not create logging file: No space left on device
>> COULD NOT CREATE A LOGGINGFILE 20131221-120618.20562!F1221
>> 12:06:18.978813 20567 paths.hpp:333] CHECK_SOME(mkdir): Failed to create
>> executor directory
>> '/usr/share/hubspot/mesos/slaves/201311111611-3792629514-5050-11268-18/frameworks/Singularity11/executors/singularity-ContactsHadoopDynamicListSegJobs-contacts-wal-dynamic-list-seg-refresher-1387627577839-1-littleslash-us_east_1e/runs/457a8df0-baa7-4d22-a5ac-ba5935ea6032'
>> No space left on device
>> *** Check failure stack trace: ***
>> I1221 12:06:19.008946 20564 cgroups_isolator.cpp:1275] Successfully
>> destroyed cgroup
>> mesos/framework_Singularity11_executor_singularity-ContactsTasks-parallel-machines:6988:list-intersection-count:1387565552709-1387627447707-1-littleslash-us_east_1e_tag_fc028903-d303-468d-902a-dade8c22e206
>> @ 0x7f2c806bcb5d google::LogMessage::Fail()
>> @ 0x7f2c806c0b77 google::LogMessage::SendToLog()
>> @ 0x7f2c806be9f9 google::LogMessage::Flush()
>> @ 0x7f2c806becfd google::LogMessageFatal::~LogMessageFatal()
>> @ 0x40f6cf _CheckSome::~_CheckSome()
>> @ 0x7f2c804492e3 mesos::internal::slave::paths::createExecutorDirectory()
>> @ 0x7f2c80418a6d mesos::internal::slave::Framework::launchExecutor()
>> @ 0x7f2c80419dd3 mesos::internal::slave::Slave::_runTask()
>> @ 0x7f2c8042d5d1 std::tr1::_Function_handler<>::_M_invoke()
>> @ 0x7f2c805d3ae8 process::ProcessManager::resume()
>> @ 0x7f2c805d3e8c process::schedule()
>> @ 0x7f2c7fe41851 start_thread
>> @ 0x7f2c7e78794d clone
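P.S. For anyone checking the arithmetic, here's a quick back-of-the-envelope on the numbers in this thread (~1% growth every ~10 minutes on a 1.7TB mount, erroring out at ~76% usage). All figures come from the messages above, rounded; I'm assuming 1 TB = 1e12 bytes, and none of this is from the Mesos source.

```python
# Sanity check of the disk numbers reported in this thread.
disk_bytes = 1.7e12            # 1.7 TB mount
growth_per_interval = 0.01     # usage grew ~1% ...
interval_secs = 600            # ... every ~10 minutes
failure_usage = 0.76           # slaves errored out at ~76% usage

# Sustained write rate implied by 1%/10min on a 1.7 TB disk:
fill_rate_mb_per_sec = growth_per_interval * disk_bytes / interval_secs / 1e6

# Free space still available when "No space left on device" hit:
free_bytes_at_failure = (1.0 - failure_usage) * disk_bytes

# Time it would have taken to actually fill the disk from 76% at that rate:
hours_to_full = (1.0 - failure_usage) / growth_per_interval * interval_secs / 3600

print(round(fill_rate_mb_per_sec, 1))      # 28.3 (MB/s sustained)
print(round(free_bytes_at_failure / 1e9))  # 408 (GB still free)
print(round(hours_to_full, 1))             # 4.0 (hours of headroom left)
```

So at the observed fill rate there was still roughly 400 GB / 4 hours of room when the ENOSPC fired, which is why failing at 76% on all three slaves looks suspect rather than like genuine disk exhaustion.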