I'm fairly ignorant of some of the practicalities here - if you don't write to /mnt/mesos/sandbox, where do files land? Some other ephemeral directory that dies with the container?
-=Bill

On Thu, Apr 9, 2015 at 7:11 AM, Hussein Elgridly <[email protected]> wrote:

> Thanks, that's helpful. I've also just discovered that Thermos only
> monitors disk usage in the sandbox location, so if we launch a Docker job
> and write anywhere that's not /mnt/mesos/sandbox, we can exceed our disk
> quota. I can work around this by turning our scratch space directories into
> symlinks located under the sandbox, though.
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 8 April 2015 at 19:43, Zameer Manji <[email protected]> wrote:
>
> > Hey,
> >
> > The deletion of sandbox directories is done by the Mesos slave, not the GC
> > executor. You will have to ask the Mesos devs about the relationship
> > between low disk and sandbox deletion.
> >
> > The executor enforces disk usage by running `du` in the background
> > periodically. I suspect that in your case your process fails before the
> > executor notices the disk usage has been exceeded and marks the task as
> > failed. This explains why the disk usage message is not there.
> >
> > I'm not sure why the finalizers are not running, but you should note that
> > they are best-effort by the executor. The executor won't be able to run
> > them if Mesos tears down the container from underneath it, for example.
> >
> > On Mon, Apr 6, 2015 at 10:30 AM, Hussein Elgridly <[email protected]> wrote:
> >
> > > Hi folks,
> > >
> > > I've just had my first task fail due to exceeding disk capacity, and I've
> > > run into some strange behaviour.
> > >
> > > It's a Java process that's running inside a Docker container specified in
> > > the task config. The Java process is failing with java.io.IOException: No
> > > space left on device when attempting to write a file.
> > >
> > > Three things are (or aren't) then happening which I think are just plain
> > > wrong:
> > >
> > > 1. The task is being marked as failed (good!) but isn't reporting that it
> > > exceeded disk limits (bad). I was expecting to see the "Disk limit
> > > exceeded. Reserved X bytes vs used Y bytes." message, but neither the
> > > Mesos nor Aurora web interfaces are telling me this.
> > > 2. The task's sandbox directory is being nuked. All of it, immediately.
> > > It was there while the job was running and vanished as soon as it failed
> > > (I happened to be watching it live). This makes debugging difficult, and
> > > the Aurora/Thermos web UI clearly has trouble because it reports the
> > > resource requests as all zero when they most definitely weren't.
> > > 3. Finalizers aren't running. No finalizers = no error log = no
> > > debugging = sadface. :(
> > >
> > > I think what's actually happening here is that the process is running out
> > > of disk on the machine itself and that the IOException is propagating up
> > > from the kernel, rather than Mesos killing the process from its disk
> > > usage monitoring.
> > >
> > > As such, we're going to try configuring the Mesos slaves with
> > > --resources='disk:some_smaller_value' to leave a little overhead, in the
> > > hope that the Mesos disk monitor catches the overusage before the process
> > > attempts to claim the last free block on disk.
> > >
> > > I don't know why it'd be nuking the sandbox, though. And is the GC
> > > executor more aggressive about cleaning out old sandbox directories if
> > > the disk is low on free space?
> > >
> > > If it helps, we're on Aurora commit
> > > 2bf03dc5eae89b1e40bfd47683c54c185c78a9d3.
> > >
> > > Thanks,
> > >
> > > Hussein Elgridly
> > > Senior Software Engineer, DSDE
> > > The Broad Institute of MIT and Harvard
> > >
> > --
> > Zameer Manji
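
For anyone who wants to try the symlink approach Hussein mentions, here's a rough
sketch of what it might look like in an .aurora config. This is only an illustration,
not something pulled from a real cluster: the process names, Docker image, paths, and
resource numbers are all hypothetical. The idea is just to keep the scratch data
physically under /mnt/mesos/sandbox (which Thermos measures with du) while exposing it
at the path the application expects:

    # Hypothetical sketch: keep scratch data under the sandbox so Thermos's
    # du-based accounting sees it, and expose it at the path the app expects.
    link_scratch = Process(
      name = 'link_scratch',
      cmdline = 'mkdir -p /mnt/mesos/sandbox/scratch && '
                'ln -sfn /mnt/mesos/sandbox/scratch /tmp/scratch'
    )

    run_app = Process(
      name = 'run_app',
      # Hypothetical application; it only knows about /tmp/scratch.
      cmdline = 'java -jar /opt/app/app.jar --scratch-dir=/tmp/scratch'
    )

    task = SequentialTask(
      processes = [link_scratch, run_app],
      resources = Resources(cpu = 2.0, ram = 4*GB, disk = 16*GB)
    )

    jobs = [
      Job(
        cluster = 'example',
        role = 'www-data',
        environment = 'devel',
        name = 'scratch_under_sandbox',
        container = Container(docker = Docker(image = 'example/app:latest')),
        task = task
      )
    ]

The caveat from the thread still applies, of course: du only sees what actually lives
under the sandbox, so anything the container writes elsewhere stays invisible to the
accounting.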

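Zameer's point about the periodic du check also explains the race pretty well, I think.
As a toy illustration (this is not the Thermos code, and the limit and polling interval
below are made up), something shaped like this can easily miss a fast burst of writes,
so the process gets ENOSPC from the kernel before the next sample ever fires:

    #!/usr/bin/env python
    # Toy illustration of a periodic du-based disk check (not the actual
    # Thermos implementation). The limit and interval are hypothetical.

    import subprocess
    import time

    DISK_LIMIT_BYTES = 1 * 1024 ** 3   # pretend reservation: 1 GiB
    POLL_INTERVAL_SECS = 30            # pretend polling period

    def sandbox_usage_bytes(sandbox):
        # Same basic idea as the executor: shell out to du and parse the total.
        out = subprocess.check_output(['du', '-sb', sandbox])
        return int(out.split()[0])

    def monitor(sandbox):
        while True:
            used = sandbox_usage_bytes(sandbox)
            if used > DISK_LIMIT_BYTES:
                # Roughly where a "Disk limit exceeded. Reserved X bytes vs
                # used Y bytes." style failure would be raised. A burst of
                # writes between samples, or another tenant filling the host
                # disk, can trigger ENOSPC before this branch ever runs.
                print('Disk limit exceeded: reserved %d bytes vs used %d bytes'
                      % (DISK_LIMIT_BYTES, used))
                return
            time.sleep(POLL_INTERVAL_SECS)

    if __name__ == '__main__':
        monitor('/mnt/mesos/sandbox')

Which is also why the headroom idea makes sense to me: advertising a bit less disk to
Mesos via --resources on the slave widens the window in which the poll can catch the
overage before the filesystem itself fills up.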