I'm fairly ignorant of some of the practicalities here - if you don't write
to /mnt/mesos/sandbox, where do the files land? Some other ephemeral directory
that dies with the container?

-=Bill

On Thu, Apr 9, 2015 at 7:11 AM, Hussein Elgridly <[email protected]> wrote:

> Thanks, that's helpful. I've also just discovered that Thermos only
> monitors disk usage in the sandbox location, so if we launch a Docker job
> and write to anywhere that's not /mnt/mesos/sandbox, we can exceed our disk
> quota. I can work around this by turning our scratch space directories into
> symlinks located under the sandbox, though.
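>
> Roughly, the idea is to create the symlink before the main process starts,
> so everything written to our scratch paths lands under the sandbox and gets
> counted by the monitor. A minimal sketch in the .aurora config (the process
> name and /working/scratch are placeholders, not our real paths):
>
> link_scratch = Process(
>   name = 'link_scratch',
>   # /working/scratch is a placeholder for our actual scratch directory
>   cmdline = 'mkdir -p /mnt/mesos/sandbox/scratch && ln -sfn /mnt/mesos/sandbox/scratch /working/scratch'
> )
>
> ordered ahead of the main process via order() in the Task's constraints.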
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 8 April 2015 at 19:43, Zameer Manji <[email protected]> wrote:
>
> > Hey,
> >
> > The deletion of sandbox directories is done by the Mesos slave not the GC
> > executor. You will have to ask Mesos devs on the relationship between low
> > disk and sandbox deletion.
> >
> > The executor enforces disk usage by running `du` in the background
> > periodically. I suspect in your case your process fails before the
> > executor notices the disk usage has been exceeded and marks the task as
> > failed. This explains why the disk usage message is not there.
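> >
> > To make that concrete, the mechanism is conceptually something like the
> > sketch below -- this is just an illustration of the approach, not the
> > actual Thermos resource monitor code:
> >
> > # Illustrative only: periodically run `du` over the sandbox and flag the
> > # task once usage exceeds the reservation.
> > import subprocess
> > import time
> >
> > def sandbox_usage_bytes(sandbox):
> >     # `du -sk` reports kilobytes used under the sandbox directory
> >     out = subprocess.check_output(['du', '-sk', sandbox])
> >     return int(out.split()[0]) * 1024
> >
> > def monitor(sandbox, reserved_bytes, interval_secs=60):
> >     while True:
> >         used = sandbox_usage_bytes(sandbox)
> >         if used > reserved_bytes:
> >             raise RuntimeError('Disk limit exceeded. Reserved %d bytes vs '
> >                                'used %d bytes.' % (reserved_bytes, used))
> >         time.sleep(interval_secs)
> >
> > Because the check only runs every so often, a process that fills the disk
> > between two passes can die on its own before the monitor ever flags it,
> > which matches what you're seeing.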
> >
> > I'm not sure why the finalizers are not running, but you should note that
> > they are best effort by the executor. The executor won't be able to run
> > them if Mesos tears down the container from underneath it for example.
> >
> > On Mon, Apr 6, 2015 at 10:30 AM, Hussein Elgridly <[email protected]> wrote:
> >
> > > Hi folks,
> > >
> > > I've just had my first task fail due to exceeding disk capacity, and
> > > I've run into some strange behaviour.
> > >
> > > It's a Java process that's running inside a Docker container specified
> > > in the task config. The Java process is failing with
> > > java.io.IOException: No space left on device when attempting to write a
> > > file.
> > >
> > > Three things are (or aren't) then happening which I think are just
> > > plain wrong:
> > >
> > > 1. The task is being marked as failed (good!) but isn't reporting that
> > > it exceeded disk limits (bad). I was expecting to see the "Disk limit
> > > exceeded.  Reserved X bytes vs used Y bytes." message, but neither the
> > > Mesos nor Aurora web interfaces are telling me this.
> > > 2. The task's sandbox directory is being nuked. All of it, immediately.
> > > It was there while the job was running and vanished as soon as it
> > > failed (I happened to be watching it live). This makes debugging
> > > difficult, and the Aurora/Thermos web UI clearly has trouble because it
> > > reports the resource requests as all zero when they most definitely
> > > weren't.
> > > 3. Finalizers aren't running. No finalizers = no error log = no
> > > debugging = sadface. :(
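> > >
> > > For reference, our finalizers are declared roughly like the sketch
> > > below (the process name and copy destination are placeholders, not our
> > > real config):
> > >
> > > save_logs = Process(
> > >   name = 'save_logs',
> > >   # placeholder destination for the logs we want to keep
> > >   cmdline = 'cp -r *.log /archive/location/',
> > >   final = True
> > > )
> > >
> > > so even a simple log-copying step like that never gets a chance to run.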
> > >
> > > I think what's actually happening here is that the process is running
> > > out of disk on the machine itself and that IOException is propagating
> > > up from the kernel, rather than Mesos killing the process from its disk
> > > usage monitoring.
> > >
> > > As such, we're going to try configuring the Mesos slaves with
> > > --resources='disk:some_smaller_value' to leave a little overhead in the
> > > hope that the Mesos disk monitor catches the overusage before the
> > > process attempts to claim the last free block on disk.
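> > >
> > > For example (numbers made up, and assuming Mesos disk resources are
> > > expressed in MB), a slave with roughly 500 GB of usable disk might be
> > > started with something like
> > >
> > >   mesos-slave --resources='disk:450000'
> > >
> > > so the disk monitor hits the advertised limit while ~50 GB of real
> > > space is still free.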
> > >
> > > I don't know why it'd be nuking the sandbox, though. And is the GC
> > > executor more aggressive about cleaning out old sandbox directories if
> > > the disk is low on free space?
> > >
> > > If it helps, we're on Aurora commit
> > > 2bf03dc5eae89b1e40bfd47683c54c185c78a9d3.
> > >
> > > Thanks,
> > >
> > > Hussein Elgridly
> > > Senior Software Engineer, DSDE
> > > The Broad Institute of MIT and Harvard
> > >
> > > --
> > > Zameer Manji
> > >
> > >
> >
>
