I'm not sure if this is at all related to the issue you're seeing, but we ran
into this fun issue (or at least it seems to be the cause), helpfully
documented in this blog article:
http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html

TL;DR: the OOM killer gets into an infinite loop, causing the CPU to spin out
of control on our VMs.

More details are in this commit message to the OOM killer from earlier this year:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0c740d0afc3bff0a097ad03a1c8df92757516f5c
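
If you want a quick way to check whether a box is in that state, something
along these lines should be enough (nothing Mesos-specific is assumed here):

# kernel version, to compare against the fix referenced above
uname -r

# any OOM killer activity recorded so far?
dmesg | grep -iE 'out of memory|oom-killer|killed process'

# is a large share of CPU time being spent in the kernel (high "sy" column)?
vmstat 1 5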

Hope this helps somewhat...

On 26 September 2014 14:15, Tomas Barton <barton.to...@gmail.com> wrote:

> Just to make sure, all slaves are running with:
>
> --isolation='cgroups/cpu,cgroups/mem'
>
> Is there anything suspicious in the Mesos slave logs?
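>
> A quick way to double-check the flags on a running box (assuming the slave
> process is named mesos-slave; adjust if you launch it differently):
>
> # print the full command line of the slave and look for the isolation flag
> ps -ww -C mesos-slave -o args | grep -- --isolation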
>
> On 26 September 2014 13:20, Stephan Erb <stephan....@blue-yonder.com>
> wrote:
>
>>  Hi everyone,
>>
>> I am having issues with the cgroups isolation of Mesos. It seems like
>> tasks are prevented from allocating more memory than their limit. However,
>> they are never killed.
>>
>>    - My scheduled task allocates memory in a tight loop. According to
>>    'ps', once it exceeds its memory limit it is not killed, but ends up
>>    in state D ("uninterruptible sleep (usually IO)").
>>    - The task is still considered running by Mesos.
>>    - There is no indication of an OOM in dmesg.
>>    - There is neither an OOM notice nor any other output related to the
>>    task in the slave log.
>>    - According to htop, the system load is increased, with a significant
>>    portion of CPU time spent within the kernel. Commonly the load is so high
>>    that all ZooKeeper connections time out.
>>
>> I am running Aurora and Mesos 0.20.1 using the cgroups isolation on
>> Debian 7 (kernel 3.2.60-1+deb7u3).
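>>
>> If it helps, this is roughly how I would inspect the container's memory
>> cgroup directly (assuming the default /sys/fs/cgroup/memory mount and Mesos'
>> default 'mesos' cgroup root; <container-id> below is just a placeholder):
>>
>> # list the per-container cgroups created by the slave
>> ls /sys/fs/cgroup/memory/mesos/
>>
>> # limit, current and peak usage, and OOM state of one container
>> cd /sys/fs/cgroup/memory/mesos/<container-id>   # placeholder id
>> cat memory.limit_in_bytes memory.usage_in_bytes memory.max_usage_in_bytes
>> cat memory.oom_control   # oom_kill_disable should be 0; under_oom flags a pending OOM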
>>
>> Sorry for the somewhat unspecific error description. Still, does anyone have
>> an idea what might be wrong here?
>>
>> Thanks and Best Regards,
>> Stephan
>>
>
>
