Re: cgroups memory isolation
Eric pointed out that I had a typo in the instance type -- it's a c3.8xlarge (containing SSDs, which could make a difference here).

On Wed, Jun 18, 2014 at 10:36 AM, Thomas Petr tp...@hubspot.com wrote:

Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32 kernel. I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got some odd results. I initially gave the task 256 MB, and it never exceeded the memory allocation (I killed the task manually after 5 minutes, when the file hit 50 GB). Then I noticed your example was 128 MB, so I resized and tried again. It exceeded memory (https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82) almost immediately. The next (replacement) task our framework started ran successfully and never exceeded memory. I watched nr_dirty and it fluctuated between 1 and 14000 while the task was running. The slave host is a c3.xlarge in EC2, if that makes a difference.

As Mesos users, we'd like an isolation strategy that isn't affected by cache this much -- it makes it harder for us to size things appropriately. Is it possible, through Mesos or cgroups itself, to make the page cache not count towards total memory consumption? If the answer is no, do you think it'd be worth looking at using Docker for isolation instead?

-Tom

On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes ian.dow...@gmail.com wrote:

Hello Thomas,

Your impression is mostly correct: the kernel will *try* to reclaim memory by writing out dirty pages before killing processes in a cgroup, but if it's unable to reclaim sufficient pages within some interval (I don't recall the value off-hand) then it will start killing things. We observed this on a 3.4 kernel, where we could overwhelm the disk subsystem and trigger an OOM. Just how quickly this happens depends on how fast you're writing compared to how fast your disk subsystem can write it out.
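A minimal way to follow nr_dirty the way Tom describes (a sketch; watch(1) is the usual tool, and the three-iteration loop here is just for illustration):

```shell
# Sample the system-wide dirty page count from /proc/vmstat once per second.
# Roughly equivalent to: watch -n1 "grep '^nr_dirty ' /proc/vmstat"
for i in 1 2 3; do
    grep '^nr_dirty ' /proc/vmstat
    sleep 1
done
```

Running this next to the dd task shows whether dirty pages are building up faster than writeback can drain them.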
A simple `dd if=/dev/zero of=lotsazeros bs=1M` when contained in a memory cgroup will fill the cache quickly, reach its limit, and get OOM-killed. We were not able to reproduce this under 3.10 and 3.11 kernels. Which kernel are you using?

Example, under 3.4:

[idownes@hostname tmp]$ cat /proc/self/cgroup
6:perf_event:/
4:memory:/test
3:freezer:/
2:cpuacct:/
1:cpu:/
[idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
134217728
[idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
Killed
[idownes@hostname tmp]$ ls -lah lotsazeros
-rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros

You can also look at nr_dirty in /proc/vmstat to see how many dirty pages there are (system-wide). If you wrote at a rate sustainable by your disk subsystem, you would see a sawtooth pattern _/|_/| ... (use something like watch) as the cgroup approached its limit and the kernel flushed dirty pages to bring it down.

This might be an interesting read: http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

Hope this helps! Please do let us know if you're seeing this on a kernel >= 3.10; otherwise it's likely a kernel issue rather than something with Mesos.

Thanks,
Ian

On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr tp...@hubspot.com wrote:

Hello,

We're running Mesos 0.18.0 with cgroups isolation, and have run into situations where lots of file I/O causes tasks to be killed due to exceeding memory limits. Here's an example: https://gist.github.com/tpetr/ce5d80a0de9f713765f0

We were under the impression that if cache was using a lot of memory, it would be reclaimed *before* the OOM killer decides to kill the task. Is this accurate? We also found MESOS-762 while trying to diagnose -- could this be a regression?

Thanks,
Tom
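The linked article is about the vm.dirty_* sysctls, which control when the kernel starts writing dirty pages out. A sketch of lowering them so writeback kicks in earlier (the values are illustrative only, not a recommendation, and the commands require root):

```shell
# Start background writeback at 5% of RAM and block writers at 10%
# (illustrative values; defaults on many distros are around 10/20).
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# To persist across reboots, add the same keys to /etc/sysctl.conf:
#   vm.dirty_background_ratio = 5
#   vm.dirty_ratio = 10
```

Note these are system-wide knobs; they shape how fast the cache drains but don't change how the memory cgroup accounts for page cache.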
Re: Failed to perform recovery: Incompatible slave info detected
Thanks, it might be worth correcting the docs in that case, then. This URL says it'll use the system hostname, not the reverse DNS of the ip argument: http://mesos.apache.org/documentation/latest/configuration/

Re: the CFS thing -- this was while running Docker on the slaves. That also uses cgroups, so maybe resources were getting split with Mesos or something? (I'm still reading up on cgroups.) It definitely wasn't the case until CFS was enabled.

On 18 June 2014 18:34, Vinod Kone vinodk...@gmail.com wrote:

Hey Dick,

Regarding slave recovery, any change in the SlaveInfo (see mesos.proto) is considered a new slave, and hence recovery doesn't proceed. This is because the Master caches SlaveInfo and it is quite complex to reconcile differences in SlaveInfo, so we decided to fail on any SlaveInfo change for now.

In your particular case, https://issues.apache.org/jira/browse/MESOS-672 was committed in 0.18.0, which fixed redirection of the WebUI. Included in this fix is https://reviews.apache.org/r/17573/, which changed how SlaveInfo.hostname is calculated. Since you are not providing a hostname via the --hostname flag, the slave now deduces the hostname from the --ip flag. It looks like in your cluster the hostname corresponding to that IP is different from what 'os::hostname()' gives.

A couple of options to move forward: if you want slave recovery, provide a --hostname that matches the previous hostname. If you don't care about recovery, just remove the meta directory (rm -rf /var/mesos/meta) so that the slave starts as a fresh one (since you are not using cgroups, you will have to manually kill any old executors/tasks that are still alive on the slave).

Not sure about your comment on CFS. Enabling CFS shouldn't change how much memory the slave sees as available. More details/logs would help diagnose the issue.
HTH,

On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies d...@hellooperator.net wrote:

Should have said, the CLI for this is:

/usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos

(Note the IP is specified and the hostname is not -- the docs indicated the hostname arg will default to the FQDN of the host, but it appears to be using the value passed as 'ip' instead.)

On 18 June 2014 12:00, Dick Davies d...@hellooperator.net wrote:

Hi, we recently bumped 0.17.0 -> 0.18.2 and the slaves now show their IPs rather than their FQDNs in the Mesos UI. This broke slave recovery with the error:

Failed to perform recovery: Incompatible slave info detected

cpu, mem, disk, and ports are all the same, and so is the 'id' field. The only things that have changed are the 'hostname' and webui_hostname arguments (the CLI we're passing in is exactly the same as it was on 0.17.0, so presumably this is down to a change in Mesos conventions). I've had similar issues enabling CFS in test environments (slaves show less free memory and refuse to recover). Is the 'id' field not enough to uniquely identify a slave?
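Vinod's first option, applied to the CLI above, would look something like this (the hostname value is illustrative -- it should be whatever FQDN the slave registered with under 0.17.0):

```shell
# Pin the hostname so SlaveInfo matches what the master cached pre-upgrade.
/usr/local/sbin/mesos-slave \
    --master=zk://10.10.10.105:2181/mesos \
    --log_dir=/var/log/mesos \
    --ip=10.10.10.101 \
    --hostname=slave1.example.com \
    --work_dir=/var/mesos

# Or, to give up on recovery and register as a fresh slave instead:
# rm -rf /var/mesos/meta
```

With --hostname pinned, the 0.18.x slave no longer derives the hostname from --ip, so the recovered SlaveInfo compares equal and recovery proceeds.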
Re: Failed to perform recovery: Incompatible slave info detected
Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing flags/documentation.

On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies d...@hellooperator.net wrote:

[quoted text identical to the previous message in this thread snipped]