Re: cgroups memory isolation

2014-06-18 Thread Thomas Petr
Eric pointed out that I had a typo in the instance type -- it's actually a
c3.8xlarge (which has SSDs, so that could make a difference here).


On Wed, Jun 18, 2014 at 10:36 AM, Thomas Petr tp...@hubspot.com wrote:

 Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
 kernel.

 I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
 some weird results. I initially gave the task 256 MB, and it never exceeded
 the memory allocation (I killed the task manually after 5 minutes when the
 file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
 tried again. It exceeded memory almost immediately
 (https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82). The next
 (replacement) task our framework started ran successfully and never
 exceeded memory. I watched nr_dirty and it fluctuated between 1 and
 14000 while the task was running. The slave host is a c3.xlarge in EC2,
 if it makes a difference.

 As Mesos users, we'd like an isolation strategy that isn't affected by
 cache this much -- it makes it harder for us to appropriately size things.
 Is it possible through Mesos or cgroups itself to make the page cache not
 count towards the total memory consumption? If the answer is no, do you
 think it'd be worth looking at using Docker for isolation instead?
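
 (For reference: on cgroups v1 you can at least see how much of a cgroup's
 charge is page cache versus anonymous memory. A minimal check, assuming a
 cgroup named "test" under the memory controller mount:

     grep -E '^(cache|rss) ' /sys/fs/cgroup/memory/test/memory.stat   # cache vs. rss breakdown
     cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes             # includes page cache

 usage_in_bytes is what the limit is enforced against, which is why the page
 cache counts towards it.)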

 -Tom


 On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes ian.dow...@gmail.com wrote:

 Hello Thomas,

 Your impression is mostly correct: the kernel will *try* to reclaim
 memory by writing out dirty pages before killing processes in a cgroup,
 but if it's unable to reclaim sufficient pages within some interval (I
 don't recall the exact value off-hand) then it will start killing things.

 We observed this on a 3.4 kernel where we could overwhelm the disk
 subsystem and trigger an oom. Just how quickly this happens depends on
 how fast you're writing compared to how fast your disk subsystem can
 write it out. A simple dd if=/dev/zero of=lotsazeros bs=1M when
 contained in a memory cgroup will fill the cache quickly, reach its
 limit and get oom'ed. We were not able to reproduce this under 3.10
 and 3.11 kernels. Which kernel are you using?

 Example: under 3.4:

 [idownes@hostname tmp]$ cat /proc/self/cgroup
 6:perf_event:/
 4:memory:/test
 3:freezer:/
 2:cpuacct:/
 1:cpu:/
 [idownes@hostname tmp]$ cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
 134217728
 [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
 Killed
 [idownes@hostname tmp]$ ls -lah lotsazeros
 -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros
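
 If you want to reproduce the above by hand, a rough sketch (assuming the
 memory controller is mounted at /sys/fs/cgroup/memory and you have root):

     mkdir /sys/fs/cgroup/memory/test
     echo $((128*1024*1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
     echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs   # move this shell into the cgroup
     dd if=/dev/zero of=lotsazeros bs=1M                 # now subject to the 128 MB limit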


 You can also look in /proc/vmstat at nr_dirty to see how many dirty
 pages there are (system wide). If you wrote at a rate sustainable by
 your disk subsystem then you would see a sawtooth pattern _/|_/| ...
 (use something like watch) as the cgroup approached its limit and the
 kernel flushed dirty pages to bring it down.
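
 For example, something along these lines (the field names are as they
 appear in /proc/vmstat):

     watch -n1 'grep -E "^(nr_dirty|nr_writeback) " /proc/vmstat'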

 This might be an interesting read:

 http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
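
 The knobs discussed there are plain sysctls, e.g. to inspect them:

     sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs

 (Lowering the ratios makes the kernel start writeback earlier; the right
 values depend entirely on your workload and disks.)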

 Hope this helps! Please do let us know if you're seeing this on a
 kernel >= 3.10, otherwise it's likely this is a kernel issue rather
 than something with Mesos.

 Thanks,
 Ian


 On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr tp...@hubspot.com wrote:
  Hello,
 
  We're running Mesos 0.18.0 with cgroups isolation, and have run into
  situations where lots of file I/O causes tasks to be killed due to
  exceeding memory limits. Here's an example:
  https://gist.github.com/tpetr/ce5d80a0de9f713765f0
 
  We were under the impression that if cache was using a lot of memory it
  would be reclaimed *before* the OOM process decides to kill the task. Is
  this accurate? We also found MESOS-762 while trying to diagnose -- could
  this be a regression?
 
  Thanks,
  Tom





Re: Failed to perform recovery: Incompatible slave info detected

2014-06-18 Thread Dick Davies
Thanks, it might be worth correcting the docs in that case then.
This URL says it'll use the system hostname, not the reverse DNS of
the ip argument:

http://mesos.apache.org/documentation/latest/configuration/

re: the CFS thing - this was while running Docker on the slaves, which
also uses cgroups, so maybe resources were getting split with Mesos or
something? (I'm still reading up on cgroups.) It definitely wasn't the
case until CFS was enabled.


On 18 June 2014 18:34, Vinod Kone vinodk...@gmail.com wrote:
 Hey Dick,

 Regarding slave recovery, any changes in the SlaveInfo (see mesos.proto) are
 considered a new slave and hence recovery doesn't proceed forward. This is
 because the Master caches SlaveInfo and it is quite complex to reconcile the
 differences in SlaveInfo. So we decided to fail on any SlaveInfo changes for
 now.

 In your particular case, https://issues.apache.org/jira/browse/MESOS-672 was
 committed in 0.18.0, which fixed redirection of the WebUI. Included in this
 fix is https://reviews.apache.org/r/17573/, which changed how
 SlaveInfo.hostname is calculated. Since you are not providing a hostname via
 the --hostname flag, the slave now deduces the hostname from the --ip flag.
 Looks like in your cluster the hostname corresponding to that IP is different
 from what 'os::hostname()' gives.
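
 A quick way to see the mismatch (the IP is taken from your CLI below; these
 are just the usual tools, nothing Mesos-specific):

     getent hosts 10.10.10.101   # reverse lookup of the --ip value
     hostname -f                 # roughly what os::hostname() returns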

 Couple of options to move forward. If you want slave recovery, provide a
 --hostname that matches the previous hostname. If you don't care about
 recovery, just remove the meta directory (rm -rf /var/mesos/meta) so that
 the slave starts as a fresh one (since you are not using cgroups, you will
 have to manually kill any old executors/tasks that are still alive on the
 slave).
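
 Concretely, the two options would look something like this (the hostname
 below is just a placeholder for whatever the slave used to report):

     # Option 1: keep recovery, pin the old hostname explicitly
     mesos-slave --master=zk://10.10.10.105:2181/mesos --ip=10.10.10.101 \
       --hostname=old-slave-fqdn.example.com --work_dir=/var/mesos

     # Option 2: give up recovery and start fresh
     rm -rf /var/mesos/meta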

 Not sure about your comment on CFS. Enabling CFS shouldn't change how much
 memory the slave sees as available. More details/logs would help diagnose
 the issue.
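
 If it helps, the resources the slave actually registered with show up both
 in its startup log and in its state endpoint (default port 5051), e.g.
 something like:

     curl -s http://10.10.10.101:5051/state.json | python -mjson.tool | grep -A4 '"resources"'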

 HTH,



 On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies d...@hellooperator.net wrote:

 Should have said, the CLI for this is :

 /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
 --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos

 (note IP is specified, hostname is not - the docs indicate the hostname arg
 will default to the FQDN of the host, but it appears to be using the value
 passed as 'ip' instead.)

 On 18 June 2014 12:00, Dick Davies d...@hellooperator.net wrote:
  Hi, we recently bumped from 0.17.0 to 0.18.2 and the slaves
  now show their IPs rather than their FQDNs in the Mesos UI.
 
  This broke slave recovery with the error:
 
  Failed to perform recovery: Incompatible slave info detected
 
 
  cpu, mem, disk, ports are all the same. so is the 'id' field.
 
  the only things that have changed are the 'hostname' and webui_hostname
  arguments (the CLI we're passing in is exactly the same as it was on
  0.17.0, so presumably this is down to a change in Mesos conventions).
 
  I've had similar issues enabling CFS in test environments (slaves show
  less free memory and refuse to recover).
 
  is the 'id' field not enough to uniquely identify a slave?




Re: Failed to perform recovery: Incompatible slave info detected

2014-06-18 Thread Vinod Kone
Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing
flags/documentation.

