[ https://issues.apache.org/jira/browse/MESOS-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15516371#comment-15516371 ]
Ian Babrou commented on MESOS-6118:
-----------------------------------

I had to rework your patch a bit to apply on top of master. I then built 1.0.1 with the resulting fs.cpp:

{noformat}
Sep 23 12:56:49 36com72 mesos-agent[15633]: Failed to perform recovery: Collect failed: Unable to unmount volumes for Docker container '5ec94354-f785-4d13-b3ef-fb1a37eac007': Failed to get mount table: Cycle found in mount table hierarchy through entry '1': 1 1 0:2 / / rw shared:1 - rootfs rootfs rw,size=65513288k,nr_inodes=16378322
Sep 23 12:56:49 36com72 mesos-agent[15633]: 17 1 0:17 / /sys rw,nosuid,nodev,noexec,relatime shared:2 - sysfs sysfs rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 18 1 0:5 / /proc rw,nosuid,nodev,noexec,relatime shared:7 - proc proc rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 19 1 0:6 / /dev rw,nosuid shared:8 - devtmpfs devtmpfs rw,size=65513304k,nr_inodes=16378326,mode=755
Sep 23 12:56:49 36com72 mesos-agent[15633]: 20 17 0:18 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:3 - securityfs securityfs rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 21 17 0:16 / /sys/fs/selinux rw,relatime shared:4 - selinuxfs selinuxfs rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 22 19 0:19 / /dev/shm rw,nosuid,nodev shared:9 - tmpfs tmpfs rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 23 19 0:13 / /dev/pts rw,nosuid,noexec,relatime shared:10 - devpts devpts rw,gid=5,mode=620,ptmxmode=000
Sep 23 12:56:49 36com72 mesos-agent[15633]: 24 1 0:20 / /run rw,nosuid,nodev shared:11 - tmpfs tmpfs rw,mode=755
Sep 23 12:56:49 36com72 mesos-agent[15633]: 25 24 0:21 / /run/lock rw,nosuid,nodev,noexec,relatime shared:12 - tmpfs tmpfs rw,size=5120k
Sep 23 12:56:49 36com72 mesos-agent[15633]: 26 17 0:22 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:5 - tmpfs tmpfs ro,mode=755
Sep 23 12:56:49 36com72 mesos-agent[15633]: 27 26 0:23 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:6 - cgroup cgroup rw,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd
Sep 23 12:56:49 36com72 mesos-agent[15633]: 28 26 0:24 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,cpuset
Sep 23 12:56:49 36com72 mesos-agent[15633]: 29 26 0:25 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,cpu,cpuacct
Sep 23 12:56:49 36com72 mesos-agent[15633]: 30 26 0:26 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,blkio
Sep 23 12:56:49 36com72 mesos-agent[15633]: 31 26 0:27 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,memory
Sep 23 12:56:49 36com72 mesos-agent[15633]: 32 26 0:28 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,devices
Sep 23 12:56:49 36com72 mesos-agent[15633]: 33 26 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,freezer
Sep 23 12:56:49 36com72 mesos-agent[15633]: 34 26 0:30 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,net_cls,net_prio
Sep 23 12:56:49 36com72 mesos-agent[15633]: 35 26 0:31 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,perf_event
Sep 23 12:56:49 36com72 mesos-agent[15633]: 36 26 0:32 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,hugetlb
Sep 23 12:56:49 36com72 mesos-agent[15633]: 37 26 0:33 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:22 - cgroup cgroup rw,pids
Sep 23 12:56:49 36com72 mesos-agent[15633]: 38 18 0:34 / /proc/sys/fs/binfmt_misc rw,relatime shared:23 - autofs systemd-1 rw,fd=22,pgrp=1,timeout=300,minproto=5,maxproto=5,direct
Sep 23 12:56:49 36com72 mesos-agent[15633]: 39 19 0:35 / /dev/hugepages rw,relatime shared:24 - hugetlbfs hugetlbfs rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 40 17 0:8 / /sys/kernel/debug rw,relatime shared:25 - debugfs debugfs rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 41 19 0:15 / /dev/mqueue rw,relatime shared:26 - mqueue mqueue rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: 42 1 9:127 / /state rw,relatime shared:27 - ext4 /dev/md127 rw,stripe=384,data=ordered
Sep 23 12:56:49 36com72 mesos-agent[15633]: 43 1 0:37 / /srv rw,relatime shared:28 - nfs4 10.36.14.18:/srv/hosts/36com72 rw,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.36.23.25,local_lock=none,addr=10.36.14.18
Sep 23 12:56:49 36com72 mesos-agent[15633]: 44 1 0:37 / /srv-master rw,relatime shared:29 - nfs4 10.36.14.18:/srv rw,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.36.23.25,local_lock=none,addr=10.36.14.18
Sep 23 12:56:49 36com72 mesos-agent[15633]: 45 38 0:36 / /proc/sys/fs/binfmt_misc rw,relatime shared:30 - binfmt_misc binfmt_misc rw
Sep 23 12:56:49 36com72 mesos-agent[15633]: To remedy this do as follows:
Sep 23 12:56:49 36com72 mesos-agent[15633]: Step 1: rm -f /state/var/lib/mesos/meta/slaves/latest
Sep 23 12:56:49 36com72 mesos-agent[15633]: This ensures agent doesn't recover old live executors.
Sep 23 12:56:49 36com72 mesos-agent[15633]: Step 2: Restart the agent.
{noformat}

I'm on Debian Jessie and Linux 4.4.17.

> Agent would crash with docker container tasks due to host mount table read.
> ---------------------------------------------------------------------------
>
> Key: MESOS-6118
> URL: https://issues.apache.org/jira/browse/MESOS-6118
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 1.0.1
> Environment: Build: 2016-08-26 23:06:27 by centos
> Version: 1.0.1
> Git tag: 1.0.1
> Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> systemd version `219` detected
> Inializing systemd state
> Created systemd slice: `/run/systemd/system/mesos_executors.slice`
> Started systemd slice `mesos_executors.slice`
> Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
> Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Linux ip-10-254-192-40 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: Jamie Briant
> Assignee: Kevin Klues
> Priority: Critical
> Labels: linux, slave
> Fix For: 1.1.0, 1.0.2
>
> Attachments: crashlogfull.log, cycle2.log, cycle3.log, cycle5.log, cycle6.log, slave-crash.log
>
>
> I have a framework which schedules thousands of short-running tasks (a few
> seconds to a few minutes each) over a period of several minutes. In 1.0.1, the
> slave process crashes every few minutes (with systemd restarting it).
> Crash is:
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: F0901 20:52:23.905678 1232 fs.cpp:140] Check failed: !visitedParents.contains(parentId)
> Sep 01 20:52:23 ip-10-254-192-99 mesos-slave: *** Check failure stack trace: ***
> Version 1.0.0 works without this issue.
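
In the mount table from the comment above, the rootfs entry lists mount ID 1 with parent ID 1 (`1 1 0:2 / / ...`), which would explain why the error names entry '1'. The following is a minimal sketch of a parent-chain walk over such a table, assuming C++17. It is not the Mesos `fs.cpp` code: the function `findCycle` and the flag `selfParentIsRoot` are hypothetical names used only for illustration, and this message does not say whether the actual fix takes this approach.

```cpp
#include <map>
#include <optional>
#include <set>

// Hypothetical helper, not the Mesos implementation.
// `parentOf` maps each mount ID to its parent mount ID. The function returns
// the first mount ID whose parent chain revisits an ID it has already seen.
// If `selfParentIsRoot` is true, an entry that is its own parent is treated
// as the root of the hierarchy instead of being reported as a cycle.
std::optional<int> findCycle(
    const std::map<int, int>& parentOf,
    bool selfParentIsRoot)
{
  for (const auto& [id, parent] : parentOf) {
    std::set<int> visited = {id};
    int current = id;

    while (true) {
      auto it = parentOf.find(current);
      if (it == parentOf.end()) {
        break; // Parent is not in the table: reached the top of the chain.
      }

      int next = it->second;
      if (next == current && selfParentIsRoot) {
        break; // Self-parented entry: stop here and treat it as the root.
      }

      if (!visited.insert(next).second) {
        return id; // This ID was already seen while walking upwards.
      }

      current = next;
    }
  }

  return std::nullopt;
}
```

Fed a few ID pairs from the log (1→1, 17→1, 18→1, 20→17, 38→18, 45→38), the strict walk (`selfParentIsRoot = false`) reports entry 1, while the relaxed walk reports no cycle.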