[ https://issues.apache.org/jira/browse/MESOS-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518364#comment-14518364 ]
Adam B commented on MESOS-2668:
-------------------------------

I'll cut a new rc6 once this is committed. Seems like a blocker for 0.22.1.

> mesos-slave process enters failed state with "A slave (or child process) is still running, please check the process(es) '{ ... }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc"
> ------------------------------------------------------------------------
>
>                 Key: MESOS-2668
>                 URL: https://issues.apache.org/jira/browse/MESOS-2668
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.1
>         Environment: 0.22.1-rc5
>                      CoreOS
>            Reporter: Jeremy Lingmann
>            Assignee: Timothy Chen
>
> Shortly after starting a test cluster we are seeing mesos-slaves enter a permanent failed state after a short period (~10 minutes or so). Here is the failure we are seeing with mesos-slave:
> {code}
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: Unit mesos-slave.service entered failed state.
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service failed.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service holdoff time over, scheduling restart.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Stopping Mesos Slave...
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Starting Mesos Slave...
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: PING leader.mesos (10.155.13.144) 56(84) bytes of data.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 64 bytes from ip-10-155-13-144.ec2.internal (10.155.13.144): icmp_seq=1 ttl=49 time=0.277 ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: --- leader.mesos ping statistics ---
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: rtt min/avg/max/mdev = 0.277/0.277/0.277/0.000 ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Started Mesos Slave.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.908716 9274 logging.cpp:172] INFO level logging started!
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909025 9274 main.cpp:156] Build: 2015-04-25 01:51:59 by
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909044 9274 main.cpp:158] Version: 0.22.1
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909054 9274 main.cpp:161] Git tag: 0.22.1-rc5
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:38.909065 9274 main.cpp:165] Git SHA: 13e0536a4522c5674abc920ee9b8597d83c5352a
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.012763 9274 containerizer.cpp:110] Using isolation: cgroups/cpu,cgroups/mem
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.027866 9274 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.028177 9274 main.cpp:200] Starting Mesos slave
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@716: Client environment:host.name=ip-10-229-44-239.ec2.internal
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@724: Client environment:os.arch=3.19.0
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@725: Client environment:os.version=#2 SMP Fri Mar 6 00:23:51 UTC 2015
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@733: Client environment:user.name=(null)
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@753: Client environment:user.dir=/
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=leader.mesos:2181 sessionTimeout=10000 watcher=0x7f8393fa9ae0 sessionId=0 sessionPasswd=<null> context=0x7f8364000970 flags=0
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.028657 9274 slave.cpp:174] Slave started on 1)@10.229.44.239:5051
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 23:29:39.028688 9274 slave.cpp:194] Moving slave process into its own cgroup for subsystem: cpu
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,030:9274(0x7f83877c8700):ZOO_INFO@check_events@1703: initiated connection to server [10.155.13.144:2181]
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: A slave (or child process) is still running, please check the process(es) '{ 2242, 2243, 2244 }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 23:29:39,032:9274(0x7f83877c8700):ZOO_INFO@check_events@1750: session establishment complete on server [10.155.13.144:2181], sessionId=0x274cfd1112750268, negotiated timeout=10000
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: Unit mesos-slave.service entered failed state.
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)