[ 
https://issues.apache.org/jira/browse/MESOS-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517640#comment-14517640
 ] 

Ian Downes commented on MESOS-2668:
-----------------------------------

I'm completely okay with removing the CHECK; it isn't necessary. My original
motivation was simply that I didn't expect any processes to be left in the
slave's cgroup after the slave restarted, i.e., I neglected to think about
things like perf and du. It should have been just a LOG(INFO) or LOG(WARNING)
right from the start.

> mesos-slave process enters failed state with "A slave (or child process) is 
> still running, please check the process(es) '{ ... }' listed in 
> /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc"
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2668
>                 URL: https://issues.apache.org/jira/browse/MESOS-2668
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.1
>         Environment: 0.22.1-rc5
> CoreOS 
>            Reporter: Jeremy Lingmann
>            Assignee: Timothy Chen
>
> After starting a test cluster, we see mesos-slaves enter a permanent failed
> state within a short period (about 10 minutes). Here is the failure we see
> from mesos-slave:
> {code}
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: 
> mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: Unit 
> mesos-slave.service entered failed state.
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service 
> failed.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service 
> holdoff time over, scheduling restart.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Stopping Mesos 
> Slave...
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Starting Mesos 
> Slave...
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: PING leader.mesos 
> (10.155.13.144) 56(84) bytes of data.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 64 bytes from 
> ip-10-155-13-144.ec2.internal (10.155.13.144): icmp_seq=1 ttl=49 time=0.277 ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: --- leader.mesos 
> ping statistics ---
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 1 packets 
> transmitted, 1 received, 0% packet loss, time 0ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: rtt 
> min/avg/max/mdev = 0.277/0.277/0.277/0.000 ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Started Mesos Slave.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.908716  9274 logging.cpp:172] INFO level logging started!
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909025  9274 main.cpp:156] Build: 2015-04-25 01:51:59 by
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909044  9274 main.cpp:158] Version: 0.22.1
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909054  9274 main.cpp:161] Git tag: 0.22.1-rc5
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909065  9274 main.cpp:165] Git SHA: 
> 13e0536a4522c5674abc920ee9b8597d83c5352a
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.012763  9274 containerizer.cpp:110] Using isolation: 
> cgroups/cpu,cgroups/mem
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.027866  9274 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as 
> the freezer hierarchy for the Linux launcher
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.028177  9274 main.cpp:200] Starting Mesos slave
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@716: Client 
> environment:host.name=ip-10-229-44-239.ec2.internal
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@724: Client 
> environment:os.arch=3.19.0
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@725: Client 
> environment:os.version=#2 SMP Fri Mar 6 00:23:51 UTC 2015
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@733: Client 
> environment:user.name=(null)
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@741: Client 
> environment:user.home=/root
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@zookeeper_init@786: Initiating 
> client connection, host=leader.mesos:2181 sessionTimeout=10000 
> watcher=0x7f8393fa9ae0 sessionId=0 sessionPasswd=<null> 
> context=0x7f8364000970 flags=0
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.028657  9274 slave.cpp:174] Slave started on 1)@10.229.44.239:5051
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.028688  9274 slave.cpp:194] Moving slave process into its own cgroup 
> for subsystem: cpu
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,030:9274(0x7f83877c8700):ZOO_INFO@check_events@1703: initiated 
> connection to server [10.155.13.144:2181]
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: A slave (or 
> child process) is still running, please check the process(es) '{ 2242, 2243, 
> 2244 }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,032:9274(0x7f83877c8700):ZOO_INFO@check_events@1750: session 
> establishment complete on server [10.155.13.144:2181], 
> sessionId=0x274cfd1112750268, negotiated timeout=10000
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: 
> mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: Unit 
> mesos-slave.service entered failed state.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
