[ 
https://issues.apache.org/jira/browse/MESOS-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517485#comment-14517485
 ] 

Timothy Chen commented on MESOS-2668:
-------------------------------------

We do expect `docker logs` to keep running when the slave is restarted, since 
it redirects the existing Docker container's logs to the sandbox.
`docker wait` is launched by a command executor and ideally should keep 
running under it, so that we can reattach and recover.
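
To check whether the PIDs named in the error really are leftover `docker logs` 
or `docker wait` processes, something like the following could be used. It is 
only a rough sketch: the helper names `read_pids` and `command_line` are made 
up for illustration, and it assumes the procs file lists one PID per line and 
that /proc is mounted. The procs file path is taken from the command line, so 
pass whichever path actually exists on the host.

```python
import os
import sys


def read_pids(procs_file):
    """Return the PIDs listed one per line in a cgroup procs file."""
    with open(procs_file) as f:
        return [int(line) for line in f if line.strip()]


def command_line(pid, proc_root="/proc"):
    """Return the command line of a PID, or None if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            raw = f.read()
    except OSError:
        # The process has exited or we lack permission to inspect it.
        return None
    # Arguments in /proc/<pid>/cmdline are separated by NUL bytes.
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()


if __name__ == "__main__":
    # Usage (hypothetical script name): python list_cgroup_procs.py <procs file>
    for pid in read_pids(sys.argv[1]):
        cmd = command_line(pid)
        print(pid, cmd if cmd is not None else "<not readable>")
```

If the output shows `docker logs` or `docker wait` for those PIDs, that would 
be consistent with the explanation above.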


> mesos-slave process enters failed state with "A slave (or child process) is 
> still running, please check the process(es) '{ ... }' listed in 
> /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc"
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-2668
>                 URL: https://issues.apache.org/jira/browse/MESOS-2668
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.22.1
>         Environment: 0.22.1-rc5
> CoreOS 
>            Reporter: Jeremy Lingmann
>            Assignee: Timothy Chen
>
> Shortly after starting a test cluster we are seeing mesos-slaves enter a 
> permanent failed state after a short period (~10 minutes or so). Here is the 
> failure we are seeing with mesos-slave:
> {code}
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: 
> mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: Unit 
> mesos-slave.service entered failed state.
> Apr 27 23:29:33 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service 
> failed.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: mesos-slave.service 
> holdoff time over, scheduling restart.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Stopping Mesos 
> Slave...
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Starting Mesos 
> Slave...
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: PING leader.mesos 
> (10.155.13.144) 56(84) bytes of data.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 64 bytes from 
> ip-10-155-13-144.ec2.internal (10.155.13.144): icmp_seq=1 ttl=49 time=0.277 ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: --- leader.mesos 
> ping statistics ---
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: 1 packets 
> transmitted, 1 received, 0% packet loss, time 0ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal ping[9272]: rtt 
> min/avg/max/mdev = 0.277/0.277/0.277/0.000 ms
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal systemd[1]: Started Mesos Slave.
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.908716  9274 logging.cpp:172] INFO level logging started!
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909025  9274 main.cpp:156] Build: 2015-04-25 01:51:59 by
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909044  9274 main.cpp:158] Version: 0.22.1
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909054  9274 main.cpp:161] Git tag: 0.22.1-rc5
> Apr 27 23:29:38 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:38.909065  9274 main.cpp:165] Git SHA: 
> 13e0536a4522c5674abc920ee9b8597d83c5352a
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.012763  9274 containerizer.cpp:110] Using isolation: 
> cgroups/cpu,cgroups/mem
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.027866  9274 linux_launcher.cpp:94] Using /sys/fs/cgroup/freezer as 
> the freezer hierarchy for the Linux launcher
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.028177  9274 main.cpp:200] Starting Mesos slave
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@712: Client 
> environment:zookeeper.version=zookeeper C client 3.4.5
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@716: Client 
> environment:host.name=ip-10-229-44-239.ec2.internal
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@723: Client 
> environment:os.name=Linux
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@724: Client 
> environment:os.arch=3.19.0
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@725: Client 
> environment:os.version=#2 SMP Fri Mar 6 00:23:51 UTC 2015
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@733: Client 
> environment:user.name=(null)
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@741: Client 
> environment:user.home=/root
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@log_env@753: Client 
> environment:user.dir=/
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,028:9274(0x7f8390836700):ZOO_INFO@zookeeper_init@786: Initiating 
> client connection, host=leader.mesos:2181 sessionTimeout=10000 
> watcher=0x7f8393fa9ae0 sessionId=0 sessionPasswd=<null> 
> context=0x7f8364000970 flags=0
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.028657  9274 slave.cpp:174] Slave started on 1)@10.229.44.239:5051
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: I0427 
> 23:29:39.028688  9274 slave.cpp:194] Moving slave process into its own cgroup 
> for subsystem: cpu
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,030:9274(0x7f83877c8700):ZOO_INFO@check_events@1703: initiated 
> connection to server [10.155.13.144:2181]
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: A slave (or 
> child process) is still running, please check the process(es) '{ 2242, 2243, 
> 2244 }' listed in /sys/fs/cgroup/cpu,cpuacct/mesos/slave/cgroups.proc
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal mesos-slave[9274]: 2015-04-27 
> 23:29:39,032:9274(0x7f83877c8700):ZOO_INFO@check_events@1750: session 
> establishment complete on server [10.155.13.144:2181], 
> sessionId=0x274cfd1112750268, negotiated timeout=10000
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: 
> mesos-slave.service: main process exited, code=exited, status=1/FAILURE
> Apr 27 23:29:39 ip-10-229-44-239.ec2.internal systemd[1]: Unit 
> mesos-slave.service entered failed state.
> {code}



