[ https://issues.apache.org/jira/browse/MESOS-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583940#comment-16583940 ]
A. Dukhovniy commented on MESOS-9162:
-------------------------------------

And the complete diagnostics bundle, just in case. [^diagnostics.zip]

Unkillable pod container stuck in ISOLATING
-------------------------------------------

                Key: MESOS-9162
                URL: https://issues.apache.org/jira/browse/MESOS-9162
            Project: Mesos
         Issue Type: Bug
   Affects Versions: 1.6.0, 1.7.0
           Reporter: A. Dukhovniy
           Priority: Major
        Attachments: dcos-marathon.service.log, dcos-mesos-master.service.gz,
                     dcos-mesos-slave.service.gz, diagnostics.zip,
                     sandbox_10_10_0_222_var_lib.tar.gz

We have a simple test that launches a pod with two containers (one writes to a file and the other reads it). This test is flaky because the container sometimes fails to start.

Marathon app definition:
{code:java}
{
  "id": "/simple-pod",
  "scaling": {
    "kind": "fixed",
    "instances": 1
  },
  "environment": {
    "PING": "PONG"
  },
  "containers": [
    {
      "name": "ct1",
      "resources": {
        "cpus": 0.1,
        "mem": 32
      },
      "image": {
        "kind": "DOCKER",
        "id": "busybox"
      },
      "exec": {
        "command": {
          "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
        }
      },
      "volumeMounts": [
        {
          "name": "v1",
          "mountPath": "test-v1"
        }
      ]
    },
    {
      "name": "ct2",
      "resources": {
        "cpus": 0.1,
        "mem": 32
      },
      "exec": {
        "command": {
          "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
        }
      },
      "volumeMounts": [
        {
          "name": "v1",
          "mountPath": "etc"
        },
        {
          "name": "v2",
          "mountPath": "docker"
        }
      ]
    }
  ],
  "networks": [
    {
      "mode": "host"
    }
  ],
  "volumes": [
    {
      "name": "v1"
    },
    {
      "name": "v2",
      "host": "/var/lib/docker"
    }
  ]
}
{code}

During the test, Marathon tries to launch the pod but never receives a {{TASK_RUNNING}} for the first container, so after two minutes it decides to kill the pod, which also fails.

The agent sandbox (attached to this ticket minus the Docker layers, since they're too big to attach) shows that one of the containers was never started properly; the last relevant line in the agent log says:
{code}
Transitioning the state of container ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 from PREPARING to ISOLATING
{code}
Up to that point the log looks pretty unspectacular.

Afterwards, Marathon tries to kill the container repeatedly, but doesn't succeed: the executor receives the requests but doesn't send anything back:
{code}
I0816 22:52:53.111995     4 default_executor.cpp:204] Received SUBSCRIBED event
I0816 22:52:53.112520     4 default_executor.cpp:208] Subscribed executor on 10.10.0.222
I0816 22:52:53.112783     4 default_executor.cpp:204] Received LAUNCH_GROUP event
I0816 22:52:53.116516    11 default_executor.cpp:428] Setting 'MESOS_CONTAINER_IP' to: 10.10.0.222
I0816 22:52:53.169596     4 default_executor.cpp:204] Received ACKNOWLEDGED event
I0816 22:52:53.194416    10 default_executor.cpp:204] Received ACKNOWLEDGED event
I0816 22:54:50.559470     8 default_executor.cpp:204] Received KILL event
I0816 22:54:50.559496     8 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1'
I0816 22:54:50.559737     4 default_executor.cpp:204] Received KILL event
I0816 22:54:50.559751     4 default_executor.cpp:1251] Received kill for task 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct2'
...
{code}
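For anyone reproducing this: a quick way to confirm that the stuck container is still known to the containerizer is the agent's {{/containers}} endpoint. A minimal sketch, assuming the agent listens on the default port 5051 on the node from the logs (nested pod containers may additionally need the {{show_nested=true}} query parameter, where the agent version supports it):
{code:python}
# A minimal sketch, not part of the original report: check whether the
# stuck container is still listed by the agent's /containers endpoint.
# Assumes the default agent port 5051 on the node seen in the logs.
import json
import urllib.request

AGENT = "http://10.10.0.222:5051"
CONTAINER_ID = "e9b05652-e779-46f8-9b76-b2e1ce7e5940"

# 'show_nested=true' asks for nested (pod) containers as well, on agent
# versions that support this query parameter.
with urllib.request.urlopen(AGENT + "/containers?show_nested=true") as resp:
    containers = json.load(resp)

for entry in containers:
    # Match anywhere in the serialized entry, since nested container ids
    # are wrapped in a parent/value structure.
    if CONTAINER_ID in json.dumps(entry):
        print(json.dumps(entry, indent=2))
{code}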
Relevant ids for grepping the logs:
{code}
Marathon app id:    /simple-pod-bcc8f180b611494aa972520b8b650ca9
Mesos task id:      simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1
Mesos container id: e9b05652-e779-46f8-9b76-b2e1ce7e5940
{code}
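To pull all of those lines out of the attached (gzipped) agent log in one go, something like the following works; the file name matches the attachment on this ticket, adjust the path as needed:
{code:python}
# A small helper, not part of the original report: print every line of the
# attached gzipped agent log that mentions the stuck pod or container.
import gzip

IDS = (
    "simple-pod-bcc8f180b611494aa972520b8b650ca9",
    "e9b05652-e779-46f8-9b76-b2e1ce7e5940",
)

with gzip.open("dcos-mesos-slave.service.gz", "rt", errors="replace") as log:
    for line in log:
        if any(token in line for token in IDS):
            print(line.rstrip())
{code}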