[
https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813335#comment-16813335
]
Andrei Budnik edited comment on MESOS-9709 at 4/9/19 1:24 PM:
--------------------------------------------------------------
This agent responds when the `/state` endpoint is polled, but hangs when polling
`/containers` and `/__processes__`.
GDB can't attach to the running agent - the attach hangs as well.
top -H -p `pidof mesos-agent` shows that one thread is stuck in the D
(uninterruptible sleep) state.
Here is the kernel stack trace of the agent's hung thread:
{code:none}
[<ffffffff895e20d2>] copy_net_ns+0xa2/0x180
[<ffffffff890c01b9>] create_new_namespaces+0xf9/0x180
[<ffffffff890c035e>] copy_namespaces+0x8e/0xd0
[<ffffffff8908f996>] copy_process+0xb66/0x1a40
[<ffffffff89090a21>] do_fork+0x91/0x320
[<ffffffff89090d36>] SyS_clone+0x16/0x20
[<ffffffff89720c14>] stub_clone+0x44/0x70
[<ffffffffffffffff>] 0xffffffffffffffff{code}
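A kernel stack like the one above can be read from procfs
(`/proc/<pid>/task/<tid>/stack`, root required). A small standalone helper
that dumps it for every thread of a process - this is an illustration only,
not part of Mesos:
{code}
// dump_kernel_stacks.cpp - prints the kernel stack of every thread of
// a process (illustration only; not part of Mesos).
// Build: g++ -std=c++17 -o dump_kernel_stacks dump_kernel_stacks.cpp
// Run:   sudo ./dump_kernel_stacks $(pidof mesos-agent)
#include <filesystem>
#include <fstream>
#include <iostream>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "Usage: " << argv[0] << " <pid>" << std::endl;
    return 1;
  }

  const std::filesystem::path tasks =
      std::filesystem::path("/proc") / argv[1] / "task";

  // Each subdirectory of /proc/<pid>/task is one thread (TID).
  for (const auto& entry : std::filesystem::directory_iterator(tasks)) {
    std::ifstream stack(entry.path() / "stack");
    if (!stack.is_open()) {
      continue;  // Requires root; a thread may also have exited mid-walk.
    }
    std::cout << "TID " << entry.path().filename().string() << ":\n"
              << stack.rdbuf() << std::endl;
  }
  return 0;
}
{code}
A thread stuck in D state shows the same stack on every read, which is how the
`copy_net_ns` hang above was identified.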
> Docker executor can become stuck terminating
> --------------------------------------------
>
> Key: MESOS-9709
> URL: https://issues.apache.org/jira/browse/MESOS-9709
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.8.0
> Reporter: Greg Mann
> Priority: Major
> Labels: containerization, mesosphere
> Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is
> {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the
> string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re
> mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching
> container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re
> mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container
> info found, skipping launch
> {code}
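> The "skipping launch" line indicates the docker containerizer declined the
> launch because the executor carried no docker {{ContainerInfo}}. A minimal,
> self-contained model of that guard (a hypothetical simplification, not the
> verbatim docker.cpp source):
> {code}
> #include <iostream>
> #include <optional>
>
> // Hypothetical simplification of the docker containerizer's launch
> // guard (not the verbatim Mesos docker.cpp source).
> enum class ContainerInfoType { MESOS, DOCKER };
>
> struct ContainerConfig {
>   // In this bug the executor carried no ContainerInfo, so this is unset.
>   std::optional<ContainerInfoType> containerInfo;
> };
>
> // Returns false when the docker containerizer declines the launch.
> bool launch(const ContainerConfig& config) {
>   if (!config.containerInfo ||
>       *config.containerInfo != ContainerInfoType::DOCKER) {
>     std::cout << "No container info found, skipping launch" << std::endl;
>     return false;
>   }
>   // A real launch would fork `docker run` here.
>   return true;
> }
>
> int main() {
>   launch(ContainerConfig{});  // Prints the log line seen in the agent log.
> }
> {code}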
> I'm not sure why the container info was not set. Once the executor
> reregistration timeout elapses, the agent attempts to terminate the executor,
> but the termination never completes. The scheduler continues trying to kill
> the task, but we repeatedly see
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re
> mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill
> task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339
> because the executor
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000 is terminating
> {code}
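> The repeated warning comes from the agent's kill-task path, which drops kills
> while the executor is in a terminating state; since the termination never
> completes here, every retry is dropped. A minimal model of that behavior
> (a hypothetical simplification, not the verbatim slave.cpp source):
> {code}
> #include <iostream>
> #include <string>
>
> // Hypothetical simplification of the agent's kill-task path (not the
> // verbatim slave.cpp source). Mesos tracks executors through these
> // states; this bug leaves the executor parked in TERMINATING.
> enum class ExecutorState { REGISTERING, RUNNING, TERMINATING, TERMINATED };
>
> void killTask(const std::string& taskId, ExecutorState executorState) {
>   if (executorState == ExecutorState::TERMINATING) {
>     // The kill is dropped and nothing re-drives the termination, so
>     // every retry from the scheduler hits this same branch.
>     std::cout << "Ignoring kill task " << taskId
>               << " because the executor is terminating" << std::endl;
>     return;
>   }
>   // Otherwise the kill would be forwarded to the executor.
> }
>
> int main() {
>   killTask(
>       "datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339",
>       ExecutorState::TERMINATING);
> }
> {code}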