[ 
https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813335#comment-16813335
 ] 

Andrei Budnik commented on MESOS-9709:
--------------------------------------

This agent responds when its `/state` endpoint is polled, but hangs when 
`/containers` or `/__processes__` is polled.
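
To reproduce this kind of check without blocking the caller, each endpoint can be requested with a timeout. The following is only a minimal sketch: the agent address (`127.0.0.1:5051`, the default agent port) and the 5-second timeout are assumptions, and the helper names `probe` and `probe_all` are hypothetical.

```python
import socket
import urllib.error
import urllib.request

# Assumption: the agent listens on localhost at the default agent port.
AGENT = "http://127.0.0.1:5051"
# Endpoints named in the comment above.
ENDPOINTS = ["/state", "/containers", "/__processes__"]


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Issue one GET request and return 'ok', 'timeout', or 'error: ...'."""
    try:
        with opener(url, timeout=timeout) as response:
            response.read()
        return "ok"
    except (socket.timeout, TimeoutError):
        # The read did not complete within the timeout.
        return "timeout"
    except urllib.error.URLError as exc:
        # urlopen wraps connect-phase timeouts in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"error: {exc.reason}"
    except OSError as exc:
        return f"error: {exc}"


def probe_all(base=AGENT, endpoints=ENDPOINTS, timeout=5.0,
              opener=urllib.request.urlopen):
    """Probe every endpoint in turn and map its path to the outcome."""
    return {path: probe(base + path, timeout, opener) for path in endpoints}
```

Calling `probe_all()` on the agent host returns a dictionary keyed by endpoint path; an endpoint that accepts the connection but never answers is reported as `timeout` instead of blocking indefinitely.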

GDB can't attach to the running agent; the attach attempt hangs.

Here is the kernel stack trace of the agent's hanging thread:
{code}
[<ffffffff895e20d2>] copy_net_ns+0xa2/0x180
[<ffffffff890c01b9>] create_new_namespaces+0xf9/0x180
[<ffffffff890c035e>] copy_namespaces+0x8e/0xd0
[<ffffffff8908f996>] copy_process+0xb66/0x1a40
[<ffffffff89090a21>] do_fork+0x91/0x320
[<ffffffff89090d36>] SyS_clone+0x16/0x20
[<ffffffff89720c14>] stub_clone+0x44/0x70
[<ffffffffffffffff>] 0xffffffffffffffff{code}
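
The comment does not say how the trace was collected. One way to get kernel stacks when a debugger cannot attach is to read `/proc/<pid>/task/<tid>/stack` for each thread. The following is a rough sketch under that assumption (Linux only, typically requires root); the helper names `parse_frames`, `thread_stacks`, and `threads_blocked_in` are hypothetical.

```python
import os
import re

# Matches lines such as "[<ffffffff895e20d2>] copy_net_ns+0xa2/0x180".
# The "+offset/size" suffix is optional, so a bare address also parses.
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<sym>\S+?)"
    r"(?:\+(?P<off>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+))?$"
)


def parse_frames(text):
    """Return the symbol of each frame, innermost first."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append(match.group("sym"))
    return frames


def thread_stacks(pid):
    """Map each thread ID of `pid` to its parsed kernel stack."""
    stacks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stack") as handle:
                stacks[int(tid)] = parse_frames(handle.read())
        except OSError:
            continue  # thread exited, or reading was not permitted
    return stacks


def threads_blocked_in(pid, symbol):
    """List thread IDs whose innermost kernel frame is `symbol`."""
    return [tid for tid, frames in thread_stacks(pid).items()
            if frames and frames[0] == symbol]
```

For example, `threads_blocked_in(agent_pid, "copy_net_ns")` would list the threads whose innermost frame matches the trace above, where `agent_pid` stands for the agent's process ID.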

> Docker executor can become stuck terminating
> --------------------------------------------
>
>                 Key: MESOS-9709
>                 URL: https://issues.apache.org/jira/browse/MESOS-9709
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.8.0
>            Reporter: Greg Mann
>            Priority: Major
>              Labels: containerization, mesosphere
>         Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is 
> {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the 
> string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching 
> container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container 
> info found, skipping launch
> {code}
> I'm not sure why the container info was not set. Once the executor 
> reregistration timeout elapses, the agent attempts to terminate the executor 
> but it does not seem to be successful. The scheduler continues to try to kill 
> the task but we repeatedly see
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill 
> task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 
> because the executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000 is terminating
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
