Dalton Matos Coelho Barreto created MESOS-10066:
---------------------------------------------------

             Summary: mesos-docker-executor process dies when agent stops. 
Recovery fails when agent returns
                 Key: MESOS-10066
                 URL: https://issues.apache.org/jira/browse/MESOS-10066
             Project: Mesos
          Issue Type: Bug
          Components: agent, containerization, docker, executor
    Affects Versions: 1.7.3
            Reporter: Dalton Matos Coelho Barreto


Hello all,

The agent recovery documentation lists two conditions for recovery to be 
possible:
 - The agent must have recovery enabled (this is the default, via 
{{--recover=reconnect}}; see the flag sketch below);
 - The scheduler must register itself with checkpointing enabled.
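
For reference, here is a minimal sketch of the agent-side flags involved. The 
flag names come from the Mesos agent documentation; the values shown are the 
documented defaults, not necessarily this cluster's configuration:
{noformat}
    mesos-agent --work_dir=/var/lib/mesos/agent \
                --recover=reconnect \
                --recovery_timeout=15mins

    # --recover=reconnect (the default) enables recovery; "cleanup" disables it.
    # --recovery_timeout bounds how long executors wait for the agent to return.
{noformat}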

In my tests I'm using Marathon as the scheduler, and Mesos itself sees 
Marathon as a checkpoint-enabled scheduler:
{noformat}
    $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
"id": .id, "checkpoint": .checkpoint, "active": .active}'
    {
      "name": "asgard-chronos",
      "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
      "checkpoint": true,
      "active": true
    }
    {
      "name": "marathon",
      "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
      "checkpoint": true,
      "active": true
    }
{noformat}
Here is what I'm using:
 # Mesos Master: 1.4.1
 # Mesos Agent: 1.7.3, run from the docker image {{mesos/mesos-centos:1.7.x}}
 # Docker socket mounted from the host
 # Docker binary also mounted from the host
 # Marathon: 1.4.12
 # Docker (client and server versions below):
{noformat}
    Client: Docker Engine - Community
     Version:           19.03.5
     API version:       1.39 (downgraded from 1.40)
     Go version:        go1.12.12
     Git commit:        633a0ea838
     Built:             Wed Nov 13 07:22:05 2019
     OS/Arch:           linux/amd64
     Experimental:      false
    
    Server: Docker Engine - Community
     Engine:
      Version:          18.09.2
      API version:      1.39 (minimum version 1.12)
      Go version:       go1.10.6
      Git commit:       6247962
      Built:            Sun Feb 10 03:42:13 2019
      OS/Arch:          linux/amd64
      Experimental:     false
{noformat}

h2. The problem

Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} 
docker image.
{noformat}
    {
      "id": "/sleep",
      "cmd": "sleep 99d",
      "cpus": 0.1,
      "mem": 128,
      "disk": 0,
      "instances": 1,
      "constraints": [],
      "acceptedResourceRoles": [
        "*"
      ],
      "container": {
        "type": "DOCKER",
        "volumes": [],
        "docker": {
          "image": "debian",
          "network": "HOST",
          "privileged": false,
          "parameters": [],
          "forcePullImage": true
        }
      },
      "labels": {},
      "portDefinitions": []
    }
{noformat}
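For completeness, the app can be deployed with a plain Marathon API call along 
these lines (the Marathon host/port is a placeholder, and {{sleep.json}} is 
the definition above saved to a file):
{noformat}
    $ curl -X POST -H "Content-Type: application/json" \
          http://<marathon-host>:8080/v2/apps -d @sleep.json
{noformat}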
This task runs fine and gets scheduled on the right agent, which is running 
Mesos agent 1.7.3 (via the docker image {{mesos/mesos-centos:1.7.x}}).

Here is a sample log:
{noformat}
    mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
sandbox 
'/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
    mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
sandbox 
'/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
    mesos-slave_1  | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching 
executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 with resources 
[{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
 in work directory 
'/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
    mesos-slave_1  | I1205 13:24:21.396499 19849 slave.cpp:3078] Queued task 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:21.397038 19849 slave.cpp:3526] Launching 
container 53ec0ef3-3290-476a-b2b6-385099e9b923 for executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:21.398028 19846 docker.cpp:1177] Starting 
container '53ec0ef3-3290-476a-b2b6-385099e9b923' for task 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' (and executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f') of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | W1205 13:24:22.576869 19846 slave.cpp:8496] Failed to get 
resource statistics for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' 
of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to run 
'/usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container 
mesos-53ec0ef3-3290-476a-b2b6-385099e9b923': exited with status 1; 
stderr='Error: No such container: mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
    mesos-slave_1  | I1205 13:24:24.094985 19853 docker.cpp:792] Checkpointing 
pid 12435 to 
'/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid'
    mesos-slave_1  | I1205 13:24:24.343099 19848 slave.cpp:4839] Got 
registration for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of 
framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from 
executor(1)@10.234.172.56:16653
    mesos-slave_1  | I1205 13:24:24.345593 19848 docker.cpp:1685] Ignoring 
updating container 53ec0ef3-3290-476a-b2b6-385099e9b923 because resources 
passed to update are identical to existing resources
    mesos-slave_1  | I1205 13:24:24.345945 19848 slave.cpp:3296] Sending queued 
task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' to executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
    mesos-slave_1  | I1205 13:24:24.362699 19853 slave.cpp:5310] Handling 
status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) 
for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
    mesos-slave_1  | I1205 13:24:24.363222 19853 
task_status_update_manager.cpp:328] Received task status update TASK_STARTING 
(Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:24.364115 19853 
task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update 
TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:24.364993 19847 slave.cpp:5815] Forwarding the 
update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for 
task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to [email protected]:5050
    mesos-slave_1  | I1205 13:24:24.365594 19847 slave.cpp:5726] Sending 
acknowledgement for status update TASK_STARTING (Status UUID: 
8b06fe8c-d709-453b-817d-2948e50782c9) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
    mesos-slave_1  | I1205 13:24:24.401759 19846 
task_status_update_manager.cpp:401] Received task status update acknowledgement 
(UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:24.401926 19846 
task_status_update_manager.cpp:842] Checkpointing ACK for task status update 
TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:30.481829 19850 slave.cpp:5310] Handling 
status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) 
for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
    mesos-slave_1  | I1205 13:24:30.482340 19848 
task_status_update_manager.cpp:328] Received task status update TASK_RUNNING 
(Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:30.482550 19848 
task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update 
TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:30.483163 19848 slave.cpp:5815] Forwarding the 
update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for 
task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to [email protected]:5050
    mesos-slave_1  | I1205 13:24:30.483664 19848 slave.cpp:5726] Sending 
acknowledgement for status update TASK_RUNNING (Status UUID: 
480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
    mesos-slave_1  | I1205 13:24:30.557307 19852 
task_status_update_manager.cpp:401] Received task status update acknowledgement 
(UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:24:30.557467 19852 
task_status_update_manager.cpp:842] Checkpointing ACK for task status update 
TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
{noformat}
An important part of this log:
{noformat}
Sending acknowledgement for status update TASK_RUNNING for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f to executor(1)@10.234.172.56:16653
{noformat}
Here we have the address of the executor responsible for our newly created 
task: {{executor(1)@10.234.172.56:16653}}.
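
To double-check which process owns that port, something like this can be run 
on the agent host (standard {{iproute2}} tooling; the listener should be the 
{{mesos-docker-executor}} pid shown in the {{ps}} output below):
{noformat}
    $ sudo ss -tlnp | grep 16653
{noformat}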

Looking at the OS process list, we see one instance of 
{{mesos-docker-executor}}:
{noformat}
    ps aux | grep executor
    root     12435  0.5  1.3 851796 54100 ?        Ssl  10:24   0:03 
mesos-docker-executor --cgroups_enable_cfs=true 
--container=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --docker=/usr/bin/docker 
--docker_socket=/var/run/docker.sock --help=false 
--initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos 
--logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox 
--quiet=false 
--sandbox_directory=/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923
 --stop_timeout=10secs
    root     12456  0.0  0.8 152636 34548 ?        Sl   10:24   0:00 
/usr/bin/docker -H unix:///var/run/docker.sock run --cpu-shares 102 --cpu-quota 
10000 --memory 134217728 -e HOST=10.234.172.56 -e 
MARATHON_APP_DOCKER_IMAGE=debian -e MARATHON_APP_ID=/sleep -e 
MARATHON_APP_LABELS=HOLLOWMAN_DEFAULT_SCALE -e 
MARATHON_APP_LABEL_HOLLOWMAN_DEFAULT_SCALE=1 -e MARATHON_APP_RESOURCE_CPUS=0.1 
-e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e 
MARATHON_APP_RESOURCE_MEM=128.0 -e 
MARATHON_APP_VERSION=2019-12-05T13:24:18.206Z -e 
MESOS_CONTAINER_NAME=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 -e 
MESOS_SANDBOX=/mnt/mesos/sandbox -e 
MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f -v 
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923:/mnt/mesos/sandbox
 --net host --entrypoint /bin/sh --name 
mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 
--label=hollowman.appname=/asgard/sleep 
--label=MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f debian -c 
sleep 99d
    root     16776  0.0  0.0   7960   812 pts/3    S+   10:34   0:00 tail -f 
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
{noformat}
Here we have 3 processes:
 - The Mesos executor ({{mesos-docker-executor}}, pid 12435);
 - The {{docker run}} client process that launched the container running 
{{sleep 99d}};
 - A {{tail -f}} on this executor's {{stderr}}.
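
As a sanity check, the pid that the agent checkpointed (the {{Checkpointing 
pid 12435}} log line above) can be compared against the {{ps}} output:
{noformat}
    $ cat /var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid
    12435
{noformat}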

The {{stderr}} content is as follows:
{noformat}
    $ tail -f 
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
    I1205 13:24:24.337672 12435 exec.cpp:162] Version: 1.7.3
    I1205 13:24:24.347107 12449 exec.cpp:236] Executor registered on agent 
79ad3a13-b567-4273-ac8c-30378d35a439-S60499
    I1205 13:24:24.349907 12451 executor.cpp:130] Registered docker executor on 
10.234.172.56
    I1205 13:24:24.350291 12449 executor.cpp:186] Starting task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f
    WARNING: Your kernel does not support swap limit capabilities or the cgroup 
is not mounted. Memory limited without swap.
    W1205 13:24:29.363168 12455 executor.cpp:253] Docker inspect timed out 
after 5secs for container 'mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
{noformat}
While the Mesos agent is still running, it is possible to connect to this 
executor's port:
{noformat}
    telnet 10.234.172.56 16653
    Trying 10.234.172.56...
    Connected to 10.234.172.56.
    Escape character is '^]'.
    ^]
    telnet> Connection closed.
{noformat}
As soon as the agent shuts down (a simple {{docker stop}} of the agent 
container), this line appears in the executor's {{stderr}} log:
{noformat}
    I1205 13:40:31.160290 12452 exec.cpp:518] Agent exited, but framework has 
checkpointing enabled. Waiting 15mins to reconnect with agent 
79ad3a13-b567-4273-ac8c-30378d35a439-S60499
{noformat}
And now, looking at the OS process list, the executor process is gone, even 
though it had just promised to wait 15 minutes (the agent's default 
{{--recovery_timeout}}) to reconnect:
{noformat}
    $ ps aux | grep executor
    root     16776  0.0  0.0   7960   812 pts/3    S+   10:34   0:00 tail -f 
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
{noformat}
and we can't telnet to that port anymore:
{noformat}
    telnet 10.234.172.56 16653
    Trying 10.234.172.56...
    telnet: Unable to connect to remote host: Connection refused
{noformat}
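
Note that only the executor process is gone at this point; the recovery log 
below ({{Destroying container ... in RUNNING state}}) suggests the Docker 
container itself kept running. This can be confirmed with a standard Docker 
CLI query:
{noformat}
    $ docker ps --filter name=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923
{noformat}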
When we start the Mesos agent again, we see this:
{noformat}
    mesos-slave_1  | I1205 13:42:39.925992 18421 docker.cpp:899] Recovering 
Docker containers
    mesos-slave_1  | I1205 13:42:44.006150 18421 docker.cpp:912] Got the list 
of Docker containers
    mesos-slave_1  | I1205 13:42:44.010700 18421 docker.cpp:1009] Recovering 
container '53ec0ef3-3290-476a-b2b6-385099e9b923' for executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | W1205 13:42:44.011874 18421 docker.cpp:1053] Failed to 
connect to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to connect to 
10.234.172.56:16653: Connection refused
    mesos-slave_1  | I1205 13:42:44.016306 18421 docker.cpp:2564] Executor for 
container 53ec0ef3-3290-476a-b2b6-385099e9b923 has exited
    mesos-slave_1  | I1205 13:42:44.016412 18421 docker.cpp:2335] Destroying 
container 53ec0ef3-3290-476a-b2b6-385099e9b923 in RUNNING state
    mesos-slave_1  | I1205 13:42:44.016705 18421 docker.cpp:1133] Finished 
processing orphaned Docker containers
    mesos-slave_1  | I1205 13:42:44.017102 18421 docker.cpp:2385] Running 
docker stop on container 53ec0ef3-3290-476a-b2b6-385099e9b923
    mesos-slave_1  | I1205 13:42:44.017251 18426 slave.cpp:7205] Recovering 
executors
    mesos-slave_1  | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending 
reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of 
framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at 
executor(1)@10.234.172.56:16653
    mesos-slave_1  | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to 
send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', 
connect: Failed to connect to 10.234.172.56:16653: Connection refused
    mesos-slave_1  | I1205 13:42:44.019569 18422 slave.cpp:6336] Executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 has terminated with unknown status
    mesos-slave_1  | I1205 13:42:44.022147 18422 slave.cpp:5310] Handling 
status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) 
for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from @0.0.0.0:0
    mesos-slave_1  | W1205 13:42:44.023823 18423 docker.cpp:1672] Ignoring 
updating unknown container 53ec0ef3-3290-476a-b2b6-385099e9b923
    mesos-slave_1  | I1205 13:42:44.024247 18425 
task_status_update_manager.cpp:328] Received task status update TASK_FAILED 
(Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:42:44.024363 18425 
task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update 
TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task 
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
    mesos-slave_1  | I1205 13:42:46.020193 18423 slave.cpp:5238] Cleaning up 
un-reregistered executors
    mesos-slave_1  | I1205 13:42:46.020882 18423 slave.cpp:7358] Finished 
recovery
{noformat}
A new container is then scheduled to run. Marathon shows a status of 
{{TASK_FAILED}} with the message {{Container terminated}} for the previous 
task.
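
The terminal status of the old task can also be inspected through the 
Marathon API (host placeholder again; the {{lastTaskFailure}} embed is 
standard Marathon, though the exact response shape may vary by version):
{noformat}
    $ curl -s "http://<marathon-host>:8080/v2/apps/sleep?embed=app.lastTaskFailure" \
        | jq '.app.lastTaskFailure'
{noformat}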

Note especially the part where the agent tries to reconnect to the executor's 
port and fails:
{noformat}
    mesos-slave_1  | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending 
reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of 
framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at 
executor(1)@10.234.172.56:16653
    mesos-slave_1  | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to 
send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', 
connect: Failed to connect to 10.234.172.56:16653: Connection refused
{noformat}
Looking at the Mesos agent recovery documentation, the setup seems very 
straightforward. I'm using the default {{mesos-docker-executor}} to run all 
Docker tasks. Is there any detail I'm not seeing, or is something 
misconfigured in my setup?

Can anyone confirm that the {{mesos-docker-executor}} is capable of doing task 
recovery?

Thank you very much, and if you need any additional information, let me know.

Thanks,


