Dalton Matos Coelho Barreto created MESOS-10066:
---------------------------------------------------
Summary: mesos-docker-executor process dies when agent stops.
Recovery fails when agent returns
Key: MESOS-10066
URL: https://issues.apache.org/jira/browse/MESOS-10066
Project: Mesos
Issue Type: Bug
Components: agent, containerization, docker, executor
Affects Versions: 1.7.3
Reporter: Dalton Matos Coelho Barreto
Hello all,
The Agent Recovery documentation lists two conditions for recovery to be
possible:
- The agent must have recovery enabled (default true?), which can be verified as sketched right after this list;
- The scheduler must register itself saying that it has checkpointing enabled.
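One way to double-check the first condition is to query the agent's {{/flags}} endpoint for its effective flags. This is only a sketch: the agent port (5051) is assumed to be the default, and {{recover}}, {{strict}} and {{recovery_timeout}} are the flag names I believe are relevant for recovery here.
{noformat}
$ curl -sL 10.234.172.56:5051/flags | jq '.flags | {recover, strict, recovery_timeout}'
{noformat}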
In my tests I'm using Marathon as the scheduler, and Mesos itself sees Marathon
as a checkpoint-enabled scheduler:
{noformat}
$ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name,
"id": .id, "checkpoint": .checkpoint, "active": .active}'
{
  "name": "asgard-chronos",
  "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
  "checkpoint": true,
  "active": true
}
{
  "name": "marathon",
  "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
  "checkpoint": true,
  "active": true
}
{noformat}
Here is what I'm using:
# Mesos Master: 1.4.1
# Mesos Agent: 1.7.3
# Using docker image {{mesos/mesos-centos:1.7.x}}
# Docker sock mounted from the host
# Docker binary also mounted from the host
# Marathon: 1.4.12
# Docker
{noformat}
Client: Docker Engine - Community
 Version:           19.03.5
 API version:       1.39 (downgraded from 1.40)
 Go version:        go1.12.12
 Git commit:        633a0ea838
 Built:             Wed Nov 13 07:22:05 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.2
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.6
  Git commit:       6247962
  Built:            Sun Feb 10 03:42:13 2019
  OS/Arch:          linux/amd64
  Experimental:     false
{noformat}
h2. The problem
Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}}
Docker image (a way to submit it to Marathon is sketched right after the JSON).
{noformat}
{
  "id": "/sleep",
  "cmd": "sleep 99d",
  "cpus": 0.1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "constraints": [],
  "acceptedResourceRoles": [
    "*"
  ],
  "container": {
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "debian",
      "network": "HOST",
      "privileged": false,
      "parameters": [],
      "forcePullImage": true
    }
  },
  "labels": {},
  "portDefinitions": []
}
{noformat}
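For completeness, the app definition above can be submitted to Marathon with a plain REST call. This is only a sketch: {{<marathon-host>}} is a placeholder, 8080 is assumed to be Marathon's default port, and {{sleep.json}} is assumed to hold the JSON above.
{noformat}
$ curl -X POST -H "Content-Type: application/json" \
    -d @sleep.json http://<marathon-host>:8080/v2/apps
{noformat}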
This task runs fine and gets scheduled on the right agent, which is running
Mesos agent 1.7.3 (using the docker image {{mesos/mesos-centos:1.7.x}}).
Here is a sample log:
{noformat}
mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing
task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating
sandbox
'/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating
sandbox
'/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching
executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 with resources
[{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
in work directory
'/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:21.396499 19849 slave.cpp:3078] Queued task
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for executor
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.397038 19849 slave.cpp:3526] Launching
container 53ec0ef3-3290-476a-b2b6-385099e9b923 for executor
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:21.398028 19846 docker.cpp:1177] Starting
container '53ec0ef3-3290-476a-b2b6-385099e9b923' for task
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' (and executor
'sleep.8c187c41-1762-11ea-a2e5-02429217540f') of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | W1205 13:24:22.576869 19846 slave.cpp:8496] Failed to get
resource statistics for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f'
of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to run
'/usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container
mesos-53ec0ef3-3290-476a-b2b6-385099e9b923': exited with status 1;
stderr='Error: No such container: mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
mesos-slave_1 | I1205 13:24:24.094985 19853 docker.cpp:792] Checkpointing
pid 12435 to
'/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid'
mesos-slave_1 | I1205 13:24:24.343099 19848 slave.cpp:4839] Got
registration for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of
framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from
executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.345593 19848 docker.cpp:1685] Ignoring
updating container 53ec0ef3-3290-476a-b2b6-385099e9b923 because resources
passed to update are identical to existing resources
mesos-slave_1 | I1205 13:24:24.345945 19848 slave.cpp:3296] Sending queued
task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' to executor
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.362699 19853 slave.cpp:5310] Handling
status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9)
for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.363222 19853
task_status_update_manager.cpp:328] Received task status update TASK_STARTING
(Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.364115 19853
task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update
TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.364993 19847 slave.cpp:5815] Forwarding the
update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for
task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to [email protected]:5050
mesos-slave_1 | I1205 13:24:24.365594 19847 slave.cpp:5726] Sending
acknowledgement for status update TASK_STARTING (Status UUID:
8b06fe8c-d709-453b-817d-2948e50782c9) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:24.401759 19846
task_status_update_manager.cpp:401] Received task status update acknowledgement
(UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:24.401926 19846
task_status_update_manager.cpp:842] Checkpointing ACK for task status update
TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.481829 19850 slave.cpp:5310] Handling
status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053)
for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:30.482340 19848
task_status_update_manager.cpp:328] Received task status update TASK_RUNNING
(Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.482550 19848
task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update
TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.483163 19848 slave.cpp:5815] Forwarding the
update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for
task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to [email protected]:5050
mesos-slave_1 | I1205 13:24:30.483664 19848 slave.cpp:5726] Sending
acknowledgement for status update TASK_RUNNING (Status UUID:
480abd48-6a85-4703-a2f2-0455f5c4c053) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
mesos-slave_1 | I1205 13:24:30.557307 19852
task_status_update_manager.cpp:401] Received task status update acknowledgement
(UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:24:30.557467 19852
task_status_update_manager.cpp:842] Checkpointing ACK for task status update
TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
{noformat}
An important part of this log:
{noformat}
Sending acknowledgement for status update TASK_RUNNING for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f to executor(1)@10.234.172.56:16653
{noformat}
Here we have the address of the executor responsible for our newly created
task: {{executor(1)@10.234.172.56:16653}}.
Looking at the OS process list, we see one instance of
{{mesos-docker-executor}}:
{noformat}
ps aux | grep executor
root 12435 0.5 1.3 851796 54100 ? Ssl 10:24 0:03
mesos-docker-executor --cgroups_enable_cfs=true
--container=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --docker=/usr/bin/docker
--docker_socket=/var/run/docker.sock --help=false
--initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos
--logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox
--quiet=false
--sandbox_directory=/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923
--stop_timeout=10secs
root 12456 0.0 0.8 152636 34548 ? Sl 10:24 0:00
/usr/bin/docker -H unix:///var/run/docker.sock run --cpu-shares 102 --cpu-quota
10000 --memory 134217728 -e HOST=10.234.172.56 -e
MARATHON_APP_DOCKER_IMAGE=debian -e MARATHON_APP_ID=/sleep -e
MARATHON_APP_LABELS=HOLLOWMAN_DEFAULT_SCALE -e
MARATHON_APP_LABEL_HOLLOWMAN_DEFAULT_SCALE=1 -e MARATHON_APP_RESOURCE_CPUS=0.1
-e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e
MARATHON_APP_RESOURCE_MEM=128.0 -e
MARATHON_APP_VERSION=2019-12-05T13:24:18.206Z -e
MESOS_CONTAINER_NAME=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 -e
MESOS_SANDBOX=/mnt/mesos/sandbox -e
MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f -v
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923:/mnt/mesos/sandbox
--net host --entrypoint /bin/sh --name
mesos-53ec0ef3-3290-476a-b2b6-385099e9b923
--label=hollowman.appname=/asgard/sleep
--label=MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f debian -c
sleep 99d
root 16776 0.0 0.0 7960 812 pts/3 S+ 10:34 0:00 tail -f
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
{noformat}
Here we have 3 processes:
- Mesos executor;
- The container that this executor started, running {{sleep 99d}};
- A {{tail -f}} watching the {{stderr}} of this executor.
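As a cross-check, the executor pid in the process list matches the pid the agent checkpointed earlier (see the "Checkpointing pid 12435" line in the agent log above):
{noformat}
$ cat /var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid
12435
{noformat}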
The {{stderr}} content is as follows:
{noformat}
$ tail -f
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
I1205 13:24:24.337672 12435 exec.cpp:162] Version: 1.7.3
I1205 13:24:24.347107 12449 exec.cpp:236] Executor registered on agent
79ad3a13-b567-4273-ac8c-30378d35a439-S60499
I1205 13:24:24.349907 12451 executor.cpp:130] Registered docker executor on
10.234.172.56
I1205 13:24:24.350291 12449 executor.cpp:186] Starting task
sleep.8c187c41-1762-11ea-a2e5-02429217540f
WARNING: Your kernel does not support swap limit capabilities or the cgroup
is not mounted. Memory limited without swap.
W1205 13:24:29.363168 12455 executor.cpp:253] Docker inspect timed out
after 5secs for container 'mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
{noformat}
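As a side note on the "Docker inspect timed out" warning above, the same inspect the agent/executor runs can be reproduced by hand while the task is up (the command is copied from the agent log earlier in this report):
{noformat}
$ /usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container \
    mesos-53ec0ef3-3290-476a-b2b6-385099e9b923
{noformat}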
While the Mesos Agent is still running, it is possible to connect to this
executor port.
{noformat}
telnet 10.234.172.56 16653
Trying 10.234.172.56...
Connected to 10.234.172.56.
Escape character is '^]'.
^]
telnet> Connection closed.
{noformat}
As soon as the Agent shuts down (a simple {{docker stop}} of the agent
container), this line appears in the {{stderr}} log of the executor:
{noformat}
I1205 13:40:31.160290 12452 exec.cpp:518] Agent exited, but framework has
checkpointing enabled. Waiting 15mins to reconnect with agent
79ad3a13-b567-4273-ac8c-30378d35a439-S60499
{noformat}
And now, looking at the OS process list, the executor process is gone:
{noformat}
$ ps aux | grep executor
root 16776 0.0 0.0 7960 812 pts/3 S+ 10:34 0:00 tail -f
/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
{noformat}
and we can't telnet to that port anymore:
{noformat}
telnet 10.234.172.56 16653
Trying 10.234.172.56...
telnet: Unable to connect to remote host: Connection refused
{noformat}
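To help narrow down whether only the executor died (as opposed to the task container itself), the container can also be looked up directly in Docker; the filter value below is the container name from the {{docker run}} command shown earlier:
{noformat}
$ docker ps --filter name=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923
{noformat}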
When we run the Mesos Agent again, we see this:
{noformat}
mesos-slave_1 | I1205 13:42:39.925992 18421 docker.cpp:899] Recovering
Docker containers
mesos-slave_1 | I1205 13:42:44.006150 18421 docker.cpp:912] Got the list
of Docker containers
mesos-slave_1 | I1205 13:42:44.010700 18421 docker.cpp:1009] Recovering
container '53ec0ef3-3290-476a-b2b6-385099e9b923' for executor
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | W1205 13:42:44.011874 18421 docker.cpp:1053] Failed to
connect to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to connect to
10.234.172.56:16653: Connection refused
mesos-slave_1 | I1205 13:42:44.016306 18421 docker.cpp:2564] Executor for
container 53ec0ef3-3290-476a-b2b6-385099e9b923 has exited
mesos-slave_1 | I1205 13:42:44.016412 18421 docker.cpp:2335] Destroying
container 53ec0ef3-3290-476a-b2b6-385099e9b923 in RUNNING state
mesos-slave_1 | I1205 13:42:44.016705 18421 docker.cpp:1133] Finished
processing orphaned Docker containers
mesos-slave_1 | I1205 13:42:44.017102 18421 docker.cpp:2385] Running
docker stop on container 53ec0ef3-3290-476a-b2b6-385099e9b923
mesos-slave_1 | I1205 13:42:44.017251 18426 slave.cpp:7205] Recovering
executors
mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending
reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of
framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at
executor(1)@10.234.172.56:16653
mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to
send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653',
connect: Failed to connect to 10.234.172.56:16653: Connection refused
mesos-slave_1 | I1205 13:42:44.019569 18422 slave.cpp:6336] Executor
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 has terminated with unknown status
mesos-slave_1 | I1205 13:42:44.022147 18422 slave.cpp:5310] Handling
status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7)
for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from @0.0.0.0:0
mesos-slave_1 | W1205 13:42:44.023823 18423 docker.cpp:1672] Ignoring
updating unknown container 53ec0ef3-3290-476a-b2b6-385099e9b923
mesos-slave_1 | I1205 13:42:44.024247 18425
task_status_update_manager.cpp:328] Received task status update TASK_FAILED
(Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:42:44.024363 18425
task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update
TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task
sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework
4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
mesos-slave_1 | I1205 13:42:46.020193 18423 slave.cpp:5238] Cleaning up
un-reregistered executors
mesos-slave_1 | I1205 13:42:46.020882 18423 slave.cpp:7358] Finished
recovery
{noformat}
A new container is then scheduled to run, and Marathon shows a status of
{{TASK_FAILED}} with the message {{Container terminated}} for the previous task.
Note in particular the part where the Agent tries to reconnect to the
executor port:
{noformat}
mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending
reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of
framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at
executor(1)@10.234.172.56:16653
mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to
send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653',
connect: Failed to connect to 10.234.172.56:16653: Connection refused
{noformat}
Looking at the Mesos Agent recovery documentation, the setup seems very
straightforward. I'm using the default {{mesos-docker-executor}} to run all
Docker tasks. Is there any detail that I'm not seeing, or something
misconfigured in my setup?
Can anyone confirm that the {{mesos-docker-executor}} is capable of doing task
recovery?
Thank you very much, and if you need any additional information, let me know.
Thanks,