Sergei Hanus created MESOS-10197:
------------------------------------
Summary: One of processes gets incorrect status after stopping and
starting mesos-master and mesos-agent simultaneously
Key: MESOS-10197
URL: https://issues.apache.org/jira/browse/MESOS-10197
Project: Mesos
Issue Type: Bug
Reporter: Sergei Hanus
We are using Mesos 1.8.0 together with Marathon 1.7.50.
We run several child services under Marathon. When we stop and start all
services (including mesos-master and mesos-agent), or simply reboot the server,
everything usually returns to a working state.
Sometimes, however, one of the child services is reported as healthy even
though no such process exists on the server. When we restart mesos-agent
once more, this child service appears as a process and actually starts working.
At the same time, we observe the following message in the agent log:
{code:java}
I1103 01:48:08.291822 6542 slave.cpp:5491] Killing un-reregistered executor
'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework
a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at executor(1)@10.100.5.141:36452
I1103 01:48:08.291896 6542 slave.cpp:7848] Finished recovery
{code}
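To check how often this happens, the agent log can be scanned for this message. Below is a minimal sketch, not an official Mesos tool: the names `KILL_RE` and `find_killed_executors` are hypothetical, and the pattern assumes the message text matches the excerpt above once the wrapped lines are joined back into a single line.

```python
import re

# Pattern for the agent log message quoted above, assuming the wrapped
# lines have been joined into one line. Hypothetical helper, not a Mesos API.
KILL_RE = re.compile(
    r"Killing un-reregistered executor '(?P<executor>[^']+)' "
    r"of framework (?P<framework>\S+) at (?P<endpoint>\S+)"
)


def find_killed_executors(lines):
    """Return (executor_id, framework_id, endpoint) for each matching log line."""
    found = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            found.append(
                (match["executor"], match["framework"], match["endpoint"])
            )
    return found


# The two log lines from the report, with the first one un-wrapped.
sample = [
    "I1103 01:48:08.291822 6542 slave.cpp:5491] Killing un-reregistered executor "
    "'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework "
    "a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at executor(1)@10.100.5.141:36452",
    "I1103 01:48:08.291896 6542 slave.cpp:7848] Finished recovery",
]

for entry in find_killed_executors(sample):
    print(entry)
```

The extracted executor ID can then be compared against the tasks that Marathon reports as healthy to find the service whose process is missing.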
What could cause this behavior, and how can we avoid it? If this
service's state is stuck somewhere in the agent's internal structures (a
metadata file on disk, or something like that), how can we clean it up?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)