[
https://issues.apache.org/jira/browse/MESOS-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229772#comment-17229772
]
Sergei Hanus commented on MESOS-10197:
--------------------------------------
I also checked this with the latest 1.10 release, and the behavior is still the
same. The agent does not report the failed service state, even after I restart
the agent. Only cleaning up the corresponding executor in the meta folder
restores the service to a functional state.
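
For reference, here is a minimal sketch of that cleanup, assuming the default --work_dir of /var/lib/mesos and the usual meta/slaves/latest checkpoint layout (the framework and executor IDs are the ones from the logs below; adjust everything to your environment, and only run it while the agent is stopped):

{code:python}
#!/usr/bin/env python3
# Sketch only: remove the checkpointed metadata of a single stuck executor so
# the agent stops trying to recover it on the next restart.
# Assumption: agent --work_dir is /var/lib/mesos and the checkpoint layout is
# meta/slaves/latest/frameworks/<framework_id>/executors/<executor_id>.
import shutil
from pathlib import Path

WORK_DIR = Path("/var/lib/mesos")  # agent --work_dir (assumption)
FRAMEWORK_ID = "a99f25dd-d176-4ffd-9351-e70a357c1872-0000"
EXECUTOR_ID = "ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b"

executor_meta = (WORK_DIR / "meta" / "slaves" / "latest" / "frameworks"
                 / FRAMEWORK_ID / "executors" / EXECUTOR_ID)

if executor_meta.exists():
    print(f"Removing checkpointed executor state at {executor_meta}")
    shutil.rmtree(executor_meta)  # run only while the agent is stopped
else:
    print("No checkpointed state found for that executor")
{code}
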
> One of processes gets incorrect status after stopping and starting
> mesos-master and mesos-agent simultaneously
> --------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-10197
> URL: https://issues.apache.org/jira/browse/MESOS-10197
> Project: Mesos
> Issue Type: Bug
> Reporter: Sergei Hanus
> Priority: Major
>
> We are using Mesos 1.8.0 together with Marathon 1.7.50.
> We run several child services under Marathon. When we stop and start all
> services (including mesos-master and mesos-agent) or simply reboot the
> server, everything usually comes back up in a functional state.
> However, sometimes we observe that one of the child services is reported as
> healthy, even though no such process exists on the server. When we restart
> the mesos-agent once more, this child service appears as a process and
> actually starts working.
>
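> Roughly, we confirm the mismatch with a check like the sketch below: ask the
> agent which containers it still believes are running and verify that the
> corresponding executor process actually exists (the agent address is the one
> from our logs; the JSON field names assume the agent's /containers endpoint
> and may differ between versions):
>
> {code:python}
> #!/usr/bin/env python3
> # Rough check: list the containers the agent still reports and flag those
> # whose executor process no longer exists on the host.
> # Assumption: the agent's /containers endpoint exposes "container_id",
> # "executor_id" and "status.executor_pid"; adjust if your version differs.
> import json
> import os
> import urllib.request
>
> AGENT = "http://10.100.5.141:5051"  # agent address taken from the logs below
>
> with urllib.request.urlopen(AGENT + "/containers") as response:
>     containers = json.load(response)
>
> for container in containers:
>     pid = container.get("status", {}).get("executor_pid")
>     if pid is None or not os.path.exists(f"/proc/{pid}"):
>         print(f"Agent still tracks container {container.get('container_id')} "
>               f"(executor '{container.get('executor_id')}'), "
>               f"but no executor process is running")
> {code}
>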
> At the same time, we observe the following messages in the executor's stderr:
>
> {code:java}
> I1103 01:46:18.725567 5922 exec.cpp:518] Agent exited, but framework has checkpointing enabled. Waiting 20secs to reconnect with agent a99f25dd-d176-4ffd-9351-e70a357c1872-S1
> 200
> I1103 01:46:25.777845 5921 checker_process.cpp:986] COMMAND health check for task 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' returned: 0
> I1103 01:46:38.732237 5925 exec.cpp:499] Recovery timeout of 20secs exceeded; Shutting down
> I1103 01:46:38.732295 5925 exec.cpp:445] Executor asked to shutdown
> I1103 01:46:38.732455 5925 executor.cpp:190] Received SHUTDOWN event
> I1103 01:46:38.732482 5925 executor.cpp:829] Shutting down
> I1103 01:46:38.733742 5925 executor.cpp:942] Sending SIGTERM to process tree at pid 5927
> W1103 01:46:38.741338 5926 process.cpp:1890] Failed to send 'mesos.internal.StatusUpdateMessage' to '10.100.5.141:5051', connect: Failed to connect to 10.100.5.141:5051: Connection refused
> I1103 01:46:38.771276 5925 executor.cpp:955] Sent SIGTERM to the following process trees:
> {code}
>
> And in mesos-slave.log:
>
> {code:java}
> I1103 01:48:08.291822 6542 slave.cpp:5491] Killing un-reregistered executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at executor(1)@10.100.5.141:36452
> I1103 01:48:08.291896 6542 slave.cpp:7848] Finished recovery
> {code}
>
> Also, in mesos-slave.log I see a number of messages for the services that
> were successfully restarted, like this:
>
> {code:java}
> I1103 01:48:06.278928 6542 containerizer.cpp:854] Recovering container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 for executor 'ia-cloud_worker-management-service.735fd7ad-1d67-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000
> I1103 01:48:06.285398 6541 containerizer.cpp:3117] Container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 has exited
> I1103 01:48:06.285406 6541 containerizer.cpp:2576] Destroying container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 in RUNNING state
> I1103 01:48:06.285413 6541 containerizer.cpp:3278] Transitioning the state of container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 from RUNNING to DESTROYING
> I1103 01:48:06.311017 6541 launcher.cpp:161] Asked to destroy container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9
> W1103 01:48:06.398953 6540 containerizer.cpp:2375] Ignoring update for unknown container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9
> {code}
>
> But for the problematic service the messages are different (there is no
> event indicating that the container has exited):
>
>
> {code:java}
> I1103 01:48:06.278280 6542 containerizer.cpp:854] Recovering container 570461ae-ac26-445e-8cbf-42e366d5ee91 for executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000
> I1103 01:48:08.292357 6542 containerizer.cpp:2576] Destroying container 570461ae-ac26-445e-8cbf-42e366d5ee91 in RUNNING state
> I1103 01:48:08.292371 6542 containerizer.cpp:3278] Transitioning the state of container 570461ae-ac26-445e-8cbf-42e366d5ee91 from RUNNING to DESTROYING
> I1103 01:48:08.292414 6542 launcher.cpp:161] Asked to destroy container 570461ae-ac26-445e-8cbf-42e366d5ee91
> {code}
>
>
>
> What could be the reason for such behavior, and how can we avoid it? If this
> service's state is stuck somewhere in the agent's internal structures (a
> metadata file on disk, or something like that), how could we clean up this state?
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)