[
https://issues.apache.org/jira/browse/MESOS-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229772#comment-17229772
]
Sergei Hanus commented on MESOS-10197:
--------------------------------------
I also checked this with the latest 1.10 release, and the behavior is still the
same. The agent does not report the failed service state, even after I restart
the agent. Only cleaning up the corresponding executor in the meta folder
restores the service to a functional state.
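
For reference, here is a minimal sketch of that cleanup, assuming the default --work_dir of /var/lib/mesos and the usual meta/slaves/latest checkpoint layout (the framework and executor IDs are the ones from the logs below; adjust everything to your environment, and only run it while the agent is stopped):

{code:python}
#!/usr/bin/env python3
# Sketch only: remove the checkpointed metadata of a single stuck executor so
# the agent stops trying to recover it on the next restart.
# Assumption: agent --work_dir is /var/lib/mesos and the checkpoint layout is
# meta/slaves/latest/frameworks/<framework_id>/executors/<executor_id>.
import shutil
from pathlib import Path

WORK_DIR = Path("/var/lib/mesos")  # agent --work_dir (assumption)
FRAMEWORK_ID = "a99f25dd-d176-4ffd-9351-e70a357c1872-0000"
EXECUTOR_ID = "ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b"

executor_meta = (WORK_DIR / "meta" / "slaves" / "latest" / "frameworks"
                 / FRAMEWORK_ID / "executors" / EXECUTOR_ID)

if executor_meta.exists():
    print(f"Removing checkpointed executor state at {executor_meta}")
    shutil.rmtree(executor_meta)  # run only while the agent is stopped
else:
    print("No checkpointed state found for that executor")
{code}
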
> One of processes gets incorrect status after stopping and starting
> mesos-master and mesos-agent simultaneously
> --------------------------------------------------------------------------------------------------------------
>
> Key: MESOS-10197
> URL: https://issues.apache.org/jira/browse/MESOS-10197
> Project: Mesos
> Issue Type: Bug
> Reporter: Sergei Hanus
> Priority: Major
>
> We are using Mesos 1.8.0 together with Marathon 1.7.50.
> We run several child services under Marathon. When we stop and start all
> services (including mesos-master and mesos-agent) or simply reboot the
> server, everything usually comes back up in a functional state.
> However, sometimes we observe that one of the child services is reported as
> healthy, even though no such process exists on the server. When we restart
> the mesos-agent once more, this child service appears as a process and
> actually starts working.
>
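> Roughly, we confirm the mismatch with a check like the sketch below: ask the
> agent which containers it still believes are running and verify that the
> corresponding executor process actually exists (the agent address is the one
> from our logs; the JSON field names assume the agent's /containers endpoint
> and may differ between versions):
>
> {code:python}
> #!/usr/bin/env python3
> # Rough check: list the containers the agent still reports and flag those
> # whose executor process no longer exists on the host.
> # Assumption: the agent's /containers endpoint exposes "container_id",
> # "executor_id" and "status.executor_pid"; adjust if your version differs.
> import json
> import os
> import urllib.request
>
> AGENT = "http://10.100.5.141:5051"  # agent address taken from the logs below
>
> with urllib.request.urlopen(AGENT + "/containers") as response:
>     containers = json.load(response)
>
> for container in containers:
>     pid = container.get("status", {}).get("executor_pid")
>     if pid is None or not os.path.exists(f"/proc/{pid}"):
>         print(f"Agent still tracks container {container.get('container_id')} "
>               f"(executor '{container.get('executor_id')}'), "
>               f"but no executor process is running")
> {code}
>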
> At the same time, we observe the following messages in the executor's stderr:
>
> {code:java}
> I1103 01:46:18.725567 5922 exec.cpp:518] Agent exited, but framework has checkpointing enabled. Waiting 20secs to reconnect with agent a99f25dd-d176-4ffd-9351-e70a357c1872-S1
> 200
> I1103 01:46:25.777845 5921 checker_process.cpp:986] COMMAND health check for task 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' returned: 0
> I1103 01:46:38.732237 5925 exec.cpp:499] Recovery timeout of 20secs exceeded; Shutting down
> I1103 01:46:38.732295 5925 exec.cpp:445] Executor asked to shutdown
> I1103 01:46:38.732455 5925 executor.cpp:190] Received SHUTDOWN event
> I1103 01:46:38.732482 5925 executor.cpp:829] Shutting down
> I1103 01:46:38.733742 5925 executor.cpp:942] Sending SIGTERM to process tree at pid 5927
> W1103 01:46:38.741338 5926 process.cpp:1890] Failed to send 'mesos.internal.StatusUpdateMessage' to '10.100.5.141:5051', connect: Failed to connect to 10.100.5.141:5051: Connection refused
> I1103 01:46:38.771276 5925 executor.cpp:955] Sent SIGTERM to the following process trees:
> {code}
>
> And in mesos-slave.log:
>
> {code:java}
> I1103 01:48:08.291822 6542 slave.cpp:5491] Killing un-reregistered executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000 at executor(1)@10.100.5.141:36452
> I1103 01:48:08.291896 6542 slave.cpp:7848] Finished recovery
> {code}
>
> Also, in mesos-slave.log I see a number of messages for the services that
> were successfully restarted, like this:
>
> {code:java}
> I1103 01:48:06.278928 6542 containerizer.cpp:854] Recovering container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 for executor 'ia-cloud_worker-management-service.735fd7ad-1d67-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000
> I1103 01:48:06.285398 6541 containerizer.cpp:3117] Container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 has exited
> I1103 01:48:06.285406 6541 containerizer.cpp:2576] Destroying container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 in RUNNING state
> I1103 01:48:06.285413 6541 containerizer.cpp:3278] Transitioning the state of container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9 from RUNNING to DESTROYING
> I1103 01:48:06.311017 6541 launcher.cpp:161] Asked to destroy container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9
> W1103 01:48:06.398953 6540 containerizer.cpp:2375] Ignoring update for unknown container 3f082a3f-88a1-4ca8-b92c-94d4c55f80d9
> {code}
>
> But for the problematic service the messages are different (there is no
> event indicating that the container has exited):
>
>
> {code:java}
> I1103 01:48:06.278280 6542 containerizer.cpp:854] Recovering container 570461ae-ac26-445e-8cbf-42e366d5ee91 for executor 'ia-cloud_nexus.f09fb47b-1d66-11eb-ad1d-12962e9c065b' of framework a99f25dd-d176-4ffd-9351-e70a357c1872-0000
> I1103 01:48:08.292357 6542 containerizer.cpp:2576] Destroying container 570461ae-ac26-445e-8cbf-42e366d5ee91 in RUNNING state
> I1103 01:48:08.292371 6542 containerizer.cpp:3278] Transitioning the state of container 570461ae-ac26-445e-8cbf-42e366d5ee91 from RUNNING to DESTROYING
> I1103 01:48:08.292414 6542 launcher.cpp:161] Asked to destroy container 570461ae-ac26-445e-8cbf-42e366d5ee91
> {code}
>
>
>
> What could be the reason for such behavior, and how can we avoid it? If this
> service's state is stuck somewhere in the agent's internal structures (a
> metadata file on disk, or something like that), how could we clean up this state?
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)