[ 
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020545#comment-17020545
 ] 

Benjamin Mahler commented on MESOS-10068:
-----------------------------------------

The first thing to note is that we don't yet have a formalized agent 
lifecycle in the API. We have AgentAdded / AgentRemoved, but internally there is 
also the notion of disconnecting, becoming unreachable, and getting transitioned 
to gone. So the API and the internals are a bit mismatched here, and, more 
broadly than this particular ticket, we would need to make them consistent to 
have events that make sense.

[~daltonmatos] It looks like the reason you're seeing no AGENT_REMOVED is that 
the agent became unreachable, and we don't send it in that case. The first 
case goes through a different path, where we were never able to communicate with 
the agent but don't know that, and the agent retries its registration. Upon 
seeing this, we remove the previous version of that agent and try to register 
the new one. You may see this repeating itself over and over.

[~greggomann] looks like we don't send AGENT_REMOVED when an agent is marked as 
gone? Seems like a bug due to {{__removeSlave}} being used for both marking 
unreachable and gone?
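
For anyone who wants to watch which agent events actually reach a subscriber, here is a minimal client-side sketch in Python. It assumes the streaming response of the {{SUBSCRIBE}} call is RecordIO-framed JSON (a decimal byte length, a newline, then that many bytes of JSON). The helper names {{decode_recordio}}, {{agent_events}} and {{frame}} are made up for illustration and are not part of Mesos, and the sample events used with it carry only a {{type}} field rather than the full payload the master sends.

```python
import json
from typing import Iterable, Iterator


def decode_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield JSON events from RecordIO-framed bytes: b"<length>\\n<json>"."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            header, sep, rest = buf.partition(b"\n")
            if not sep:
                break  # the length prefix has not fully arrived yet
            length = int(header)
            if len(rest) < length:
                break  # the record body has not fully arrived yet
            yield json.loads(rest[:length])
            buf = rest[length:]


def agent_events(events: Iterable[dict]) -> Iterator[dict]:
    """Keep only the agent lifecycle events discussed in this ticket."""
    for event in events:
        if event.get("type") in ("AGENT_ADDED", "AGENT_REMOVED"):
            yield event


def frame(event: dict) -> bytes:
    """Hypothetical test helper: encode one event as a RecordIO record."""
    body = json.dumps(event).encode()
    return str(len(body)).encode() + b"\n" + body
```

To try it against a live master, the chunks could come from any HTTP client that exposes the response body incrementally, for example {{requests.post(url, json={"type": "SUBSCRIBE"}, stream=True).iter_content(chunk_size=None)}}. The decoder buffers partial records, so it does not matter where the chunk boundaries fall.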



> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal 
> state
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-10068
>                 URL: https://issues.apache.org/jira/browse/MESOS-10068
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.3, 1.8.2, 1.9.1
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Major
>         Attachments: master-full-logs.log
>
>
> Hello,
>  
> Looking at the documentation of the master {{/api/v1}} endpoint, the 
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are 
> supported for this endpoint, but when a new agent joins the cluster an 
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event is 
> not received by clients subscribed to the master API.
>  
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all 
> using the docker image {{mesos/mesos-centos}}.
> The only time I saw an {{AGENT_REMOVED}} event was when a new agent joined the 
> cluster but the master couldn't communicate with it. In this specific 
> test there was a firewall blocking port {{5051}} on the slave, that is, 
> nobody was able to talk to the slave on port {{5051}}.
>  
> h2. Here are the steps to reproduce the problem
>  * Start a new mesos master
>  * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
>  ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
> {noformat}
>  * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
>  * Stop this slave;
>  * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the 
> field {{active=false}};
>  * Wait for the mesos master to stop listing this slave, that is, until 
> {{/slaves?slave_id=AGENT_ID}} returns an empty response.
> Even after the empty response, the event never reaches the subscriber.
>  
> The mesos master log shows this:
> {noformat}
>  I1213 15:03:10.338935    13 master.cpp:1297] Agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964) disconnected
> I1213 15:03:10.339089    13 master.cpp:3399] Disconnecting agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> I1213 15:03:10.339207    13 master.cpp:3418] Deactivating agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670    15 process.cpp:1917] Failed to send 
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to 
> connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007     7 hierarchical.cpp:900] Removed agent 
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1   {noformat}
>  
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>  
> I will attach the full master logs also.
>  
> Do you think this could be a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
