[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17020545#comment-17020545 ]
Benjamin Mahler commented on MESOS-10068:
-----------------------------------------

The first thing to comment on is that we don't yet have a formalized agent lifecycle in the API. We have AgentAdded / AgentRemoved, but internally there is also the notion of disconnecting, becoming unreachable, and getting transitioned to gone. So the API and internals are a bit mismatched here and, more broadly than this particular ticket, we would need to make them consistent to have events that make sense.

[~daltonmatos] It looks like the reason you're seeing no AGENT_REMOVED is that the agent became unreachable, and we don't send it in that case. The first case goes through a different path: we were never able to communicate with the agent, but we don't know that, and the agent retries its registration. Upon seeing this, we remove the previous version of that agent and try to register the new one. You may see this repeating itself over and over.

[~greggomann] looks like we don't send AGENT_REMOVED when an agent is marked as gone? Seems like a bug due to {{__removeSlave}} being used for both marking unreachable and gone?

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal
> state
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-10068
>                 URL: https://issues.apache.org/jira/browse/MESOS-10068
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.3, 1.8.2, 1.9.1
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Major
>         Attachments: master-full-logs.log
>
>
> Hello,
>
> Looking at the documentation of the master {{/api/v1}} endpoint, the
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are
> supported for this endpoint, but when a new agent joins the cluster an
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event
> is not received by clients subscribed to the master API.
>
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all
> using the docker image {{mesos/mesos-centos}}.
> The only way I saw an {{AGENT_REMOVED}} event was when a new agent joined the
> cluster but the master couldn't communicate with this agent. In this specific
> test there was a firewall blocking port {{5051}} on the slave, that is,
> nobody was able to talk to the slave on port {{5051}}.
>
> h2. Here are the steps to reproduce the problem
> * Start a new Mesos master;
> * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
> **
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1{noformat}
> * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
> * Stop this slave;
> * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the
> field {{active=false}};
> * Wait for the Mesos master to stop listing this slave, that is,
> {{/slaves?slave_id=AGENT_ID}} returns an empty response.
> Even after the empty response, the event never reaches the subscriber.
>
> The Mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 {noformat}
>
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>
> I will attach the full master logs also.
>
> Do you think this could be a bug?
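
For anyone who wants to script the reproduction rather than watch the raw curl output, here is a minimal sketch of a subscriber. It assumes the streaming response uses the RecordIO framing described in the Mesos operator API documentation (each record is its length in bytes as decimal text, a newline, then the JSON payload). The function names {{iter_recordio}} and {{subscribe}} are hypothetical, the {{Accept}} header is an addition that the curl command in the report does not send, and {{MASTER_IP:5050}} is the same placeholder used in the report.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON events from a RecordIO-framed byte stream.

    Assumed framing: b"<length>\n<payload>" repeated. Chunk boundaries can
    fall anywhere, so bytes are buffered until a whole record is available.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            head, sep, rest = buf.partition(b"\n")
            if not sep:
                break  # length prefix is not complete yet
            length = int(head)
            if len(rest) < length:
                break  # payload is not complete yet
            yield json.loads(rest[:length])
            buf = rest[length:]


def subscribe(master: str) -> Iterator[dict]:
    """POST a SUBSCRIBE call to http://<master>/api/v1 and yield events.

    `master` is a "host:port" string, e.g. the MASTER_IP:5050 placeholder.
    """
    request = urllib.request.Request(
        f"http://{master}/api/v1",
        data=json.dumps({"type": "SUBSCRIBE"}).encode(),
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",  # assumption: not in the original curl
        },
    )
    with urllib.request.urlopen(request) as response:
        # read1() returns whatever bytes are available; b"" signals end of stream.
        yield from iter_recordio(iter(lambda: response.read1(4096), b""))
```

Looping over {{subscribe("MASTER_IP:5050")}} and printing each event's {{type}} while starting and stopping an agent should make it easy to see whether {{AGENT_ADDED}} arrives and {{AGENT_REMOVED}} does not, as described in the report.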