[ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026115#comment-17026115 ]
Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

Thanks for your availability [~greggomann], even with very limited time. I appreciate it.

I will try to organize myself so I can dedicate some time to the project. Then, when (if) I have a better understanding of this part of the code, I can reach out to you and we can talk in more detail about this issue.

Thanks.

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal
> state
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-10068
>                 URL: https://issues.apache.org/jira/browse/MESOS-10068
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.3, 1.8.2, 1.9.1
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Major
>         Attachments: master-full-logs.log
>
>
> Hello,
>
> Looking at the documentation of the master {{/api/v1}} endpoint, the
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are
> supported for this endpoint, but when a new agent joins the cluster an
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event
> is not received by clients subscribed to the master API.
>
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all
> using the docker image {{mesos/mesos-centos}}.
> The only time I saw an {{AGENT_REMOVED}} event was when a new agent joined
> the cluster but the master couldn't communicate with it. In that specific
> test there was a firewall blocking port {{5051}} on the slave, that is,
> nobody was able to talk to the slave on port {{5051}}.
>
> h2. Here are the steps to reproduce the problem
> * Start a new Mesos master;
> * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
> **
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
> {noformat}
> * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
> * Stop this slave;
> * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the
> field {{active=false}};
> * Wait for the Mesos master to stop listing this slave, that is, until
> {{/slaves?slave_id=AGENT_ID}} returns an empty response.
>
> Even after the empty response, the event never reaches the subscriber.
>
> The Mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host
> {noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1
> {noformat}
>
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>
> I will attach the full master logs as well.
>
> Do you think this could be a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
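
For anyone who wants to script the subscriber side of the reproduction steps in the quoted report instead of watching the curl output by hand, here is a minimal Python sketch. It assumes the master streams events in RecordIO framing (a decimal byte length, a newline, then a JSON payload of that length), as described in the Mesos operator API documentation. The function names ({{read_records}}, {{wait_for_event}}, {{subscribe}}, {{encode_record}}) are hypothetical helpers, not part of Mesos, and {{MASTER_IP}} is the same placeholder used in the curl command. It has not been run against the affected versions.

```python
import io
import json
import urllib.request


def encode_record(event):
    """Frame one event as RecordIO: b"<length>\\n<payload>" (used for offline tests)."""
    payload = json.dumps(event).encode()
    return str(len(payload)).encode() + b"\n" + payload


def read_records(stream):
    """Yield decoded JSON events from a RecordIO-framed binary stream.

    Assumes each record is "<length>\\n<payload>", where <length> is the
    payload size in bytes written as ASCII decimal digits.
    """
    while True:
        header = stream.readline()
        if not header:
            return  # stream closed by the peer
        length = int(header.strip())
        payload = stream.read(length)
        if len(payload) < length:
            return  # truncated record, e.g. the connection dropped
        yield json.loads(payload)


def wait_for_event(stream, wanted_type):
    """Return the first event whose "type" equals wanted_type, or None at end of stream."""
    for event in read_records(stream):
        if event.get("type") == wanted_type:
            return event
    return None


def subscribe(master_url):
    """POST a SUBSCRIBE call to <master_url>/api/v1 and return the open response.

    master_url is a placeholder such as "http://MASTER_IP:5050".
    """
    request = urllib.request.Request(
        master_url + "/api/v1",
        data=json.dumps({"type": "SUBSCRIBE"}).encode(),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(request)
```

Calling {{wait_for_event(subscribe("http://MASTER_IP:5050"), "AGENT_REMOVED")}} after stopping the slave would block until the event arrives or the master closes the stream. With the behavior described in the report, it would be expected to keep blocking.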