[
https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023162#comment-17023162
]
Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------
Hello [~greggomann],
Thanks for taking the time to answer this ticket.
About the time needed to fix this bug, I understand. In fact, I would like
to ask whether you (and [~bmahler] or anyone else) would be willing to mentor
a new developer in the Mesos codebase. I studied the code some time ago
(because of ticket MESOS-8517) but didn't manage to contribute any code at
that time.
About the new ticket you created to fix what I reported here, do you think it's
better to close this ticket and mention it on the other one (MESOS-10089)?
I'm already watching MESOS-9556, so if I have any new suggestions or the ticket
gets any new information I will post there.
Thanks.
> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal
> state
> -------------------------------------------------------------------------------
>
> Key: MESOS-10068
> URL: https://issues.apache.org/jira/browse/MESOS-10068
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.3, 1.8.2, 1.9.1
> Reporter: Dalton Matos Coelho Barreto
> Priority: Major
> Attachments: master-full-logs.log
>
>
> Hello,
>
> Looking at the documentation of the master {{/api/v1}} endpoint, the
> {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are
> supported for this endpoint, but when a new agent joins the cluster an
> {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event
> is not received by clients subscribed to the master API.
>
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all
> using the docker image {{mesos/mesos-centos}}.
> The only time I saw an {{AGENT_REMOVED}} event was when a new agent joined the
> cluster but the master couldn't communicate with it. In this specific
> test there was a firewall blocking port {{5051}} on the slave, that is,
> nobody was able to talk to the slave on port {{5051}}.
>
> h2. Here are the steps to reproduce the problem
> * Start a new Mesos master;
> * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
> **
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1
> {noformat}
> * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
> * Stop this slave;
> * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the
> field {{active=false}};
> * Wait for the Mesos master to stop listing this slave, that is, until
> {{/slaves?slave_id=AGENT_ID}} returns an empty response.
> Even after the empty response, the {{AGENT_REMOVED}} event never reaches the
> subscriber.
>
> The Mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051
> (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051
> (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051
> (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send
> 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to
> connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent
> 2cd23025-c09d-401b-8f26-9265eda8f800-S1 {noformat}
>
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>
> I will also attach the full master logs.
>
> Do you think this could be a bug?
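
The reproduction steps quoted above watch the {{SUBSCRIBE}} stream by eye through curl. A minimal sketch of doing the same check programmatically is shown here, assuming the stream is RecordIO-framed JSON (each record preceded by its length in bytes and a newline), as I understand the operator API streaming format to be. The function names {{encode_record}}, {{iter_recordio}} and {{agent_event_types}} are hypothetical helpers written for this illustration; they are not part of Mesos.

```python
import json
from typing import Iterable, Iterator


def encode_record(event: dict) -> bytes:
    """Frame one event as '<length>\\n<json>' (helper for building sample input)."""
    body = json.dumps(event).encode("utf-8")
    return str(len(body)).encode("ascii") + b"\n" + body


def iter_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Decode length-prefixed JSON records from arbitrarily split byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while True:
            newline = buffer.find(b"\n")
            if newline == -1:
                break  # length prefix is not complete yet
            length = int(buffer[:newline])
            start = newline + 1
            if len(buffer) - start < length:
                break  # record body is not complete yet
            yield json.loads(buffer[start:start + length])
            buffer = buffer[start + length:]


def agent_event_types(events: Iterable[dict]) -> Iterator[str]:
    """Yield only the agent lifecycle event types named in this ticket."""
    for event in events:
        if event.get("type") in ("AGENT_ADDED", "AGENT_REMOVED"):
            yield event["type"]
```

To use this against a live master, the chunks would come from a streaming POST of {{{"type": "SUBSCRIBE"}}} to {{http://MASTER_IP:5050/api/v1}}, for example the body iterator of whichever HTTP client is used. If the behavior reported here is present, {{agent_event_types}} would yield {{AGENT_ADDED}} when the slave starts and nothing after it is stopped and removed.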
--
This message was sent by Atlassian Jira
(v8.3.4#803005)