[
https://issues.apache.org/jira/browse/MESOS-682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Benjamin Mahler resolved MESOS-682.
-----------------------------------
Resolution: Fixed
Ok, reading closely through what happened here, this situation will be
prevented when using a Master with a strict registry:
{quote}
The surprising bit is that at 17:08:32 it got another re-register message
(probably backed up somewhere in the network?) from the same slave with the old
id 201308072143-2082809866-5050-35234-5186. Since this id doesn't exist in the
master's slaves map, master thought it was a new slave and added it.
{quote}
With a strict registry, the readmission attempt for the already-removed slave
id would have been rejected instead of being treated as a new slave.
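As a rough illustration of that point, here is a minimal sketch of how a strict
registry could reject the late re-registration. The Registry class and its
admit/remove/readmit methods are hypothetical names made up for this example;
they are not the actual Mesos registrar API, and the real semantics may differ
in detail.

```python
class Registry:
    """Toy stand-in for a master's slave registry (hypothetical, not Mesos code)."""

    def __init__(self, strict):
        self.strict = strict
        self.admitted = set()  # slave ids currently admitted

    def admit(self, slave_id):
        # A brand new slave registers for the first time.
        if slave_id in self.admitted:
            return False
        self.admitted.add(slave_id)
        return True

    def remove(self, slave_id):
        # The master removes a slave, e.g. after a suspected partition.
        self.admitted.discard(slave_id)

    def readmit(self, slave_id):
        # A slave re-registers with an id it was given earlier.
        if slave_id in self.admitted:
            return True
        if self.strict:
            # Assumption: strict mode refuses ids that are not admitted.
            return False
        # Non-strict mode accepts the unknown id as if it were new.
        self.admitted.add(slave_id)
        return True


OLD_ID = "201308072143-2082809866-5050-35234-5186"

strict = Registry(strict=True)
strict.admit(OLD_ID)
strict.remove(OLD_ID)
strict_result = strict.readmit(OLD_ID)    # delayed re-register message

lenient = Registry(strict=False)
lenient.admit(OLD_ID)
lenient.remove(OLD_ID)
lenient_result = lenient.readmit(OLD_ID)  # same message, non-strict mode
```

In the sketch, only the non-strict registry ends up holding the old id again,
which corresponds to the behavior quoted above.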
> Master should properly consolidate "slaves" and "deactivated" maps
> ------------------------------------------------------------------
>
> Key: MESOS-682
> URL: https://issues.apache.org/jira/browse/MESOS-682
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.13.0, 0.14.0
> Reporter: Vinod Kone
> Assignee: Benjamin Mahler
> Labels: twitter
> Fix For: 0.19.0
>
>
> Currently, the master keeps track of active slaves with the "slaves" map and
> deactivated slaves with the "deactivated" map. While the former is indexed on
> SlaveID, the latter is indexed on pid. This could lead to inconsistencies
> regarding the state of the slaves.
> We have seen this in production at Twitter.
> Slave was given id 201308072143-2082809866-5050-35234-5186 at 16:35:59. After
> ~22 minutes master removed the slave, presumably because of network
> partition. The slave received shutdown and restarted at 17:08:01. It then
> registered with the master at 17:08:31 and got a new id
> 201308072143-2082809866-5050-35234-5193. But then it was immediately
> considered "disconnected" (not sure why) by the master and removed. When the
> slave came back up it got yet another id
> 201308072143-2082809866-5050-35234-5194.
> The surprising bit is that at 17:08:32 it got another re-register message
> (probably backed up somewhere in the network?) from the same slave with the
> old id 201308072143-2082809866-5050-35234-5186. Since this id doesn't exist
> in the master's slaves map, master thought it was a new slave and added it.
> When the slave got the ack for this re-registration message it committed
> suicide (as expected) because the id it received was unexpected. Now the
> master removed the slave with id 201308072143-2082809866-5050-35234-5186 from
> its slaves map based on the pid. Note that this was completely arbitrary, because
> the master could just as well have removed the slave id
> 201308072143-2082809866-5050-35234-5194 from its map. This is because the
> master just loops through all entries in "slaves" and picks the first one
> that matches the pid.
> At this point the slave's pid was added to "deactivated" but there exists a
> slave (201308072143-2082809866-5050-35234-5194) in the slaves map with the
> same pid!
> When it eventually received a status update from the slave, the master
> crashed (as expected) because the message was from a slave whose pid is in
> "deactivated" but present in "slaves".
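The inconsistency described in the issue can be reproduced with a small sketch.
This is only an illustration under stated assumptions: the two containers, the
remove_by_pid helper, and the pid string are hypothetical and do not come from
the Mesos source; the insertion order is chosen to match the reported outcome,
whereas the order in the real master's map was arbitrary.

```python
ID_5186 = "201308072143-2082809866-5050-35234-5186"
ID_5194 = "201308072143-2082809866-5050-35234-5194"
PID = "slave(1)@10.0.0.1:5051"  # hypothetical pid shared by both entries

slaves = {}          # active slaves, keyed by SlaveID -> pid
deactivated = set()  # deactivated slaves, keyed by pid only


def remove_by_pid(pid):
    """Remove the first entry in `slaves` whose pid matches, as described above."""
    slave_id = next((sid for sid, p in slaves.items() if p == pid), None)
    if slave_id is not None:
        del slaves[slave_id]
        deactivated.add(pid)
    return slave_id


def is_inconsistent():
    """True if some active slave's pid is also marked as deactivated."""
    return any(p in deactivated for p in slaves.values())


# Two slave ids end up registered for the same pid.
slaves[ID_5186] = PID  # late re-register message carrying the old id
slaves[ID_5194] = PID  # id handed out after the restart

removed = remove_by_pid(PID)      # first match wins
inconsistent = is_inconsistent()  # the surviving entry shares the pid
```

Whichever entry is removed first, the other one stays in "slaves" while its pid
sits in "deactivated", which is the state that made the master crash on the
next status update.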