[ https://issues.apache.org/jira/browse/MESOS-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938577#comment-13938577 ]
Benjamin Mahler commented on MESOS-682:
---------------------------------------

Linking in the Registrar ticket, which should fix this issue. Technically, with the Registrar we do not need a 'deactivated' UPID map, but I will take a closer look at what happened in this ticket to ensure it will behave correctly with a strict Registrar.

> Master should properly consolidate "slaves" and "deactivated" maps
> ------------------------------------------------------------------
>
>                 Key: MESOS-682
>                 URL: https://issues.apache.org/jira/browse/MESOS-682
>             Project: Mesos
>          Issue Type: Bug
>   Affects Versions: 0.13.0, 0.14.0
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>              Labels: twitter
>             Fix For: 0.19.0
>
>
> Currently, the master keeps track of active slaves with the "slaves" map and
> deactivated slaves with the "deactivated" map. While the former is indexed on
> SlaveID, the latter is indexed on pid. This could lead to inconsistencies
> regarding the state of the slaves.
>
> We have seen this in production at Twitter.
>
> Slave was given id 201308072143-2082809866-5050-35234-5186 at 16:35:59. After
> ~22 minutes the master removed the slave, presumably because of a network
> partition. The slave received a shutdown and restarted at 17:08:01. It then
> registered with the master at 17:08:31 and got a new id
> 201308072143-2082809866-5050-35234-5193. But then it was immediately
> considered "disconnected" (not sure why) by the master and removed. When the
> slave came back up it got yet another id
> 201308072143-2082809866-5050-35234-5194.
>
> The surprising bit is that at 17:08:32 the master got another re-register
> message (probably backed up somewhere in the network?) from the same slave
> with the old id 201308072143-2082809866-5050-35234-5186. Since this id didn't
> exist in the master's slaves map, the master thought it was a new slave and
> added it.
>
> When the slave got the ack for this re-registration message it committed
> suicide (as expected) because the id it received was unexpected. Now the
> master removed the slave with id 201308072143-2082809866-5050-35234-5186 from
> its slaves map based on the pid. Note that this was completely arbitrary,
> because the master could just as well have removed the slave id
> 201308072143-2082809866-5050-35234-5194 from its map. This is because the
> master just loops through all entries in "slaves" and picks the first one
> that matches the pid.
>
> At this point the slave's pid was added to "deactivated" but there exists a
> slave (201308072143-2082809866-5050-35234-5194) in the slaves map with the
> same pid!
>
> When it eventually received a status update from the slave, the master
> crashed (as expected) because the message was from a slave whose pid is in
> "deactivated" but present in "slaves".
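
To make the inconsistency concrete, here is a minimal C++ sketch of the bookkeeping described in the issue. It is an illustration only, not the actual Mesos code: the `Master` struct, its methods, `replayIncident`, and the pid string are all hypothetical, and a sorted `std::map` stands in for whatever container the master really uses, so the "first match" here is simply the smallest SlaveID.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified model of the master's bookkeeping.
// These are NOT the real Mesos types.
struct Master {
  std::map<std::string, std::string> slaves;  // SlaveID -> pid (active)
  std::set<std::string> deactivated;          // pids of deactivated slaves

  void add(const std::string& id, const std::string& pid) { slaves[id] = pid; }

  // Removes the first entry in "slaves" whose pid matches and records the
  // pid as deactivated. Returns the removed SlaveID, or "" if none matched.
  std::string removeByPid(const std::string& pid) {
    for (auto it = slaves.begin(); it != slaves.end(); ++it) {
      if (it->second == pid) {
        std::string id = it->first;
        slaves.erase(it);
        deactivated.insert(pid);
        return id;
      }
    }
    return "";
  }

  // Invariant the master relies on: no pid is both active and deactivated.
  bool consistent() const {
    for (const auto& entry : slaves) {
      if (deactivated.count(entry.second) > 0) return false;
    }
    return true;
  }
};

// Replays the end of the incident: two SlaveIDs share one pid, and the
// master then removes "the slave" by pid.
Master replayIncident() {
  const std::string pid = "slave(1)@10.0.0.1:5051";  // made-up pid
  Master m;
  m.add("201308072143-2082809866-5050-35234-5194", pid);  // latest id
  m.add("201308072143-2082809866-5050-35234-5186", pid);  // stale re-register
  assert(m.consistent());  // nothing is deactivated yet
  m.removeByPid(pid);      // drops only one of the two entries
  return m;
}
```

After `replayIncident()`, the entry for ...-5194 is still in `slaves` while its pid is also in `deactivated`, so `consistent()` returns false. That is the state in which a later status update trips the master's check. Keying both collections on SlaveID, or dropping the pid-keyed map entirely as suggested above for the Registrar, would rule this state out.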