[ 
https://issues.apache.org/jira/browse/MESOS-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938577#comment-13938577
 ] 

Benjamin Mahler commented on MESOS-682:
---------------------------------------

Linking in the Registrar ticket, which should fix this issue. 

Technically, with the Registrar we do not need a 'deactivated' UPID map, but I 
will take a closer look at what happened in this ticket to ensure it will 
behave correctly with a strict Registrar.

> Master should properly consolidate "slaves" and "deactivated" maps
> ------------------------------------------------------------------
>
>                 Key: MESOS-682
>                 URL: https://issues.apache.org/jira/browse/MESOS-682
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.13.0, 0.14.0
>            Reporter: Vinod Kone
>            Assignee: Benjamin Mahler
>              Labels: twitter
>             Fix For: 0.19.0
>
>
> Currently, the master keeps track of active slaves with the "slaves" map and 
> deactivated slaves with the "deactivated" map. While the former is indexed on 
> SlaveID, the latter is indexed on pid. This could lead to inconsistencies 
> regarding the state of the slaves.
> We have seen this in production at Twitter. 
> Slave was given id 201308072143-2082809866-5050-35234-5186 at 16:35:59. After 
> ~22 minutes master removed the slave, presumably because of network 
> partition. The slave received shutdown and restarted at 17:08:01. It then 
> registered with the master at 17:08:31 and got a new id 
> 201308072143-2082809866-5050-35234-5193. But then it was immediately 
> considered "disconnected" (not sure why) by the master and removed. When the 
> slave came back up it got yet another id, 
> 201308072143-2082809866-5050-35234-5194.
> The surprising bit is that at 17:08:32 the master got another re-register 
> message (probably backed up somewhere in the network?) from the same slave 
> with the old id 201308072143-2082809866-5050-35234-5186. Since this id didn't 
> exist in the master's slaves map, the master thought it was a new slave and 
> added it. When the slave got the ack for this re-registration message it 
> committed suicide (as expected) because the id it received was unexpected. 
> The master then removed the slave with id 
> 201308072143-2082809866-5050-35234-5186 from its slaves map based on the pid. 
> Note that this choice was completely arbitrary, because the master could just 
> as well have removed the slave with id 
> 201308072143-2082809866-5050-35234-5194 from its map. This is because the 
> master just loops through all entries in "slaves" and picks the first one 
> that matches the pid.
> At this point the slave's pid was added to "deactivated" but there exists a 
> slave (201308072143-2082809866-5050-35234-5194) in the slaves map with the 
> same pid!
> When it eventually received a status update from the slave, the master 
> crashed (as expected) because the message was from a slave whose pid was in 
> "deactivated" while the slave was still present in "slaves".
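
For illustration only, the failure mode in the description above can be 
reproduced with a small toy model. This is a Python sketch, not Mesos code: 
the ToyMaster class, its method names, and the pid string are all 
hypothetical, and only the two slave ids are taken from the ticket. It shows 
that removing "the first entry that matches the pid" leaves the same pid in 
both maps once two slave ids share a pid.

```python
from dataclasses import dataclass, field

# Slave ids from the ticket; the pid value is a made-up placeholder.
ID_5186 = "201308072143-2082809866-5050-35234-5186"
ID_5194 = "201308072143-2082809866-5050-35234-5194"
PID = "slave(1)@10.0.0.1:5051"  # hypothetical; same for both ids


@dataclass
class ToyMaster:
    """Toy stand-in for the two maps in the ticket (assumed shapes)."""
    slaves: dict = field(default_factory=dict)      # slave id -> pid
    deactivated: set = field(default_factory=set)   # pids only

    def register(self, slave_id, pid):
        # An unknown id is treated as a brand new slave.
        self.slaves[slave_id] = pid

    def remove_by_pid(self, pid):
        # Pick the first entry in "slaves" whose pid matches, remove it,
        # and record the pid (not the id) as deactivated.
        match = next((sid for sid, p in self.slaves.items() if p == pid), None)
        if match is not None:
            del self.slaves[match]
            self.deactivated.add(pid)
        return match

    def inconsistent_pids(self):
        # pids that are deactivated yet still attached to an active slave id
        return self.deactivated & set(self.slaves.values())


def replay():
    master = ToyMaster()
    master.register(ID_5194, PID)   # slave's current id
    master.register(ID_5186, PID)   # stale re-register with the old id
    removed = master.remove_by_pid(PID)
    return master, removed


if __name__ == "__main__":
    master, removed = replay()
    print("removed:", removed)
    print("still in slaves:", list(master.slaves))
    print("pids in both maps:", master.inconsistent_pids())
```

Which id gets removed depends only on iteration order (insertion order for a 
Python dict, so this sketch happens to remove the other id than the one in 
the ticket). Either way one id stays in "slaves" while its pid sits in 
"deactivated", which is the state that led to the crash.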



--
This message was sent by Atlassian JIRA
(v6.2#6252)
