Benjamin Mahler created MESOS-1058:
--------------------------------------

             Summary: Master CHECK failure: hierarchical_allocator_process.hpp:421 Check failed: !slaves.contains(slaveId)
                 Key: MESOS-1058
                 URL: https://issues.apache.org/jira/browse/MESOS-1058
             Project: Mesos
          Issue Type: Bug
          Components: master, slave
    Affects Versions: 0.17.0, 0.18.0
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler
             Fix For: 0.19.0


We've observed this CHECK failure in production when the following situation 
occurs:

1. Slave asks to register with Master.
2. Master adds slave with ID 1 and sends acknowledgement.
3. Acknowledgement to the slave is dropped due to a one-way partition.
4. Slave continues to retry.
5. Master detects socket closure on slave, marks slave as disconnected.
6. Slave did not exit, re-detects Master, and asks to Register.
7. Master::registerSlave decides to remove "old disconnected slave".
BUG: Master::removeSlave does not remove the old slave from the allocator!
8. Master::registerSlave adds slave with ID 2 and sends acknowledgement.
9. Slave receives ID 1 acknowledgement, and checkpoints.
10. Slave receives ID 2 acknowledgement, and exits due to the ID mismatch.
11. Slave recovers and attempts to re-register with checkpointed ID 1.
12. Master allows this (no Registrar yet), and attempts to add the slave to the 
allocator (because of BUG above, CHECK fails in the allocator).

The first bug here is that the Master does not remove a slave from the 
allocator in Master::removeSlave if the slave is disconnected! This was likely 
a regression introduced along with Allocator::slaveDisconnected, when we 
neglected to make the necessary update to Master::removeSlave. This is an easy 
fix.
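
To make the first bug concrete, here is a minimal toy model of steps 2, 5, 7, 8 
and 12. It is a sketch and not the Mesos code: ToyAllocator, ToySlave, 
removeSlaveBuggy, removeSlaveFixed and replay are hypothetical names invented 
for illustration, and add() returns false at the point where the real allocator 
would fail its CHECK.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for the allocator: it only tracks known slave IDs.
struct ToyAllocator {
  std::set<std::string> slaves;

  // Returns false where the real allocator would hit
  // "Check failed: !slaves.contains(slaveId)".
  bool add(const std::string& id) { return slaves.insert(id).second; }

  void remove(const std::string& id) { slaves.erase(id); }
};

// Hypothetical stand-in for the master's view of one slave.
struct ToySlave {
  std::string id;
  bool disconnected;
};

// Models the bug: a disconnected slave is never removed from the allocator.
void removeSlaveBuggy(ToyAllocator& allocator, const ToySlave& slave) {
  if (!slave.disconnected) {
    allocator.remove(slave.id);
  }
}

// Models the fix: the allocator is always told, connected or not.
void removeSlaveFixed(ToyAllocator& allocator, const ToySlave& slave) {
  allocator.remove(slave.id);
}

// Replays the scenario; returns whether the final add (step 12) succeeds.
bool replay(bool fixed) {
  ToyAllocator allocator;
  ToySlave slave{"1", false};

  bool added = allocator.add(slave.id);  // Step 2: slave added with ID 1.
  assert(added);
  (void)added;

  slave.disconnected = true;             // Step 5: marked as disconnected.

  if (fixed) {                           // Step 7: old slave is removed.
    removeSlaveFixed(allocator, slave);
  } else {
    removeSlaveBuggy(allocator, slave);
  }

  allocator.add("2");                    // Step 8: slave added with ID 2.
  return allocator.add("1");             // Step 12: re-register with ID 1.
}
```

In this model, replay(false) returns false because ID 1 is still in the 
allocator when step 12 runs, which corresponds to the CHECK failure. 
replay(true) returns true.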

The second bug is that the slave's ID became inconsistent with the Master's: 
the slave exited on the mismatch, only to re-register with the stale 
checkpointed ID. Once the first bug is fixed, the Master will allow the slave 
to re-register after having told frameworks the slave is lost. I'm tempted to 
punt on this bug, since with the Registrar this situation would be prevented: 
the re-registration would be denied. Also, we already expose this edge-case 
slave inconsistency to frameworks in other situations without the Registrar.
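
As a rough illustration of why a registry of admitted slaves would deny the 
re-registration in step 11, consider the sketch below. It assumes only that 
the Master records which slave IDs are currently admitted and rejects any 
other ID. ToyRegistry and replayWithRegistry are hypothetical names, and the 
actual Registrar design may differ.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical registry of slave IDs the master currently admits.
struct ToyRegistry {
  std::set<std::string> admitted;

  void admit(const std::string& id) { admitted.insert(id); }
  void remove(const std::string& id) { admitted.erase(id); }

  // Re-registration is allowed only for IDs that are still admitted.
  bool mayReregister(const std::string& id) const {
    return admitted.count(id) > 0;
  }
};

// Replays steps 2, 7, 8 and 11; returns the decision for ID 1 in step 11.
bool replayWithRegistry() {
  ToyRegistry registry;
  registry.admit("1");   // Step 2: slave admitted with ID 1.
  registry.remove("1");  // Step 7: old disconnected slave removed.
  registry.admit("2");   // Step 8: slave admitted with ID 2.

  // The current ID is still allowed to re-register.
  assert(registry.mayReregister("2"));

  return registry.mayReregister("1");  // Step 11: checkpointed ID 1.
}
```

Here replayWithRegistry() returns false: ID 1 was removed in step 7, so the 
attempt in step 11 is denied and never reaches the allocator.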



--
This message was sent by Atlassian JIRA
(v6.2#6252)
