[jira] [Created] (MESOS-450) The master should shut down slaves upon removal.

Benjamin Mahler (JIRA) Tue, 23 Apr 2013 18:40:14 -0700

Benjamin Mahler created MESOS-450:
-------------------------------------

             Summary: The master should shut down slaves upon removal.
                 Key: MESOS-450
                 URL: https://issues.apache.org/jira/browse/MESOS-450
             Project: Mesos
          Issue Type: Bug
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler



The following scenario was observed in a production environment.

1. The slave terminated due to a machine reboot. The last message from the 
slave prior to dying:
    I0419 20:04:47.483470 38030 cgroups_isolation_module.cpp:774] Updated 
'memory.soft_limit_in_bytes' to 134217728 for <executor> of framework 
<framework>

2. The master then deactivated the slave due to a health check failure:
    W0419 20:06:15.952864 28740 master.cpp:1172] Removing slave 
201303271650-1944527370-5050-24955-2642 at <slave> because it has been 
deactivated
    I0419 20:06:15.956354 28740 master.cpp:1181] Master now considering a slave 
at <slave> as inactive

3. The slave came back up and registered 4 minutes later:
    I0419 20:10:19.693114  8181 main.cpp:131] Starting Mesos slave
    I0419 20:10:31.314707  8201 slave.cpp:398] Registered with master; given 
slave ID 201304011727-2230002186-5050-28738-377

4. The master added the slave correctly (the slave had some registration 
retries):
    I0419 20:10:22.171288 28742 master.cpp:924] Attempting to register slave on 
<slave> at slave(1)@<ip>
    I0419 20:10:22.171336 28742 master.cpp:1159] Master now considering a slave 
at <slave> as active
    I0419 20:10:22.171361 28742 master.cpp:1735] Adding slave 
201304011727-2230002186-5050-28738-377 at <slave> with cpus=22; mem=21962; 
ports=[31000-32000]; disk=400000
    I0419 20:10:22.171661 28739 hierarchical_allocator_process.hpp:393] Added 
slave 201304011727-2230002186-5050-28738-377 (<slave>) with cpus=22; mem=21962; 
ports=[31000-32000]; disk=400000 (and cpus=22; mem=21962; ports=[31000-32000]; 
disk=400000 available)
...
    I0419 20:10:23.171977 28754 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:24.172746 28752 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:25.173225 28747 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:26.174988 28740 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:27.176584 28739 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:28.178421 28754 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:29.179831 28740 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement
    ...
    I0419 20:10:31.312396 28750 master.cpp:913] Slave 
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending 
acknowledgement

5. Then, ~6 minutes after the original slave went down, the master received an 
exited event from the slave PID. It's highly likely this was a delayed signal 
for the old slave:
    I0419 20:10:31.313608 28750 master.cpp:518] Slave 
201304011727-2230002186-5050-28738-377(<slave>) disconnected


This stems from the fact that we do not know the SlaveID for the exited event, 
so we incorrectly remove the slave. One fix for this is to tell the slave to 
shut down when we remove it.

Without the proposed fix, the slave remained in a state of thinking it is 
registered, but the master disagrees.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (MESOS-450) The master should shut down slaves upon removal.

Reply via email to