Benjamin Mahler created MESOS-450:
-------------------------------------
Summary: The master should shut down slaves upon removal.
Key: MESOS-450
URL: https://issues.apache.org/jira/browse/MESOS-450
Project: Mesos
Issue Type: Bug
Reporter: Benjamin Mahler
Assignee: Benjamin Mahler
The following scenario was observed in a production environment.
1. The slave terminated due to a machine reboot. The last message from the
slave prior to dying:
I0419 20:04:47.483470 38030 cgroups_isolation_module.cpp:774] Updated
'memory.soft_limit_in_bytes' to 134217728 for <executor> of framework
<framework>
2. The master then deactivated the slave due to a health check failure:
W0419 20:06:15.952864 28740 master.cpp:1172] Removing slave
201303271650-1944527370-5050-24955-2642 at <slave> because it has been
deactivated
I0419 20:06:15.956354 28740 master.cpp:1181] Master now considering a slave
at <slave> as inactive
3. The slave came back up and registered 4 minutes later:
I0419 20:10:19.693114 8181 main.cpp:131] Starting Mesos slave
I0419 20:10:31.314707 8201 slave.cpp:398] Registered with master; given
slave ID 201304011727-2230002186-5050-28738-377
4. The master added the slave correctly (the slave had some registration
retries):
I0419 20:10:22.171288 28742 master.cpp:924] Attempting to register slave on
<slave> at slave(1)@<ip>
I0419 20:10:22.171336 28742 master.cpp:1159] Master now considering a slave
at <slave> as active
I0419 20:10:22.171361 28742 master.cpp:1735] Adding slave
201304011727-2230002186-5050-28738-377 at <slave> with cpus=22; mem=21962;
ports=[31000-32000]; disk=400000
I0419 20:10:22.171661 28739 hierarchical_allocator_process.hpp:393] Added
slave 201304011727-2230002186-5050-28738-377 (<slave>) with cpus=22; mem=21962;
ports=[31000-32000]; disk=400000 (and cpus=22; mem=21962; ports=[31000-32000];
disk=400000 available)
...
I0419 20:10:23.171977 28754 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:24.172746 28752 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:25.173225 28747 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:26.174988 28740 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:27.176584 28739 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:28.178421 28754 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:29.179831 28740 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
...
I0419 20:10:31.312396 28750 master.cpp:913] Slave
201304011727-2230002186-5050-28738-377 (<slave>) already registered, resending
acknowledgement
5. Then, ~6 minutes after the original slave went down, the master received an
exited event from the slave PID. It's highly likely this was a delayed signal
for the old slave:
I0419 20:10:31.313608 28750 master.cpp:518] Slave
201304011727-2230002186-5050-28738-377(<slave>) disconnected
This stems from the fact that we do not know the SlaveID for the exited event,
so we incorrectly remove the slave. One fix for this is to tell the slave to
shut down when we remove it.
Without the proposed fix, the slave remained in a state of thinking it is
registered, but the master disagrees.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira