[ https://issues.apache.org/jira/browse/MESOS-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Rukletsov updated MESOS-4315:
---------------------------------------
    Description:
The Quota failover logic introduced with MESOS-3865 changes the master failover recovery significantly if at least one quota is set.
Now, if upon recovery any previously set quota has been detected, the allocator enters recovery mode, during which the allocator does not issue offers. The recovery mode, and therefore offer suspension, ends when either:
* a certain number of agents reregister (by default 80% of agents known before the failover), or
* a timeout expires (by default 10 minutes).

We could also safely exit the recovery mode once all quotas have been satisfied (i.e. all agents participating in satisfying quota have reconnected). For small clusters with a large percentage of quota'ed resources this will not make much difference compared to the existing rules. But for larger clusters this condition could be fulfilled much faster than the 80% condition.
We should at least consider whether such a condition is worth the added complexity.

  was:
The Quota failover logic introduced with MESOS-3865 changes the the master failover recovery changes significantly if at least one quota is set.
Now, if upon recovery any previously set quota have been detected, the allocator enters recovery mode, during which the allocator does not issue offers. The recovery mode, and therefore offer suspension, ends when either:
* A certain amount of agents reregisters (by default 80% of agents known before the failover),
* a timeout expires (by default 10 minutes).

We could also safely exit the recovery mode, once all quota has been satisfied (i.e. all agents participating in satisfying quota have reconnected). For small clusters a large percentage of quota'ed resources this will not make too much difference compared to the existing rules. But for larger clusters this condition could be fulfilled much faster than the 80% condition.
We should at least consider whether such condition is worth the added complexity.

> Improve Quota Failover Logic
> ----------------------------
>
>                 Key: MESOS-4315
>                 URL: https://issues.apache.org/jira/browse/MESOS-4315
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Joerg Schad
>
> The Quota failover logic introduced with MESOS-3865 changes the master
> failover recovery significantly if at least one quota is set.
> Now, if upon recovery any previously set quota has been detected, the
> allocator enters recovery mode, during which the allocator does not issue
> offers. The recovery mode, and therefore offer suspension, ends when either:
> * a certain number of agents reregister (by default 80% of agents known
> before the failover), or
> * a timeout expires (by default 10 minutes).
> We could also safely exit the recovery mode once all quotas have been
> satisfied (i.e. all agents participating in satisfying quota have
> reconnected). For small clusters with a large percentage of quota'ed
> resources this will not make much difference compared to the existing
> rules. But for larger clusters this condition could be fulfilled much
> faster than the 80% condition.
> We should at least consider whether such a condition is worth the added
> complexity.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
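
For illustration only, the two existing exit conditions and the proposed third one can be sketched as below. This is not the Mesos allocator code: the function name `should_exit_recovery`, its parameters, and the idea of tracking the set of agents that carried quota'ed resources before the failover (`quota_agents`) are assumptions made for this sketch. Only the default values (80% of agents, 10 minutes) come from the issue text.

```python
def should_exit_recovery(known_agents, reregistered, elapsed_secs,
                         quota_agents=None,
                         agent_fraction=0.8, timeout_secs=600.0):
    """Return the reason recovery mode may end, or None to stay in it.

    known_agents  -- set of agent IDs known before the failover
    reregistered  -- set of agent IDs that have reregistered so far
    elapsed_secs  -- seconds since the allocator entered recovery mode
    quota_agents  -- hypothetical: agent IDs that were satisfying quota
                     before the failover; None disables the proposed rule
    """
    # Existing rule: enough of the previously known agents are back.
    back = len(reregistered & known_agents)
    if back >= agent_fraction * len(known_agents):
        return "agent_threshold"

    # Existing rule: the recovery timeout has expired.
    if elapsed_secs >= timeout_secs:
        return "timeout"

    # Proposed rule: every agent that participated in satisfying quota
    # has reconnected, so all quotas can be satisfied again.
    if quota_agents is not None and quota_agents <= reregistered:
        return "quota_satisfied"

    return None


# A large cluster where only a small share of agents carries quota.
agents = {f"agent-{i}" for i in range(1000)}
quota = {f"agent-{i}" for i in range(50)}
back = {f"agent-{i}" for i in range(250)}  # 25% reregistered after 30 s

print(should_exit_recovery(agents, back, 30.0))         # None
print(should_exit_recovery(agents, back, 30.0, quota))  # quota_satisfied
```

In this example the 80% threshold is still far away, yet the proposed condition already holds because all 50 quota-carrying agents are among the 250 that have reregistered. This is the large-cluster case the description refers to.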